Monitoring Microservices with OpenShift, Hawkular Metrics and Grafana

A blog post by Joel Takvorian

grafana | metrics | microservice | openshift | vertx

OpenShift, Hawkular Metrics and Grafana are three great tools that we can combine to build a powerful monitoring system for microservices.

Grafana is the visualization layer that enables building custom and dynamic dashboards.
Hawkular Metrics is the robust and flexible layer that is used for metrics storage and querying.
And OpenShift 3 is the new container platform from Red Hat built on top of Docker and Kubernetes.

Let’s see how to make them work together and visualize our microservices metrics in Grafana.

Preparing OpenShift

I assume you already have access to a running OpenShift Origin (1.0.8 or later), but if you don’t, you can follow the installation steps. Or if you’re a lazy boy/girl like me, you can use MiniShift to get it running very quickly. Just make sure you assign enough memory and CPU resources to the VM. I got everything running fine with 2 cores and 4 GB, though that will depend on your hardware.

Then follow the origin-metrics installation steps to deploy Hawkular Metrics in OpenShift.

Note that Grafana will not work on an insecure SSL connection to Hawkular. So you will have to provide your own certificates when creating the secret:

$ oc secrets new metrics-deployer hawkular-metrics.pem=/path/to/hm.pem \
hawkular-metrics-ca.cert=/path/to/hm-ca.cert -n openshift-infra

Deploying a Microservice

You will need to deploy some application in OpenShift. For this article I’ll use some parts of the Red Hat Helloworld MSA to deploy "Aloha", and later "Bonjour".

For example:

$ oc login -u ... -p ...
# The project name you choose here matters, because it will be the tenant ID in Hawkular
$ oc new-project test
$ git clone https://github.com/redhat-helloworld-msa/aloha
$ cd aloha/
$ oc new-build --binary --name=aloha -l app=aloha
$ mvn package && oc start-build aloha --from-dir=. --follow
$ oc new-app aloha -l app=aloha,hystrix.enabled=true
$ oc expose service aloha

Open the OpenShift web console to check that Aloha is correctly deployed, then scale it up to 2 pods. The web console should then show Aloha running with 2 pods.

Set Up the Hawkular Datasource in Grafana

So now, let’s have a look at Grafana. For this article I’ve installed a recent version (3.1.1) and the Hawkular Datasource plugin (1.0.3).

In Grafana, to configure the Hawkular Datasource, set the URL of the Metrics service in OpenShift, ending with /hawkular/metrics. You should have something similar to https://metrics.192.168.42.63.xip.io/hawkular/metrics. Access mode must be Proxy. The tenant must be the name of the project in OpenShift where we’ve created our sample application; so for my example here, it’s just "test". And finally, enter the bearer token for the OpenShift API access. To quickly generate a short-lived token, you can navigate to /oauth/token/request from your OpenShift base path (something like https://192.168.42.63:8443/oauth/token/request). For long-term usage you should set up a service account instead.
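
If you want to double-check those settings outside of Grafana, here is a minimal Python sketch of what they boil down to. The function name datasource_settings and the returned dict layout are my own assumptions for illustration; only the URL suffix and the two header names come from this article.

```python
from urllib.parse import urlsplit


def datasource_settings(url, tenant, token):
    """Check the Metrics URL and return the settings as a plain dict.

    Hypothetical helper: this is not part of Grafana or Hawkular.
    """
    path = urlsplit(url).path.rstrip("/")
    if not path.endswith("/hawkular/metrics"):
        raise ValueError("URL must end with /hawkular/metrics")
    return {
        "url": url.rstrip("/"),
        "headers": {
            # The tenant is the OpenShift project name
            "Hawkular-Tenant": tenant,
            # Bearer token for the OpenShift API access
            "Authorization": "Bearer " + token,
        },
    }


settings = datasource_settings(
    "https://metrics.192.168.42.63.xip.io/hawkular/metrics", "test", "your-bearer-token"
)
print(settings["headers"]["Hawkular-Tenant"])
```

The same URL and headers can then be reused by any HTTP client to query the Metrics service directly.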

Create the Dashboard

Now we can create a new dashboard, along with a template variable, which will be the cornerstone of our dynamic dashboards.

Open the dashboard templating screen.

Create a variable named app, select your Hawkular datasource and type in query: tags/container_name:*

As you can guess, it will search all known values of the tag container_name. Once you’ve typed the query, you should see your application names displayed at the bottom of the page.

Ticking the Multi-value and Include All options is recommended, as it enables better filtering.

With this single variable, you will already be able to build a nice dynamic dashboard.

Tip

If you want to see the list of available tags created in OpenShift, you can fetch the metrics definitions with the following command:

curl -X GET https://yourserver/hawkular/metrics/metrics -H "Content-Type: application/json" -H "Hawkular-Tenant: your-tenant" -H "Authorization: Bearer your-bearer-token"
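
Once you have that JSON, a few lines of Python are enough to list the tags and their values. This is only a sketch: I assume the response is a list of metric definitions, each carrying a "tags" object, and the sample below is hypothetical data rather than real output.

```python
from collections import defaultdict

# Hypothetical sample, shaped like the list of metric definitions
# assumed to be returned by the command above
SAMPLE_DEFINITIONS = [
    {"id": "pod-1/memory/usage", "type": "gauge",
     "tags": {"container_name": "aloha", "descriptor_name": "memory/usage"}},
    {"id": "pod-2/memory/usage", "type": "gauge",
     "tags": {"container_name": "aloha", "descriptor_name": "memory/usage"}},
    {"id": "pod-1/cpu/usage", "type": "counter",
     "tags": {"container_name": "aloha", "descriptor_name": "cpu/usage"}},
]


def tag_values(definitions):
    """Map each tag name to the sorted list of its distinct values."""
    values = defaultdict(set)
    for definition in definitions:
        for name, value in definition.get("tags", {}).items():
            values[name].add(value)
    return {name: sorted(found) for name, found in values.items()}


for name, found in tag_values(SAMPLE_DEFINITIONS).items():
    print(name, found)
```

The values listed for container_name are the ones the tags/container_name:* query is expected to offer in the app variable.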

Now, time to see nice charts! In the first row, we create a new panel of type Graph. In the General tab, set "Memory usage" as the graph title. In the Metrics tab, select your Hawkular datasource. Create a series of type Gauge, searching by tag, and give the following tags:

container_name: $app
descriptor_name: memory/usage

These tagged metrics are provided to Hawkular by OpenShift.

From now on, you can see the memory usage by pod instance and for all your applications. There are 2 series because we scaled the service to 2 pods earlier.

You can edit the Graph Axes to set Y-Unit "Data > Bytes" and Y-min 0. Note that you can also show stacked values from the Display tab. This is useful when you want to see the total amount of memory.

Next, we will show some aggregated stats. But before that, go back to editing the Graph panel and, in the General tab, set span to 6.

From the Row menu create a new Singlestat panel. Name it "Average, all pods", span 2, height 100px. In the Metrics tab, select your Hawkular datasource. Search by tags using the same tags as before: container_name: $app, descriptor_name: memory/usage.

As you can see, there are some new options when querying from a Singlestat panel: Multiple series aggregation and Time aggregation.

When a standard query by tag is submitted to Hawkular Metrics, the response may contain several series of data points, depending on the matching tags. Since we want a single number here, we will ask the plugin to perform a two-step aggregation: a "vertical" aggregation that merges all series into a single one, and a "horizontal" aggregation that extracts a single stat from a time series.
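
To make the two steps concrete, here is a small Python sketch of the idea. It is not the plugin’s code: the function names are mine, the data points are made up, and I assume all series share the same timestamps.

```python
from statistics import mean


def merge_series(series_list, combine=sum):
    """Vertical step: combine the values that share a timestamp."""
    by_timestamp = {}
    for series in series_list:
        for timestamp, value in series:
            by_timestamp.setdefault(timestamp, []).append(value)
    return [(ts, combine(values)) for ts, values in sorted(by_timestamp.items())]


def reduce_over_time(series, reducer=mean):
    """Horizontal step: extract a single stat from one time series."""
    return reducer([value for _, value in series])


# Made-up memory usage for two pods, as (timestamp, value) pairs
pod_a = [(0, 100), (60, 110), (120, 120)]
pod_b = [(0, 200), (60, 190), (120, 210)]

merged = merge_series([pod_a, pod_b])  # [(0, 300), (60, 300), (120, 330)]
print(reduce_over_time(merged))
```

Swapping the combine and reducer arguments gives the other combinations, for example mean across pods followed by max over time.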

As you can see, when the Multiple series aggregation is left at None, Grafana reports a Multiple Series Error. This is because the Singlestat panel doesn’t know how to merge multiple series, so let’s ask the Hawkular plugin to do it. Select Sum instead of None, and keep Average for the time aggregation.

Now switch to the Options tab and change Unit to "Data > bytes". Here, you can define thresholds to highlight high memory levels.

Note

Usually, Grafana’s Singlestat panel performs time aggregation by itself, through the value field on the Big value row. But since the Hawkular plugin already does it, whatever value you set in this field has no effect.

To finalize this dashboard setup for an application, click on the Singlestat panel title and duplicate it 5 times. Edit each of the duplicates with the following names and queries:

"Max, all pods": set Time aggregation to Max
"Live, all pods": set Time aggregation to Live
"Average per pod": set Multiple series aggregation to Average
"Max per pod": set Multiple series aggregation to Average and Time aggregation to Max
"Live per pod": set Multiple series aggregation to Average and Time aggregation to Live

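If it helps to see what each of the six panels is expected to compute, here is a table-driven Python sketch. The numbers are made up, the pods are assumed to report at the same instants, and I take "Live" to mean the latest data point, which is my reading rather than something the plugin documentation confirms here.

```python
from statistics import mean


def last(values):
    """My assumption for "Live": the most recent value."""
    return values[-1]


# Panel name -> (Multiple series aggregation, Time aggregation)
PANELS = {
    "Average, all pods": (sum, mean),
    "Max, all pods": (sum, max),
    "Live, all pods": (sum, last),
    "Average per pod": (mean, mean),
    "Max per pod": (mean, max),
    "Live per pod": (mean, last),
}

# Made-up memory usage, one list of values per pod, oldest first
PODS = [
    [100, 120, 110],
    [200, 210, 190],
]


def panel_value(series_list, across_series, over_time):
    """Merge the series point by point, then reduce the result over time."""
    merged = [across_series(point) for point in zip(*series_list)]
    return over_time(merged)


for name, (across_series, over_time) in PANELS.items():
    print(name, panel_value(PODS, across_series, over_time))
```
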
Adding Another Application

Now we have a pretty nice dashboard for tracking memory usage on an application. Let’s see what happens if we add a new application in OpenShift, under the same project.

This time I’ll use Bonjour from Helloworld MSA, which is a Node.js microservice. After adding it to OpenShift and again scaling it to 2 pods, let’s look at the dashboard.

Hmm, interesting. Our panels show new series: two for the Bonjour microservice and one docker-build. The latter is caused by the build I triggered when I created Bonjour. The sequence of events is quite obvious when looking at the graphs. We don’t necessarily want to monitor that, but it’s nice to see how far we can go with Hawkular and OpenShift. Anyway, we can filter it out using the top combo box Application.

But still, this is probably not what we would expect. What happens here is that the $app variable we set up in queries is resolved into as many container_name values as there are, which results in the same number of series in a single graph. We can change that behaviour very easily thanks to a nice feature of Grafana: on the existing row, to the left, open the Row editor and in Templating options activate duplication from variable app. Save and refresh the browser.

That’s better! By turning on row duplication based on our variable, Grafana has created 3 rows, and for each one it passes just one value of $app at a time to the Hawkular plugin.
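
Conceptually, row duplication works like the following Python sketch. The row and panel dicts are a simplified, hypothetical structure of my own, not Grafana’s actual dashboard JSON, and repeat_row is not a Grafana function.

```python
import copy

# Hypothetical, simplified row: one Graph panel querying by tags
ROW = {
    "panels": [
        {
            "title": "Memory usage",
            "tags": {"container_name": "$app", "descriptor_name": "memory/usage"},
        }
    ]
}


def repeat_row(row, variable, values):
    """Return one copy of the row per value, with $variable resolved to it."""
    token = "$" + variable
    rows = []
    for value in values:
        clone = copy.deepcopy(row)  # leave the template row untouched
        for panel in clone["panels"]:
            panel["title"] = panel["title"].replace(token, value)
            panel["tags"] = {
                key: tag.replace(token, value) for key, tag in panel["tags"].items()
            }
        rows.append(clone)
    return rows


for row in repeat_row(ROW, "app", ["aloha", "bonjour", "docker-build"]):
    print(row["panels"][0]["tags"]["container_name"])
```

Each copy queries a single container_name, which is why every row ends up showing only the series of one application.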

Now we can monitor our microservices quite easily! If we don’t want to see the docker-build instance, just filter it out with the top combo.

We will just add a little enhancement to the dashboard, to make it easier to see which row belongs to which app. On the first row, add a new Text panel, set its title empty, span 2, height 100px, mode HTML and content:

<center><p style='font-size: 40pt'>$app</p></center>

It will display the microservice name. After some layout arrangement, the dashboard is complete.

You can play around with OpenShift. Scale some pods up and down, and you’ll get the metrics updated in Grafana. Just note that on downscaling, you’ll have to wait a little bit (5 minutes) before seeing the Live metrics on the Singlestat panel being updated. This is because we’re not sure if the absence of data is due to a pod being shut down, or a simple delay between measurements.

It’s Just a Beginning

Thanks to the metrics provided in OpenShift, you can build more elaborate dashboards. Just change the descriptor_name tag and see what’s interesting for you: there are metrics on memory, CPU, network and filesystem.

But that’s just the starting kit! The Hawkular Metrics ecosystem is rich and keeps growing, including a WildFly agent, a Vert.x plugin, a Dropwizard reporter, etc. And if that’s not enough for you, it’s very easy to integrate your own metrics: either through the client libraries or by directly calling the Metrics REST API.

A good practice, when you define your own metrics, is to tag them with some pod-discriminant values. It can be through the environment variables set by Kubernetes/OpenShift, but it could also be the hostname since it’s generated specifically for a pod. With that in mind, you will be able to monitor every part of your microservices architecture.
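
As an illustration, here is a minimal Python sketch that builds such tags. It assumes the HOSTNAME environment variable is set inside the container and falls back to the system hostname otherwise. The names passed in extra_vars, the helper names and the id/tags layout of the definition are hypothetical, so check the Metrics REST API documentation before sending anything.

```python
import os
import socket


def pod_tags(environ=None, extra_vars=()):
    """Build pod-discriminant tags from the environment.

    extra_vars lists environment variable names to copy as tags;
    those names are hypothetical and depend on your deployment.
    """
    environ = os.environ if environ is None else environ
    tags = {"hostname": environ.get("HOSTNAME") or socket.gethostname()}
    for name in extra_vars:
        if name in environ:
            tags[name.lower()] = environ[name]
    return tags


def metric_definition(metric_id, environ=None, extra_vars=()):
    """Assumed shape of a tagged metric definition: an id plus its tags."""
    return {"id": metric_id, "tags": pod_tags(environ, extra_vars)}


print(metric_definition("requests.count"))
```

With the hostname in the tags, the same template variable trick used for container_name can be applied to your own metrics.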

Tip

You can download this dashboard from GitHub and import it into Grafana.

Published by Joel Takvorian on 24 October 2016