Metrics Documentation

Introduction

Hawkular Metrics is a scalable, asynchronous, multi tenant, long term metrics storage engine that uses Cassandra as the data store and REST as the primary interface. This section provides an overview of some of the key features of Hawkular Metrics. The following sections provide in-depth discussions on these as well as other features.

Scalability
Hawkular Metrics is all about scalability. You can run a single instance backed by a single Cassandra node. You can also scale out Cassandra to multiple nodes to handle increasing loads. The Hawkular Metrics server employs a stateless architecture, which makes it easy to scale out as well.

[Diagram: metrics scalability]

This diagram illustrates the various deployment options made possible with Hawkular Metrics' scalable architecture. The upper left shows the simplest deployment with a single Cassandra node and single Hawkular Metrics node.

The bottom left picture shows that it is possible to run more Hawkular Metrics nodes than Cassandra nodes. In reality this scenario may be somewhat unlikely; however, consider this example. The Cassandra node is running on a machine with long term persistent storage, but the Hawkular Metrics nodes are running in containers that can come or go at any time. Running multiple Hawkular Metrics nodes provides fault tolerance.

The upper right picture has a single Hawkular Metrics node with multiple Cassandra nodes. As load increases and there is a need to scale out from the simple deployment in the upper left, this is the next logical step. Because of its asynchronous, stateless design, a single Hawkular Metrics node can handle large numbers of requests. As the Cassandra cluster expands, that single Hawkular Metrics instance will be able to handle a larger load because it distributes the requests across Cassandra nodes.

The bottom right picture has multiple Cassandra nodes as well as multiple Hawkular Metrics nodes. Multiple Hawkular Metrics nodes are deployed in order to distribute load and to provide fault tolerance. Note that Hawkular Metrics itself does not provide load balancing. A separate load balancer would have to be put in front of the Hawkular Metrics nodes.

REST
JSON over REST is the primary interface of Hawkular Metrics. This makes it easier for users to get started and also makes integration easier since REST+JSON is widely used and easily understood. The API exposes a rich, growing set of features that includes:

Multi Tenancy
Hawkular Metrics provides virtual multi tenancy. All data is mapped to a tenant, but the data on disk is not physically partitioned by tenant. From an API perspective though, everything is partitioned by tenant. All requests, both reads and writes, must include a tenant id.

Metric Types
Hawkular Metrics supports common metric types including:

  • gauges

  • counters

  • rates

Tagging
Hawkular Metrics provides flexible tagging support that makes it easy to organize and group data. Tagging can also be used to provide additional information and context about data.

Querying
Hawkular Metrics offers a rich set of features around querying that are ideal for rendering data in graphs and in charts. This includes:

  • Filtering and grouping with tags

  • Searching metric definitions

  • Downsampling and aggregation

  • Limit and order results

  • Stacking

Automatic Data Removal
Each metric or time series can be thought of as a continuous stream of data. In systems like this, deleting and purging data presents some challenges due to the potentially unbounded data growth. Hawkular Metrics makes it much more manageable by providing automatic data deletion and removal.

Tenants

All data is partitioned by tenant. Data is not physically partitioned on disk. The partitioning happens at the API level. This means that a metric cannot exist on its own outside of a tenant. Let’s first look at how tenants are created.

Creating Tenants

Tenants are created in one of two ways. First, a tenant can be created implicitly by simply inserting metric data. Clients can immediately start storing data without first creating a tenant.

Implicit tenant creation
curl -X POST http://server/hawkular/metrics/gauges/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This is a request to insert gauge data points for the com.acme tenant. If that tenant does not already exist, it will be created when storing the metric data. Specific details on inserting data can be found in Inserting Data.

Tenants can also be created explicitly.

Explicit tenant creation
curl -X POST http://server/hawkular/metrics/tenants -d '{"id": "com.acme"}' \
-H "Content-Type: application/json"

The request body is pretty simple. It only requires an id property.

There is an important distinction between the two ways of creating tenants. The /tenants endpoint checks to see if a tenant with the specified id already exists. If one does, Hawkular Metrics returns an error response with a 409 status code. Implicit creation performs no such check.

Tenant Header

As previously stated, all data is partitioned by tenant. Hawkular Metrics enforces this by requiring the Hawkular-Tenant HTTP header in requests. The value of the header is the tenant id. We saw this already with the implicit tenant creation. The /tenants endpoint is the one exception in that it does not require the header.
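
As a rough illustration of the header requirement, the following Python sketch builds (but does not send) an insert request that carries the tenant id. The helper name `build_insert_request` is hypothetical; the URL path and header names are taken from the curl examples in this document.

```python
import json
import urllib.request

def build_insert_request(base_url, tenant_id, data_points):
    """Build (but do not send) a POST request that inserts gauge data points."""
    body = json.dumps(data_points).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/hawkular/metrics/gauges/raw",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Hawkular-Tenant": tenant_id,  # tenant id travels in this header
        },
        method="POST",
    )

request = build_insert_request("http://server", "com.acme", [])
# urllib stores header names using str.capitalize(), hence the lowercase "t".
print(request.get_header("Hawkular-tenant"))
```

HTTP header names are case-insensitive, so the normalized spelling used by `urllib` is equivalent to `Hawkular-Tenant` on the wire.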

Tenant Ids

A tenant has an id that uniquely identifies it. The id is a variable length, UTF-8 encoded string. Hawkular Metrics does not perform any validation checks to prevent duplicate ids. This is in large part due to Cassandra’s design. Among other things, Cassandra is a key/value store. Inserting a row into Cassandra is similar to inserting an entry into a map. If the key already exists in the map, it will simply be overwritten with the new value. This is exactly how Cassandra behaves.

If a duplicate id is used, data will be silently overwritten. Users are responsible for ensuring that tenant ids are unique.
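
The map analogy can be illustrated with a plain Python dictionary. This is only a sketch of the overwrite semantics, not of Cassandra or of the Hawkular Metrics schema; the `tenants` dictionary and the values stored in it are made up for the example.

```python
# A plain dict standing in for a key/value store (illustration only).
tenants = {}

# First write for the tenant id "com.acme".
tenants["com.acme"] = {"retentions": {"gauge": 10}}

# A second write with the same key replaces the earlier value without any error.
tenants["com.acme"] = {"retentions": {"gauge": 30}}

print(tenants["com.acme"])
print(len(tenants))
```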

Metrics

A metric represents a single time series that can be thought of as a continuous stream of data points. We will get into the details of data points in Inserting Data. For now, it is sufficient to know that a data point consists of a timestamp and a value.

The terms metric, metric definition, and time series will be used interchangeably throughout the documentation.

This section discusses metric types, metric ids, and metric creation.

Metric Types

Four types of metrics are currently supported:

  • Availability

  • Gauge

  • Counter

  • String

Availability

Represents the availability of a resource such as a host machine (physical or virtual) or an application server. There are only three supported availability types or values:

  • up

  • down

  • unknown

Availability is stored as a single, unsigned byte.

Gauge

Has a numeric value that can fluctuate, going up or down. Some examples of gauges include:

  • Available heap space in the JVM

  • Number of active HTTP sessions on a web server

  • Disk space used by a database

  • Execution time for a REST API call

With each of these examples, values can increase or decrease. In some instances, like JVM heap space, there are well-defined bounds for the possible values; however, that is not always the case.

A gauge value is stored as a 64-bit floating point number.

Counter

Has a numeric value that monotonically increases or decreases. Some examples include:

  • Total number of requests to a REST endpoint

  • Total number of request timeouts for a Cassandra node

  • Total number of request timeouts for a Cassandra cluster

These examples involve values that are always increasing. Note however that a counter can also be decreasing.

A counter value is stored as a 64-bit signed long.

There are two types of counters commonly used with time series databases (TSDBs). One stores the current count or total with each data point. The other stores the delta or increment with each data point. The former is more commonly used with counters that can easily be maintained by the client. Tracking the total number of requests to a REST endpoint for a specific server can be done easily by the client. Tracking the total number of requests for the endpoint across all servers, however, is more challenging. This can be done more easily by storing the deltas and allowing the TSDB to compute and maintain the total count.

Hawkular Metrics only supports the former in which each data point represents the total count; however, we can easily simulate counters that store deltas using gauges.
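
The document does not spell out how such a simulation would work. One possible reading, sketched below in Python under that assumption, is to store each increment as a gauge data point and add the increments up when reading them back. The function `running_totals` and the sample values are hypothetical.

```python
# Each gauge data point holds the increment observed since the previous report.
delta_points = [
    {"timestamp": 1460111065369, "value": 12},
    {"timestamp": 1460151065369, "value": 5},
    {"timestamp": 1460413065369, "value": 8},
]

def running_totals(points):
    """Turn per-interval increments into cumulative totals, oldest first."""
    total = 0
    totals = []
    for point in sorted(points, key=lambda p: p["timestamp"]):
        total += point["value"]
        totals.append({"timestamp": point["timestamp"], "value": total})
    return totals

totals = running_totals(delta_points)
print([p["value"] for p in totals])
```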

String

The String metric type stores arbitrary strings. Hawkular Metrics already has an Availability metric type, but it is limited to a predefined number of values which cannot easily be changed. In some cases a String type with arbitrary values would be a better fit for availability events. It can also be used for storing and tracking other types of events.

Note that there is currently a 2 KB limit for string data points. This limitation may be configurable in the future.

Rate

A rate is a derived metric whose values are computed from counter or gauge data points. Rate data points can be retrieved for any counter or gauge. They are represented as 64-bit floating point numbers.

Rate data points are not persisted. They are computed at query time.

Metric Ids

Every metric has an id that uniquely identifies it. The id consists of three parts - the tenant id, the metric type, and the metric name. The tenant id is a variable length, UTF-8 encoded string. The metric type is stored as a one byte integer. The metric name is stored as a variable length, UTF-8 encoded string.

The parts that comprise the metric id provide namespacing. A metric name only has to be unique for the metric type and the tenant. For example, suppose we have a tenant id of com.acme. The com.acme tenant could have a gauge named http_request_time and also have a counter named http_request_time.
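
The namespacing can be pictured as a composite key. The Python sketch below is only an illustration: the `MetricId` class is hypothetical, and it uses a readable string for the metric type where Hawkular Metrics stores a one byte integer.

```python
from typing import NamedTuple

class MetricId(NamedTuple):
    tenant_id: str
    metric_type: str  # a string here for readability; stored as one byte in Hawkular Metrics
    name: str

gauge = MetricId("com.acme", "gauge", "http_request_time")
counter = MetricId("com.acme", "counter", "http_request_time")

# Same tenant and same name, but different types, so the two ids do not collide.
definitions = {gauge: {}, counter: {}}
print(len(definitions))
```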

Creating Metrics

Just like tenants, metrics can be created implicitly while inserting data points. They can also be created explicitly. Let’s first look at the implicit approach.

Implicit gauge creation
curl -X POST http://server/hawkular/metrics/gauges/http_request_time/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This is a request to insert gauge data points for http_request_time under the com.acme tenant. The metric definition will be created if it does not already exist. The details on inserting data are covered in Inserting Data.

Here are examples of implicitly creating counter and availability metrics.

Implicit counter creation
curl -X POST http://server/hawkular/metrics/counters/http_requests/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Implicit availability creation
curl -X POST http://server/hawkular/metrics/availability/http_server/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Now let’s look at the alternative approach for creating metrics.

Explicit gauge creation
curl -X POST http://server/hawkular/metrics/gauges -d '{"id": "http_request_time"}' \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

The request body is pretty simple. It only requires an id property. Creating counter and availability metrics is pretty similar.

Explicit counter creation
curl -X POST http://server/hawkular/metrics/counters -d '{"id": "http_requests"}' \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Explicit availability creation
curl -X POST http://server/hawkular/metrics/availability -d '{"id": "http_server"}' \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

There is an important distinction between the two ways of creating metrics. The /gauges, /counters, and /availability endpoints check to see if a metric with the specified id already exists. If one does, Hawkular Metrics returns an error response with a 409 status code.

Identifiers

All identifiers are stored as variable length, UTF-8 encoded strings. This includes:

  • Tenant ids

  • Metric names (see the Metric Ids section for more details on metric names)

  • Tag keys (for both metric and data point tags)

At present there is no restriction on characters that can be used in identifiers. This may change in the future though (see HWKMETRICS-208 for details). For this reason it is recommended to restrict the characters to letters, numbers, underscore, period, and forward slash.

If an identifier uses a character that is defined as a special character in the HTTP spec, it must be encoded. Forward slashes are no exception. If, for example, a tenant id is com/acme, then in HTTP requests it should be encoded as com%2Facme.
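
A minimal Python sketch of this encoding step, using the standard library. The helper name `encode_identifier` is made up for the example.

```python
from urllib.parse import quote

def encode_identifier(identifier):
    """Percent-encode an identifier for use as a single URL path segment."""
    # safe="" makes quote() encode "/" as well; by default "/" is left as-is.
    return quote(identifier, safe="")

print(encode_identifier("com/acme"))
```

Identifiers that use only letters, numbers, underscores, and periods pass through unchanged.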

Inserting Data

Inserting data is a synchronous operation with respect to the client. An HTTP response is not returned until all data points are inserted. On the server side however, multiple inserts to the database are done in parallel to achieve higher throughput.

Data Points

A data point in Hawkular Metrics is a tuple that in its simplest form consists of a timestamp and a value. The value of a data point will vary depending on the metric type. Timestamps are unix timestamps in milliseconds.
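
As a small illustration, a client might build a data point like this in Python. The helper `make_data_point` is hypothetical; the `timestamp` and `value` property names follow the request bodies used throughout this document.

```python
import time

def make_data_point(value, timestamp_ms=None):
    """Return a data point dict with a unix timestamp in milliseconds."""
    if timestamp_ms is None:
        # time.time() returns seconds as a float; convert to whole milliseconds.
        timestamp_ms = int(time.time() * 1000)
    return {"timestamp": timestamp_ms, "value": value}

point = make_data_point(3.14, timestamp_ms=1460413065369)
print(point)
```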

Examples

There are several operations available for inserting data points.

Gauge Data

Insert data points for a single gauge
curl -X POST http://server/hawkular/metrics/gauges/request_size/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {"timestamp": 1460413065369, "value": 3.14},
  {"timestamp": 1460413025569, "value": 4.57},
  {"timestamp": 1460111065369, "value": 5.056}
]

The gauge name is request_size and the endpoint is /hawkular/metrics/gauges/$metric/raw. The value of the timestamp property should be a unix timestamp in milliseconds.

Insert data points for multiple gauges
curl -X POST http://server/hawkular/metrics/gauges/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {
    "id": "free_memory",
    "data": [
      {"timestamp": 1460111065369, "value": 2048},
      {"timestamp": 1460151065369, "value": 2012}
    ]
  },
  {
    "id": "used_memory",
    "data": [
      {"timestamp": 1460111065369, "value": 2048},
      {"timestamp": 1460151065369, "value": 2075}
    ]
  }
]

The request body is a bit more complex. Each array element is an object that has id and data properties. data contains an array of data points.
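
The shape of this request body can be generated from per-metric readings. The Python sketch below assumes the readings are held in a dictionary of `(timestamp, value)` pairs; the function name `build_bulk_payload` is hypothetical.

```python
import json

def build_bulk_payload(points_by_metric):
    """Convert {metric_id: [(timestamp_ms, value), ...]} into the bulk request body."""
    return [
        {
            "id": metric_id,
            "data": [{"timestamp": ts, "value": value} for ts, value in points],
        }
        for metric_id, points in points_by_metric.items()
    ]

payload = build_bulk_payload({
    "free_memory": [(1460111065369, 2048), (1460151065369, 2012)],
    "used_memory": [(1460111065369, 2048), (1460151065369, 2075)],
})
print(json.dumps(payload, indent=2))
```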

Counter Data

Insert data points for a single counter
curl -X POST http://server/hawkular/metrics/counters/total_requests/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {"timestamp": 1460413065369, "value": 69},
  {"timestamp": 1460413025569, "value": 65},
  {"timestamp": 1460111065369, "value": 51}
]
Insert data points for multiple counters
curl -X POST http://server/hawkular/metrics/counters/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {
    "id": "page_views",
    "data": [
      {"timestamp": 1460111065369, "value": 238},
      {"timestamp": 1460151065369, "value": 254}
    ]
  },
  {
    "id": "error_count",
    "data": [
      {"timestamp": 1460111065369, "value": 12},
      {"timestamp": 1460151065369, "value": 17}
    ]
  }
]

Availability Data

Insert data points for a single availability
curl -X POST http://server/hawkular/metrics/availability/server1/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {"timestamp": 1460413065369, "value": "down"},
  {"timestamp": 1460413025569, "value": "down"},
  {"timestamp": 1460111065369, "value": "up"}
]
Insert data points for multiple availabilities
curl -X POST http://server/hawkular/metrics/availability/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {
    "id": "server1",
    "data": [
      {"timestamp": 1460111065369, "value": "up"},
      {"timestamp": 1460151065369, "value": "up"}
    ]
  },
  {
    "id": "server2",
    "data": [
      {"timestamp": 1460111065369, "value": "unknown"},
      {"timestamp": 1460151065369, "value": "up"}
    ]
  }
]

Mixed Data

curl -X POST http://server/hawkular/metrics/metrics/data -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "gauges": [
    {
      "id": "free_memory",
      "data": [
        {"timestamp": 1460111065369, "value": 2048},
        {"timestamp": 1460151065369, "value": 2012}
      ]
    },
    {
      "id": "used_memory",
      "data": [
        {"timestamp": 1460111065369, "value": 2048},
        {"timestamp": 1460151065369, "value": 2075}
      ]
    }
  ],
  "counters": [
    {
      "id": "page_views",
      "data": [
        {"timestamp": 1460111065369, "value": 238},
        {"timestamp": 1460151065369, "value": 254}
      ]
    },
    {
      "id": "error_count",
      "data": [
        {"timestamp": 1460111065369, "value": 12},
        {"timestamp": 1460151065369, "value": 17}
      ]
    }
  ],
  "availabilities": [
    {
      "id": "server1",
      "data": [
        {"timestamp": 1460111065369, "value": "up"},
        {"timestamp": 1460151065369, "value": "up"}
      ]
    },
    {
      "id": "server2",
      "data": [
        {"timestamp": 1460111065369, "value": "unknown"},
        {"timestamp": 1460151065369, "value": "up"}
      ]
    }
  ]
}

Failures

If there is an error inserting a data point, the operation is aborted and any data in the request not yet written into the database will be ignored. When there is an error, there is no reliable way to determine the remaining data points that still need to be persisted. This is due to the fact that writes to the database are asynchronous and are done in parallel. This means data points will not necessarily be written in the order received.

Unless stated otherwise, it can be assumed that writes in Hawkular Metrics are idempotent as is the case with writing data points. If there is an error writing data points, the client can simply retry the request.
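
Because the writes are idempotent, a client-side retry loop is safe. The following Python sketch shows one way to structure it. The `send` callable is a stand-in supplied by the caller (it would perform the HTTP POST and raise on failure), and the retry count and delay are arbitrary choices, not recommendations from Hawkular Metrics.

```python
import time

def insert_with_retry(send, payload, attempts=3, delay_seconds=0.0):
    """Call send(payload) until it succeeds or `attempts` (at least 1) are used up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send(payload)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise last_error

# Usage with a stand-in sender that fails twice before succeeding.
calls = []

def flaky_send(payload):
    calls.append(payload)
    if len(calls) < 3:
        raise IOError("simulated write failure")
    return 200

status = insert_with_retry(flaky_send, [{"timestamp": 1460413065369, "value": 3.14}])
print(status, len(calls))
```

The whole request is resent on each attempt, which matches the point above that there is no reliable way to tell which data points were already persisted.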

Data Retention and Removal

Metric data is automatically deleted from the system after an amount of time that is determined by data retention settings. Data retention can be specified at various levels and is specified in days. There is a system-wide default of seven days. This setting will apply to all metrics in the system if no other settings are specified. The system-wide setting can be overridden at start up by either setting the hawkular.metrics.default-ttl system property or by setting the DEFAULT_TTL environment variable.

Data retention can also be set per tenant. To do this, you need to explicitly create the tenant as in the following example.

curl -X POST http://server/hawkular/metrics/tenants -d @request.json \
-H "Content-Type: application/json"
request.json
{
  "id": "com.acme",
  "retentions": {
    "gauge": 10,
    "counter": 5,
    "availability": 8
  }
}

This example uses the curl shell command. The request body is put in a file to improve readability. The retentions map consists of names of one or more metric types. The value of each is an integer which represents the data retention for that metric type in days.

You can also set data retention at the individual metric level. This would override any tenant data retention as well as the system-wide default. Here is an example.

curl -X POST http://server/hawkular/metrics/gauges -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "id": "request_size",
  "dataRetention": 10
}

This request creates a gauge named request_size with a data retention of 10 days.
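
The precedence described in this section (metric setting first, then tenant setting, then the system-wide default) can be sketched as a small lookup in Python. The function `effective_retention` is hypothetical and only mirrors the rules stated here. It is not how the server is implemented.

```python
SYSTEM_DEFAULT_RETENTION_DAYS = 7  # system-wide default stated in this section

def effective_retention(metric, tenant, system_default=SYSTEM_DEFAULT_RETENTION_DAYS):
    """Resolve retention in days: metric setting, then tenant setting, then default."""
    if metric.get("dataRetention") is not None:
        return metric["dataRetention"]
    tenant_retentions = tenant.get("retentions", {})
    if metric["type"] in tenant_retentions:
        return tenant_retentions[metric["type"]]
    return system_default

tenant = {"id": "com.acme", "retentions": {"gauge": 10, "counter": 5}}

print(effective_retention({"id": "request_size", "type": "gauge", "dataRetention": 20}, tenant))
print(effective_retention({"id": "request_count", "type": "counter"}, tenant))
print(effective_retention({"id": "server1", "type": "availability"}, tenant))
```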

Hawkular Metrics currently lacks APIs for changing data retention. See HWKMETRICS-380 for details.

Tagging

Tags in Hawkular Metrics are key/value pairs. Tags can be applied to a metric to provide metadata for the time series as a whole. Tags can also be applied to individual data points. Tags can be used to perform filtering in queries.

Creating Metric Tags

Create gauge with tags
curl -X POST http://server/hawkular/metrics/gauges -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "id": "request_size",
  "tags": {
    "datacenter": "dc1",
    "env": "stage",
    "units": "bytes"
  }
}
Create counter with tags
curl -X POST http://server/hawkular/metrics/counters -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "id": "request_count",
  "tags": {
    "datacenter": "dc1",
    "env": "stage",
    "units": "bytes"
  }
}
Create availability with tags
curl -X POST http://server/hawkular/metrics/availability -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "id": "server1",
  "tags": {
    "datacenter": "dc1",
    "env": "stage"
  }
}

Updating Metric Tags

These endpoints are used to add or replace tags.

Update gauge tags
curl -X PUT http://server/hawkular/metrics/gauges/request_size/tags -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "datacenter": "dc2",
  "hostname": "server1"
}
Update counter tags
curl -X PUT http://server/hawkular/metrics/counters/request_count/tags -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "datacenter": "dc2",
  "hostname": "server1"
}
Update availability tags
curl -X PUT http://server/hawkular/metrics/availability/server1/tags -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
{
  "datacenter": "dc2",
  "hostname": "server1"
}

Deleting Metric Tags

Delete gauge tags
curl -X DELETE http://server/hawkular/metrics/gauges/request_size/tags/env,status \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

The request specifies a comma-delimited list of tag names. This request deletes the tags named env and status.

Delete counter tags
curl -X DELETE http://server/hawkular/metrics/counters/request_count/tags/env,status \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Delete availability tags
curl -X DELETE http://server/hawkular/metrics/availability/server1/tags/env,status \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Data Point Tags

Tags can be added to individual data points. They are a bit different than metric tags because they are immutable. Tags cannot be added or updated after a data point is written. The following examples demonstrate how to add tags to data points.

Add gauge data points with tags
curl -X POST http://server/hawkular/metrics/gauges/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {
    "id": "request_size",
    "data": [
      {
        "timestamp": 1460111065369,
        "value": 2048,
        "tags": {
          "clientId": "1234",
          "zone": "us-east-1"
        }
      },
      {
        "timestamp": 1460151065369,
        "value": 2012,
        "tags": {
          "clientId": "5678",
          "zone": "us-west-1"
        }
      }
    ]
  },
  {
    "id": "request_time",
    "data": [
      {
        "timestamp": 1460111065369,
        "value": 2048,
        "tags": {
          "clientId": "1234",
          "zone": "us-east-1"
        }
      },
      {
        "timestamp": 1460151065369,
        "value": 2075,
        "tags": {
          "clientId": "5678",
          "zone": "us-west-1"
        }
      }
    ]
  }
]
Add counter data points with tags
curl -X POST http://server/hawkular/metrics/counters/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {
    "id": "request_count",
    "data": [
      {
        "timestamp": 1460111065369,
        "value": 2048,
        "tags": {
          "clientId": "1234",
          "zone": "us-east-1"
        }
      },
      {
        "timestamp": 1460151065369,
        "value": 3107,
        "tags": {
          "clientId": "5678",
          "zone": "us-west-1"
        }
      }
    ]
  },
  {
    "id": "request_timeouts",
    "data": [
      {
        "timestamp": 1460111065369,
        "value": 11,
        "tags": {
          "clientId": "1234",
          "zone": "us-east-1"
        }
      },
      {
        "timestamp": 1460151065369,
        "value": 15,
        "tags": {
          "clientId": "5678",
          "zone": "us-west-1"
        }
      }
    ]
  }
]
Add availability data points with tags
curl -X POST http://server/hawkular/metrics/availability/raw -d @request.json \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
request.json
[
  {
    "id": "server1",
    "data": [
      {
        "timestamp": 1460111065369,
        "value": "up",
        "tags": {
          "clientId": "1234",
          "zone": "us-east-1"
        }
      },
      {
        "timestamp": 1460151065369,
        "value": "up",
        "tags": {
          "clientId": "5678",
          "zone": "us-west-1"
        }
      }
    ]
  },
  {
    "id": "server2",
    "data": [
      {
        "timestamp": 1460111065369,
        "value": "down",
        "tags": {
          "clientId": "1234",
          "zone": "us-east-1"
        }
      },
      {
        "timestamp": 1460151065369,
        "value": "down",
        "tags": {
          "clientId": "5678",
          "zone": "us-west-1"
        }
      }
    ]
  }
]

Tag Filtering

Hawkular Metrics provides a mini tag filtering expression language that is available in several query APIs. It has a number of features including:

  • Search by tag key only, ignoring the value

    • Only exact match searches are supported for tag keys

  • Exact match search by key and value

  • Search for any number of tag values, i.e., logical OR

  • Regular expression support in tag value

  • Negation in tag value

  • Compound search filter

The remainder of this section provides several examples that illustrate the aforementioned features. Examples of how tag filtering is supported in various APIs can be found in Querying.

Expression                             Example                                  Description
tag_name:*                             zone:*                                   Search for tag named zone having any value.
tag_name:value                         zone:us-east-1                           Search for tag named zone having value us-east-1.
tag_name:value1|value2                 zone:us-east-1|us-west-1                 Search for tag named zone having a value of either us-east-1 or us-west-1.
tag_name:!value                        zone:!us-east-1                          Search for tag named zone with any value except us-east-1.
tag_name:regex                         hostname:.*01                            Search for tag named hostname with a value that ends with 01.
tag_name:value,tag_name:value          zone:us-east-1,hostname:dbserver01       Search for tag named zone with value us-east-1 and tag named hostname with value dbserver01.
tag_name:value,tag_name:value1|value2  zone:us-east-1,server:server01|server02  Search for tag named zone with value us-east-1 and tag named server having a value of either server01 or server02.
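
Filter expressions like the ones above are plain strings, so a client can assemble them with a few lines of code. The Python helper below is a hypothetical convenience, not part of any Hawkular Metrics client library. It joins alternative values with `|` and separate tags with `,`, as in the examples above.

```python
def tag_filter(**tags):
    """Build a tag filter expression; a list of values becomes a logical OR."""
    parts = []
    for name, value in tags.items():
        if isinstance(value, (list, tuple)):
            value = "|".join(value)
        parts.append(f"{name}:{value}")
    return ",".join(parts)

print(tag_filter(zone="us-east-1", server=["server01", "server02"]))
```

The helper does no escaping or validation. Values are passed through as written, so a negation such as `!us-east-1` or a regular expression such as `.*01` can be supplied directly.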

Querying

The examples provided in the following sections are not an exhaustive listing of the full API. For a complete reference see the complete REST API documentation.

Metric Definitions

These operations do not fetch data points but rather the metric definition itself.

Query for Metrics of specific type

Fetch gauge definitions
curl -X GET http://server/hawkular/metrics/gauges \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

The response body will look something like this:

[
  {
    "tenantId": "com.acme",
    "id": "gauge_1"
  },
  {
    "tenantId": "com.acme",
    "id": "gauge_2",
    "dataRetention": 20
  },
  {
    "tenantId": "com.acme",
    "id": "gauge_3",
    "dataRetention": 15,
    "tags": {
      "datacenter": "dc1",
      "hostname": "server01"
    }
  }
]

gauge_1 has neither any tags nor data retention defined. It uses the tenant data retention. If that is not defined, it uses the system default. gauge_2 has its own data retention of 20 days. gauge_3 has a data retention of 15 days and also defines some tags.

Tag filter queries can be used to filter the list of metrics returned.

Fetch counter definitions using tag filters
curl -X GET "http://server/hawkular/metrics/counters?tags=zone:us-west-1,kernel_version:4.0.9" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Query Across All Metric Types

You can query across all metric types. The optional type parameter filters the results by the specified types. The next example omits it and therefore returns definitions of every type.

Fetch all metric definitions
curl -X GET http://server/hawkular/metrics/metrics \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
response body
[
  {
    "tenantId": "com.acme",
    "id": "gauge_1",
    "type": "gauge"
  },
  {
    "tenantId": "com.acme",
    "id": "gauge_2",
    "type": "gauge",
    "dataRetention": 20
  },
  {
    "tenantId": "com.acme",
    "id": "request_count",
    "type": "counter"
  },
  {
    "tenantId": "com.acme",
    "id": "request_timeouts",
    "type": "counter",
    "dataRetention": 20
  }
]

The next example demonstrates querying across all metric types and filtering the results using tag filters.

Fetch all metric definitions with tag filters
curl -X GET "http://server/hawkular/metrics/metrics?tags=zone:us-west-1,kernel_version:4.0.9" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Raw Data

The simplest form of querying for raw data points does not require any parameters and returns a list of data points. This API is available for each metric type.

Simple request to fetch gauge data points
curl -X GET http://server/hawkular/metrics/gauges/request_size/raw \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Response with gauge data points
[
  {"timestamp": 1460413065369, "value": 3.14},
  {"timestamp": 1460212025569, "value": 4.57},
  {"timestamp": 1460111065369, "value": 5.056}
]
Simple request to fetch counter data points
curl -X GET http://server/hawkular/metrics/counters/request_count/raw \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Response with counter data points
[
  {"timestamp": 1460413065369, "value": 7},
  {"timestamp": 1460212025569, "value": 11},
  {"timestamp": 1460111065369, "value": 19}
]
Simple request to fetch availability data points
curl -X GET http://server/hawkular/metrics/availability/server1/raw \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Response with availability data points
[
  {"timestamp": 1460413065369, "value": "up"},
  {"timestamp": 1460212025569, "value": "up"},
  {"timestamp": 1460111065369, "value": "down"}
]

Date Range

Every query is bounded by a start and an end time. The end time defaults to now, and the start time defaults to 8 hours ago. These can be overridden with the start and end parameters respectively. The expected format of their values is a unix timestamp in milliseconds. The start of the range is inclusive while the end is exclusive.

Override start and end times for gauge
curl -X GET "http://server/hawkular/metrics/gauges/request_size/raw?start=1235&end=6789" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Override start and end times for counter
curl -X GET "http://server/hawkular/metrics/counters/request_count/raw?start=1235&end=6789" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Override start and end times for availability
curl -X GET "http://server/hawkular/metrics/availability/server1/raw?start=1235&end=6789" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

If the start time is greater than the end time, an error response will be returned with a 400 status code.
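
The range rules can be mirrored on the client side. The Python sketch below assumes the intended rule is that the start must not be later than the end, applies the documented defaults, and shows the inclusive start and exclusive end. The helpers `resolve_range` and `in_range` are hypothetical.

```python
import time

EIGHT_HOURS_MS = 8 * 60 * 60 * 1000

def resolve_range(start=None, end=None, now_ms=None):
    """Fill in the documented defaults and reject a start later than the end."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if end is None:
        end = now_ms
    if start is None:
        start = now_ms - EIGHT_HOURS_MS
    if start > end:
        # The server answers such a request with a 400 status code.
        raise ValueError("start must not be greater than end")
    return start, end

def in_range(timestamp, start, end):
    """The start of the range is inclusive, the end is exclusive."""
    return start <= timestamp < end

start, end = resolve_range(start=1235, end=6789)
print(in_range(1235, start, end))
print(in_range(6789, start, end))
```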

Sort Order

Data is sorted by timestamp and returned in sorted order by default. The order is specified with the order parameter. Accepted values are asc and desc. The parameter value is case-insensitive.

Return results in ascending order for a gauge
curl -X GET "http://server/hawkular/metrics/gauges/request_size/raw?order=ASC" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Return results in ascending order for a counter
curl -X GET "http://server/hawkular/metrics/counters/request_count/raw?order=ASC" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Return results in ascending order for an availability
curl -X GET "http://server/hawkular/metrics/availability/server1/raw?order=ASC" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Limiting Results

By default there is no limit on the number of data points returned. The limit parameter will limit the number of data points returned.

Limit results for gauge
curl -X GET "http://server/hawkular/metrics/gauges/request_size/raw?limit=100" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Limit results for counter
curl -X GET "http://server/hawkular/metrics/counters/request_count/raw?limit=100" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Limit results for availability
curl -X GET "http://server/hawkular/metrics/availability/server1/raw?limit=100" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Counter Rates

With counters, particularly when rendering graphs, we are often more interested in rates than in raw values. Hawkular Metrics generates rate data points on the server side, freeing the client from that work. This is done at query time: the delta in value between consecutive raw counter data points is divided by the delta between their timestamps, and the result is multiplied by 60,000 (the number of milliseconds in a minute) to give a per-minute rate.

Suppose we have the following counter data points:

Table 1. Counter data points

Timestamp  Value
60000      0
90000      200
210000     400
300000     550

To fetch the rates for the counter:

Fetch rate data points
curl -X GET http://server/hawkular/metrics/counters/request_count/rate \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"
Counter rates
[
  {"timestamp": 90000, "value": 400.00},
  {"timestamp": 210000, "value": 100.00},
  {"timestamp": 300000, "value": 100.00}
]

Note that the values are returned as floating point numbers.
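The rate calculation can be reproduced with a few lines of Python. This is a minimal sketch of the arithmetic described above, not the server's implementation; the function name per_minute_rates is an assumption made for illustration.

```python
MS_PER_MINUTE = 60_000


def per_minute_rates(points):
    """Return (timestamp, rate) pairs for consecutive counter data points.

    `points` is a list of (timestamp_ms, value) tuples in ascending time order.
    """
    rates = []
    for (prev_ts, prev_val), (ts, val) in zip(points, points[1:]):
        # Value delta per elapsed millisecond, scaled to a per-minute rate.
        rate = (val - prev_val) * MS_PER_MINUTE / (ts - prev_ts)
        rates.append((ts, rate))
    return rates


counter = [(60000, 0), (90000, 200), (210000, 400), (300000, 550)]
print(per_minute_rates(counter))
# prints [(90000, 400.0), (210000, 100.0), (300000, 100.0)]
```

Each rate is reported at the timestamp of the later of the two data points, which is why the first raw data point has no rate of its own.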

Counter Resets

Sometimes events occur that cause counters to reset. For instance, suppose we are tracking the total number of requests to a server since start up. Whenever the server is restarted, we will have a reset event. Hawkular Metrics detects a reset event whenever a counter value is less than the previous value. If resets are not handled, they can cause inconsistencies in graphs. Hawkular Metrics handles resets during rate calculations by excluding the data point where the reset is detected. Let’s illustrate this with an example.

Table 2. Counter data points with a reset event

Timestamp  Value
60000      0
90000      200
210000     130
300000     180

A reset event occurs some time between 90000 and 210000; consequently, we will get back the following rate data points:

Table 3. Rate data points with reset

Timestamp  Value
90000      400
300000     33.33

Note that we exclude the rate data point between 90000 and 210000 timestamps.
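Reset handling amounts to one extra check in the rate calculation. The Python sketch below assumes that a reset is any value lower than its predecessor, as stated above, and simply emits no rate for that data point. The function name rates_with_resets is hypothetical.

```python
MS_PER_MINUTE = 60_000


def rates_with_resets(points):
    """Compute per-minute rates, skipping data points where a reset is detected."""
    rates = []
    for (prev_ts, prev_val), (ts, val) in zip(points, points[1:]):
        if val < prev_val:
            # Reset detected: no rate is emitted for this data point.
            continue
        rates.append((ts, (val - prev_val) * MS_PER_MINUTE / (ts - prev_ts)))
    return rates


counter = [(60000, 0), (90000, 200), (210000, 130), (300000, 180)]
for ts, rate in rates_with_resets(counter):
    print(ts, round(rate, 2))
```

The rate at 300000 is still computed against the post-reset value at 210000, so only the interval containing the reset is dropped.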

Downsampling

Downsampling is a query technique for reducing the number of data points that are sent back to the client. Why is this done? When a request is made to render a graph, the client specifies a date range. The number of data points that fall within that range can and will vary. We want to avoid sending back too many data points because an excessive number of data points does little to improve the visualization, slows down the rendering, and makes the UI less responsive which in turn makes the user experience worse overall. Downsampling is a way to return a predictable or fixed number of data points which facilitates better graphs and a better overall user experience.

Hawkular Metrics provides several /stats endpoints that use downsampling. These endpoints are available for all metric types. Examples are provided in Querying Stats.

Buckets

Data points are first grouped into buckets. A bucket can have zero or more data points, and a data point will be in at most one bucket. Aggregation functions are then applied to the data points in each bucket to produce a single, bucketed data point.

Let’s look at a simple example to illustrate how data points are grouped.

Table 4. Data points

Data point  Timestamp
P1          15:00
P2          15:10
P3          15:20
P4          15:30
P5          15:40
P6          15:50

We have six data points. The values are irrelevant for the example. We query with a date range of 15:00 to 16:00. We use four buckets to end up with:

Table 5. Buckets

Bucket         Data points
15:00 - 15:15  P1, P2
15:15 - 15:30  P3
15:30 - 15:45  P4, P5
15:45 - 16:00  P6

The first thing to note is that a bucket is expressed as a date range or duration in which the start time is inclusive and the end time is exclusive. If a data point’s timestamp falls within that range, then the data point is grouped into that bucket. Different aggregation functions are applied depending on the metric type.
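The grouping rule can be sketched in Python as follows. This is an illustration of the inclusive-start, exclusive-end rule only; the function name group_into_buckets is an assumption, and timestamps are given as minutes since midnight to keep the example readable.

```python
def group_into_buckets(points, start, end, buckets):
    """Group (label, timestamp) pairs into equally sized buckets over [start, end)."""
    width = (end - start) / buckets
    grouped = [[] for _ in range(buckets)]
    for label, ts in points:
        # Start is inclusive, end is exclusive.
        if start <= ts < end:
            index = min(int((ts - start) // width), buckets - 1)
            grouped[index].append(label)
    return grouped


# Minutes since midnight: 15:00 is 900 and 16:00 is 960.
points = [("P1", 900), ("P2", 910), ("P3", 920), ("P4", 930), ("P5", 940), ("P6", 950)]
print(group_into_buckets(points, start=900, end=960, buckets=4))
# prints [['P1', 'P2'], ['P3'], ['P4', 'P5'], ['P6']]
```

Note that P4, at exactly 15:30, lands in the 15:30 - 15:45 bucket rather than the one before it.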

Bucket Query Parameters

There are two query parameters that are available with all /stats endpoints: buckets and bucketDuration. One and only one of them can be specified in a request. For the preceding example, we could end up with four buckets using either one of these parameters.

buckets specifies the exact number of buckets to use. For the preceding example, buckets=4 will divide the time range into four buckets. A higher value increases the number of buckets which in turn reduces the number of data points per bucket.

bucketDuration is a duration specified in one of:

  • milliseconds

  • seconds

  • minutes

  • hours

  • days

The value must match the regular expression (\d+)(ms|s|mn|h|d).

For the preceding example, bucketDuration=900000ms specifies a duration of 900,000 milliseconds or 15 minutes to yield four buckets.

Alternatively, we could do bucketDuration=900s which is 900 seconds or 15 minutes.

We could also do bucketDuration=15mn which is 15 minutes.

Suppose our date range spanned a 7 day period and we want a bucket per day. We could accomplish this with bucketDuration=24h which is 24 hours or 1 day. Alternatively we could do bucketDuration=1d which is 1 day.

A larger duration results in fewer buckets with more data points per bucket. A smaller duration results in more buckets with fewer data points per bucket.
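To make the duration format concrete, the following Python sketch converts a bucketDuration string to milliseconds using the pattern given above. The names duration_to_ms and UNIT_TO_MS are illustrative assumptions, not part of the Hawkular Metrics API.

```python
import re

DURATION_PATTERN = re.compile(r"(\d+)(ms|s|mn|h|d)")
UNIT_TO_MS = {"ms": 1, "s": 1_000, "mn": 60_000, "h": 3_600_000, "d": 86_400_000}


def duration_to_ms(text):
    """Convert a duration string such as '15mn' to milliseconds."""
    match = DURATION_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_TO_MS[unit]


# The three spellings of 15 minutes from the text are equivalent.
print(duration_to_ms("900000ms"), duration_to_ms("900s"), duration_to_ms("15mn"))
# A one hour range (3,600,000 ms) divided into 15 minute buckets gives four buckets.
print(3_600_000 // duration_to_ms("15mn"))
```

Because minutes are written mn, a value such as 15m does not match the pattern and is rejected.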

Numeric Bucket Data Points

Numeric bucket data points are used with gauges, counters, and rates. When data points are grouped into a bucket, several aggregation functions are applied to produce a data point that consists of a number of statistics.

Numeric bucket data point
{
  "start": 12345,
  "end": 67890,
  "empty": false,
  "min": 100.01,
  "avg": 107.5,
  "max": 115.32,
  "median": 109.0,
  "sum": 215.0,
  "samples": 5
}

start and end correspond to the bucket’s start and end times respectively.

empty is a boolean flag that indicates whether or not the bucket has any data points in it. We will see an example of an empty bucket next.

The min, max, avg, median, and sum properties should be self-explanatory. They hold the results of the aggregation functions applied over all the data points in the bucket.

samples is the total number of data points in the bucket.

The properties in a numeric data point are fixed and are the same for gauges, counters, and rates.

In the future, Hawkular Metrics may allow the client to specify which aggregation functions to use in the bucket data points. See HWKMETRICS-374 for details.

Now let’s see what an empty bucket data point looks like.

Empty numeric bucket data point
{
  "start": 12345,
  "end": 67890,
  "empty": true
}

The empty property is true indicating that there were no data points in the bucket. Note that the statistics related properties are excluded when the bucket is empty.
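The shape of a numeric bucket data point can be reproduced from a list of values. The Python sketch below uses exact statistics from the standard library; whether the server computes the median exactly or approximates it is not stated here, so treat that part as an assumption. The function name numeric_bucket is hypothetical.

```python
from statistics import median


def numeric_bucket(start, end, values):
    """Aggregate the values that fall into one bucket into a bucket data point."""
    if not values:
        # Statistics are left out entirely when the bucket is empty.
        return {"start": start, "end": end, "empty": True}
    return {
        "start": start,
        "end": end,
        "empty": False,
        "min": min(values),
        "avg": sum(values) / len(values),
        "max": max(values),
        "median": median(values),
        "sum": sum(values),
        "samples": len(values),
    }


print(numeric_bucket(0, 60000, [100.0, 110.0, 120.0]))
print(numeric_bucket(60000, 120000, []))
```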

A bucket data point can also have an optional set of percentiles.

Bucket data point with percentiles
{
  "start": 12345,
  "end": 67890,
  "empty": false,
  "min": 100.01,
  "avg": 107.5,
  "max": 115.32,
  "median": 109.0,
  "sum": 215.0,
  "percentiles": [
    {
      "quantile": 0.90,
      "value": 100.01
    },
    {
      "quantile": 0.95,
      "value": 108.42
    },
    {
      "quantile": 0.99,
      "value": 115.25
    }
  ],
  "samples": 5
}

This data point includes the 90th, 95th, and 99th percentiles. Unless the request explicitly asks for percentiles, they will be omitted. The gauge examples in Querying Stats show how the percentiles query parameter is used.

Querying Stats

This section provides examples of all the /stats endpoints for the different metric types.

Querying Gauges

Fetch gauge stats using buckets parameter
curl -X GET "http://server/hawkular/metrics/gauges/request_size/stats?start=1235&end=6789&buckets=60" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request queries a gauge named request_size and specifies that 60 buckets be used. An array of numeric bucket data points is returned.

Fetch gauge stats using bucketDuration parameter
curl -X GET "http://server/hawkular/metrics/gauges/request_size/stats?start=1235&end=6789&bucketDuration=60000ms" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request uses the bucketDuration parameter and specifies that each bucket is a minute wide.

The next example demonstrates the percentiles query parameter.

Fetch gauge stats that include percentiles
curl -X GET "http://server/hawkular/metrics/gauges/request_size/stats?start=1235&end=6789&buckets=30&percentiles=75,90,99" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

The percentiles parameter takes a comma-delimited list of numeric values in which each value must be between 0 and 100.
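The constraints on the /stats query parameters can be checked on the client before a request is sent. The following Python sketch builds the query URL and enforces two rules from the text: exactly one of buckets and bucketDuration, and percentiles between 0 and 100. The helper name stats_query_url and its keyword arguments are assumptions for illustration.

```python
from urllib.parse import urlencode


def stats_query_url(base, metric_type, metric_id, start, end,
                    buckets=None, bucket_duration=None, percentiles=()):
    """Build a /stats query URL for a single metric."""
    # Exactly one of buckets and bucketDuration may be given.
    if (buckets is None) == (bucket_duration is None):
        raise ValueError("specify exactly one of buckets and bucket_duration")
    for p in percentiles:
        if not 0 <= p <= 100:
            raise ValueError(f"percentile out of range: {p}")
    params = [("start", start), ("end", end)]
    if buckets is not None:
        params.append(("buckets", buckets))
    else:
        params.append(("bucketDuration", bucket_duration))
    if percentiles:
        params.append(("percentiles", ",".join(str(p) for p in percentiles)))
    # safe="," leaves the comma-delimited percentile list unescaped.
    return f"{base}/{metric_type}/{metric_id}/stats?{urlencode(params, safe=',')}"


print(stats_query_url("http://server/hawkular/metrics", "gauges", "request_size",
                      1235, 6789, buckets=30, percentiles=(75, 90, 99)))
```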

You can also query across multiple gauges. The set of metrics to query is determined by using either tag filters or by specifying a list of metric names.

Fetch stats from multiple gauges by name
curl -X GET "http://server/hawkular/metrics/gauges/stats?start=12345&end=56789&buckets=100&metrics=G1&metrics=G2&metrics=G3" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request fetches data points from gauges G1, G2, and G3. The only difference from previous examples is that each bucket will contain data points from multiple metrics.

Next we use tag filters to select the set of metrics to query.

Fetch stats from gauges using tag filters
curl -X GET "http://server/hawkular/metrics/gauges/stats?start=1235&end=6789&buckets=30&tags=hostname:server1" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Querying Counters

Now we look at the /stats endpoints for counters, which are virtually the same as those for gauges.

Fetch counter stats using buckets parameter
curl -X GET "http://server/hawkular/metrics/counters/total_requests/stats?start=1235&end=6789&buckets=60" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request queries a counter named total_requests and specifies that 60 buckets be used. An array of numeric bucket data points is returned.

Fetch counter stats using bucketDuration parameter
curl -X GET "http://server/hawkular/metrics/counters/total_requests/stats?start=1235&end=6789&bucketDuration=60s" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request uses the bucketDuration parameter and specifies that each bucket is a minute wide.

Fetch counter stats that include percentiles
curl -X GET "http://server/hawkular/metrics/counters/total_requests/stats?start=1235&end=6789&buckets=30&percentiles=75,90,99" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

You can also query across multiple counters. The set of metrics to query is determined by using either tag filters or by specifying a list of metric names.

Fetch stats from multiple counters by name
curl -X GET "http://server/hawkular/metrics/counters/stats?start=12345&end=56789&buckets=100&metrics=C1&metrics=C2&metrics=C3" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request fetches data points from counters C1, C2, and C3. The only difference from previous examples is that each bucket will contain data points from multiple metrics.

Next we use tag filters to select the set of metrics to query.

Fetch stats from counters using tag filters
curl -X GET "http://server/hawkular/metrics/counters/stats?start=1235&end=6789&buckets=30&tags=hostname:server1" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Querying Counter Rates

Downsampling can be done with rates as well.

Fetch rates stats using buckets parameter
curl -X GET "http://server/hawkular/metrics/counters/total_requests/rate/stats?start=1235&end=6789&buckets=60" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request queries the rate for a counter named total_requests and specifies that 60 buckets be used. An array of numeric bucket data points is returned.

Fetch rate stats using bucketDuration parameter
curl -X GET "http://server/hawkular/metrics/counters/total_requests/rate/stats?start=1235&end=6789&bucketDuration=1mn" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request uses the bucketDuration parameter and specifies that each bucket is a minute wide.

Fetch rate stats that include percentiles
curl -X GET "http://server/hawkular/metrics/counters/total_requests/rate/stats?start=1235&end=6789&buckets=30&percentiles=75,90,99" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

You can also query for rates across multiple counters. The set of metrics to query is determined by using either tag filters or by specifying a list of metric names.

Fetch rate stats from multiple counters by name
curl -X GET "http://server/hawkular/metrics/counters/rate/stats?start=12345&end=56789&buckets=100&metrics=C1&metrics=C2&metrics=C3" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request fetches rate data points from counters C1, C2, and C3. The only difference from previous examples is that each bucket will contain data points from multiple metrics.

Next we use tag filters to select the set of metrics to query.

Fetch rate stats from counters using tag filters
curl -X GET "http://server/hawkular/metrics/counters/rate/stats?start=1235&end=6789&buckets=30&tags=hostname:server1" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

Availability Bucket Data Points

Availability bucket data points are used with availability metrics. When data points are grouped into a bucket, several aggregation functions are applied to produce a data point that consists of several statistics.

Availability bucket data point
{
  "start": 12345,
  "end": 67890,
  "empty": false,
  "downtimeDuration": 29311,
  "lastDowntime": 12367,
  "uptimeRatio": 0.78,
  "downtimeCount": 12
}

start and end correspond to the bucket’s start and end times respectively.

empty is a boolean flag that indicates whether or not the bucket has any data points in it. We will see an example of an empty bucket next.

downtimeDuration is the total time in milliseconds that the metric was reported down. Note that this is the total time within the bucket’s start and end times.

lastDowntime is the last time within the bucket’s time range that the metric was reported down. The value is in milliseconds.

uptimeRatio is the fraction of the bucket’s duration during which the metric was up. The value will be a floating point number between zero and one.

downtimeCount is the number of periods in which a resource is reported down. In this context a period is a range of consecutive data points in which the availability does not change. For example, if a resource reports down twice in a row, then up, and then down again, downtimeCount will be 2.
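The definition of downtimeCount translates directly into a short loop. This Python sketch counts runs of consecutive "down" values and covers only that one statistic; the function name downtime_count is an assumption.

```python
def downtime_count(values):
    """Count periods of consecutive "down" data points."""
    count = 0
    previous = None
    for value in values:
        # A new downtime period starts whenever "down" follows anything else.
        if value == "down" and previous != "down":
            count += 1
        previous = value
    return count


print(downtime_count(["down", "down", "up", "down"]))  # prints 2
```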

Now let’s look at what an empty data point looks like.

Empty availability bucket data point
{
  "start": 12345,
  "end": 67890,
  "empty": true
}

Note that the statistics related properties are omitted when the bucket is empty.

Querying Availability

Fetch availability stats using buckets parameter
curl -X GET "http://server/hawkular/metrics/availability/server1/stats?start=1235&end=6789&buckets=60" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request queries an availability metric named server1 and specifies that 60 buckets be used. An array of availability bucket data points is returned.

Fetch availability stats using bucketDuration parameter
curl -X GET "http://server/hawkular/metrics/availability/server1/stats?start=1235&end=6789&bucketDuration=60s" \
-H "Content-Type: application/json" -H "Hawkular-Tenant: com.acme"

This request uses the bucketDuration parameter and specifies that each bucket is a minute wide.

There is currently no API for fetching bucket data points across multiple availability metrics.

Alerting

Hawkular Metrics includes Hawkular Alerting. This allows users to quickly leverage their metric data with robust alerting.

Triggers can be defined to act either on the incoming stream of metrics (near real time) or via queries of the persisted metric data. This combination of stream and query-based alerting can be powerful.

It may be useful to get familiar with the capabilities and terminology of Hawkular Alerting before getting into the details of using it with Hawkular Metrics. For more information, see the Hawkular Alerting documentation.

Stream-Based Alerting

This approach defines triggers using Alerting’s built-in condition types and evaluates the conditions on the incoming metric data. It provides the fastest possible issue detection. For example, given a response time gauge metric named MyAppResponseTime, a trigger could be defined with Threshold condition (MyAppResponseTime > 750). There are many condition types and trigger options but the main point here is that the threshold test is evaluated on incoming metric data, as it is being persisted.

MetricId Prefixes

There is one technical note. To use Hawkular Alerting with Hawkular Metrics there is a naming convention when defining trigger conditions. For a metric with name 'X', the alerting DataId to reference it will be 'prefix_X', where the prefix depends on the metric’s type. For example, the MyAppResponseTime metric, used above, is a 'gauge'. The 'hm_g' prefix would be used. So, the actual condition would be defined as (hm_g_MyAppResponseTime > 750).

This is done because Hawkular Metrics allows the same metric name for different types, and Hawkular Alerting then needs the prefix to uniquely identify the correct metric.

Table 6. Metric Prefixes
Prefix  Metric Type
hm_a    availability
hm_c    counter
hm_cr   counter rate
hm_g    gauge
hm_gr   gauge rate
hm_s    string
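The naming convention is easy to apply programmatically. The Python sketch below maps a metric type to its prefix, using the values from the table, and builds the alerting DataId. The function name alerting_data_id is hypothetical.

```python
PREFIXES = {
    "availability": "hm_a",
    "counter": "hm_c",
    "counter rate": "hm_cr",
    "gauge": "hm_g",
    "gauge rate": "hm_gr",
    "string": "hm_s",
}


def alerting_data_id(metric_type, metric_name):
    """Return the DataId that a trigger condition uses for the given metric."""
    return f"{PREFIXES[metric_type]}_{metric_name}"


print(alerting_data_id("gauge", "MyAppResponseTime"))  # prints hm_g_MyAppResponseTime
```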

Metrics has some special support for stream-based alerting; see Metrics Group Triggers for the details.

Query-Based Alerting

Whereas stream-based alerting is applied to metric data as it arrives, it is often the case that issues can be detected by looking at historical behavior of the persisted data. Hawkular Metrics provides its own Alerter to provide this capability. An Alerter extends Hawkular Alerting by allowing external condition evaluators. In this case, the Metrics Alerter performs queries looking for user-defined conditions, and when found it then informs Hawkular Alerting, which in turn may fire a trigger and generate an alert.

The mechanism is the same as when defining a stream-based trigger, except in this case the trigger uses an ExternalCondition. The ExternalCondition supplies an expression string that is understood and processed by the alerter. The metrics alerter defines a robust language for defining the external condition expression string.

Expression Language

The expression language allows the user to define one or more queries which can then be incorporated into a flexible eval expression. For example, let’s say we want to detect an increase in response time day over day, using the MyAppResponseTime metric. We’d define two queries; let’s call them qNow and q1d. qNow will query for the most recent MyAppResponseTime data and q1d will query for the data from 1 day back. Additionally, our queries would define the duration, let’s say one hour. So, to detect a 25% increase in the average response time for the most recent hour, compared to the average for the same hour yesterday, we could define an eval expression like:

  ( q(qNow,avg) > ( 1.25 * q(q1d,avg) ) )

The eval expressions are built on EvalEx and support many operators and built-in functions. See EvalEx for a complete description of supported operators, functions and constants.

The query variables are of the format:

  q(<queryName>,<function>)

Each query variable will be replaced by the actual data fetched from Hawkular Metrics, and then the eval expression will be evaluated. If the evaluation returns true then Hawkular Alerting is informed such that the ExternalCondition matches, potentially firing a trigger.
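The substitute-then-evaluate flow can be sketched in Python. This is an illustration only: the real alerter evaluates expressions with EvalEx, whereas this sketch uses Python's eval as a stand-in and accepts plain arithmetic and comparisons only. The names substitute and evaluate, and the shape of the results dictionary, are assumptions.

```python
import re

QUERY_VAR = re.compile(r"q\(\s*(\w+)\s*,\s*(%?\w+)\s*\)")
ARITHMETIC_ONLY = re.compile(r"[\d\s.()+\-*/<>=!]+")


def substitute(expression, results):
    """Replace each q(<queryName>,<function>) with its fetched value."""
    def lookup(match):
        name, function = match.groups()
        return repr(float(results[name][function]))
    return QUERY_VAR.sub(lookup, expression)


def evaluate(expression, results):
    """Resolve the query variables, then evaluate the resulting expression."""
    resolved = substitute(expression, results)
    # Stand-in for EvalEx: refuse anything but numbers and basic operators.
    if not ARITHMETIC_ONLY.fullmatch(resolved):
        raise ValueError(f"unsupported expression: {resolved!r}")
    return bool(eval(resolved, {"__builtins__": {}}))


results = {"qNow": {"avg": 61.0}, "q1d": {"avg": 40.0}}
expression = "( q(qNow,avg) > ( 1.25 * q(q1d,avg) ) )"
print(substitute(expression, results))  # prints ( 61.0 > ( 1.25 * 40.0 ) )
print(evaluate(expression, results))    # prints True
```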

The queryName refers to one of the queries defined in the expression. The functions available depend on the MetricType of the data being queried, and are analogous to the /stats endpoints in the Metrics REST API.

Table 7. Query Functions
MetricType                                Available Functions
Gauge, Gauge Rate, Counter, Counter Rate  avg, max, median, min, percentiles, samples, sum
Availability                              notUpCount, notUpDuration, samples, upCount, uptimeRatio
String                                    Not supported by the Alerter; use stream-based conditions

Percentiles

The percentiles available depend on the percentiles requested in the query definition. By default no percentiles are available, but any requested percentile can be referenced in the query function. For example, the 90th percentile would be referenced like this:

  q(qNow,%90)

Other Examples

If 90th percentile > twice the median

  ( q(qNow,%90) > ( 2 * q(qNow,median) ) )

If < 2 heartbeat avails have been reported in the last minute

  ( q(qNow,upCount) < 2 )

ConditionExpression

The ConditionExpression fully defines the condition to be evaluated. It is defined as follows and is supplied as a JSON string (the ExternalCondition’s expression value must be a String).

ConditionExpression
  queries        List<Query>
  frequency      duration string (see below), time between query-executions/evals
  eval           string, expression determining condition match
  evalType       string, one of [ALL (default), EACH]. See more below.
  quietCount     int, >= 0, default=0. See more below.


Query
  name           string, for referencing the query in the eval
  type           string, optional, target metric type for query (default=gauge)
  metrics        Set<String> of metricIds (see more below)
                   required if tags not specified, otherwise not permitted
  tags           string, a metrics 'tag query expression' (see more below)
                   required if metrics not specified, otherwise not permitted
  percentiles    Set<String>, optional, requested percentiles (e.g. {"90","95"})
  duration       duration string (see below)
                   defines length of time range for data to be queried
  offset         duration string (see below), optional
                   defines time range offset, Default is no offset.
                     - start = now - offset - duration
                     - end = start + duration

EvalType

ConditionExpression.evalType determines whether the eval expression is performed on ALL or EACH of the metrics. The Query.metrics and Query.tags fields define the set of metrics that will be involved in the query. The query will result in a set of data points for each metric. This field defines how to perform the ConditionExpression.eval for the resulting data points.

ALL: The data points for all of the metrics will be combined, then aggregated, and ConditionExpression.eval is resolved one time (per run of the ConditionExpression). At most one Event will be sent to Alerting if ConditionExpression.eval resolves to true.

EACH: The data points for each metric will be aggregated separately. ConditionExpression.eval is resolved N times, once for each metric. Up to N Events could be sent to Alerting depending on how often ConditionExpression.eval resolves to true.

For example, assume we are dealing with two metrics, M1 and M2, the eval is q(qNow,avg) > 100, and qNow is defined with a duration of 1h. On each run we get the M1 and M2 data points for the most recent hour. Using ALL we combine all of the data points, generate the average, and resolve the expression. If true we send an Event to Alerting. Using EACH we keep the data points for each metric separate, generate the average for each, and resolve the expression for each. For each true resolution we send an Event to Alerting. Note that to distinguish the events we provide the metric name in the Event.context. For example: context={"MetricName":"M1"}.

A note about using EACH: if multiple queries are involved in the eval expression, then only metrics common to each query will be evaluated. For example, if the eval above were q(qNow,avg) > q(qYesterday,avg) and only M1 were common to both queries, then the eval would only be resolved using M1 data points. In general each query will likely use the same metrics, so this issue is a corner case.

The default is ALL.
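The difference between ALL and EACH can be illustrated with a small Python sketch. It models only the averaging example above and returns the context of each Event that would be sent. The function name evaluate_metrics and the sample values are assumptions, and every metric is assumed to have at least one data point.

```python
def evaluate_metrics(data, eval_type, predicate):
    """Return the contexts of the Events that would be sent to Alerting.

    `data` maps metric name to its list of values; `predicate` receives an average.
    """
    events = []
    if eval_type == "ALL":
        # Combine the data points of all metrics, then aggregate once.
        combined = [v for values in data.values() for v in values]
        if predicate(sum(combined) / len(combined)):
            events.append({})
    elif eval_type == "EACH":
        # Aggregate and evaluate each metric separately.
        for metric, values in data.items():
            if predicate(sum(values) / len(values)):
                events.append({"MetricName": metric})
    else:
        raise ValueError(f"unknown evalType: {eval_type}")
    return events


def above_100(avg):
    return avg > 100


data = {"M1": [90.0, 95.0], "M2": [120.0, 130.0]}
print(evaluate_metrics(data, "ALL", above_100))   # prints [{}]
print(evaluate_metrics(data, "EACH", above_100))  # prints [{'MetricName': 'M2'}]
```

With these values the combined average is 108.75, so ALL produces one Event, while EACH produces an Event only for M2, whose average is 125.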

QuietCount

Setting ConditionExpression.evalType=EACH means the expression is evaluated for each metric targeted by the query. It allows one trigger definition to apply to many different metrics, which can be powerful. The downside is that trigger options like autoDisable or autoResolve become difficult to use. If the trigger fires because of metric A and then disables, it may miss issues for metric B. autoResolve has the same problem: it is hard to apply because of the variety of metrics that may have fired the trigger.

Although not a replacement for the options mentioned above, setting ConditionExpression.quietCount can be useful. If set > 0 it suppresses redundant firing for the same metric until that metric has returned to a steady state. Put another way, if set to N it means: after the trigger fires for metric A, stay quiet (no firing for A) until after we see N false evaluations for metric A. The count is tracked independently per metric, so a firing for metric A does not affect metric B.
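The per-metric bookkeeping behind quietCount can be sketched in Python as follows. The sketch assumes that the N false evaluations need not be consecutive and that a suppressed true evaluation does not restart the count; the text does not settle either point. The class name QuietCounter is hypothetical.

```python
from collections import defaultdict


class QuietCounter:
    """Track, per metric, whether a true evaluation is allowed to fire."""

    def __init__(self, quiet_count):
        self.quiet_count = quiet_count
        # Per metric: false evaluations still required before firing again.
        self.remaining = defaultdict(int)

    def should_fire(self, metric, eval_result):
        if not eval_result:
            if self.remaining[metric] > 0:
                self.remaining[metric] -= 1
            return False
        if self.remaining[metric] > 0:
            return False  # still quiet for this metric
        self.remaining[metric] = self.quiet_count
        return True


quiet = QuietCounter(quiet_count=2)
evaluations = [("A", True), ("A", True), ("B", True), ("A", False),
               ("A", True), ("A", False), ("A", True)]
print([quiet.should_fire(metric, result) for metric, result in evaluations])
# prints [True, False, True, False, False, False, True]
```

Metric B fires on its first true evaluation even though A is quiet, because the counts are kept separately.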

Metrics and Tags

The Query.metrics and Query.tags fields are used to specify the metrics involved in the query, and therefore the data being retrieved. One of the fields must be specified. They are mutually exclusive, so specifying both is an error.
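When writing the ConditionExpression JSON by hand, the metrics and tags rule is worth checking up front. The Python sketch below builds the JSON with the field names listed in the ConditionExpression and Query definitions; the helper build_query is an assumption and is not one of the provided Java classes.

```python
import json


def build_query(name, duration, metrics=None, tags=None, metric_type="gauge",
                percentiles=None, offset=None):
    """Build one Query entry, enforcing that metrics and tags are exclusive."""
    # Exactly one of metrics and tags must be supplied.
    if (metrics is None) == (tags is None):
        raise ValueError("specify exactly one of metrics and tags")
    return {
        "name": name,
        "type": metric_type,
        "metrics": sorted(metrics) if metrics is not None else None,
        "tags": tags,
        "percentiles": sorted(percentiles) if percentiles else None,
        "duration": duration,
        "offset": offset,
    }


expression = {
    "queries": [
        build_query("qNow", "1h", metrics={"MyAppResponseTime"}),
        build_query("q1d", "1h", metrics={"MyAppResponseTime"}, offset="1d"),
    ],
    "frequency": "30mn",
    "evalType": "ALL",
    "eval": "( q(qNow,avg) > ( 1.25 * q(q1d,avg) ) )",
}
# The ExternalCondition's expression value must be a string, so serialize it.
print(json.dumps(expression, indent=4))
```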

Duration

Fields specified as a Duration follow the standard Duration format.

Events Sent to Alerting

When ConditionExpression.eval resolves to true the Alerter sends an Event to Alerting. The Event is designed to match the ExternalCondition from which the expression was built:

Table 8. Event
Field     Value
tenantId  ExternalCondition’s tenantId (from the owning Trigger)
id        A generated UUID
ctime     current time in ms
dataId    ExternalCondition’s dataId
category  "MetricsCondition"
text      Values used in eval (e.g. "{q(qNow,avg)=61.0}")
context   "MetricName":<EvalMetricName> (Map entry provided when EvalType=EACH)

The Events are for matching only and are not persisted in the Alerting database. For any generated Alert, however, the Event will be provided in Alert.conditionEvals in order to help explain the Alert.

Generating the JSON

The ConditionExpression JSON can be generated by hand, but it may be easier to utilize the provided Java classes. (TODO: Determine where these classes should be provided and provide Maven dependency). To generate the JSON, construct the ConditionExpression and then call:

  ConditionExpression.toJson()

The following code snippet from AlerterITest.groovy is an example:

Query qNow = new Query("qNow", Collections.singleton(metricId), null, null, ResultType.combined, "1mn", null);
Query q1d = new Query("q1d", "1d", qNow);
ConditionExpression ce = new ConditionExpression( Arrays.asList(qNow, q1d), "1h",
            "((q(qNow,avg) - q(q1d,avg)) / q(q1d,avg)) > 0.25" )
println( JsonOutput.prettyPrint( ce.toJson()) )

Resulting in:

{
    "queries": [
        {
            "name": "qNow",
            "type": "gauge",
            "metrics": [
                "alerts-test-avgd-1484757539171"
            ],
            "tags": null,
            "percentiles": [
                "90"
            ],
            "duration": "1h",
            "offset": null
        },
        {
            "name": "q1d",
            "type": "gauge",
            "metrics": [
                "alerts-test-avgd-1484757539171"
            ],
            "tags": null,
            "percentiles": [
                "90"
            ],
            "duration": "1h",
            "offset": "1d"
        }
    ],
    "frequency": "30mn",
    "evalType": "ALL",
    "eval": "(q(qNow,avg) > ( 1.25 * q(q1d,avg) ))"
}

Metrics Group Triggers

Metrics Group Triggers (MGT) is an extension of the group trigger feature in Hawkular Alerting. It provides a way to define a single trigger and have it apply to an unknown and changing set of metrics. The single trigger is known as a Group Trigger and it acts like a template for a set of Member Triggers. For example, let’s say you want to fire an alert whenever FreeSpace < 10% on any file system. We’d like to define a single trigger and have it apply to every file system. We can do this with a Metrics Group Trigger.

The MGT could look something like this (as an hAlerting full trigger JSON):

{
"trigger": {
  "type": "GROUP",
  "tenantId": "MyProject",
  "id": "MGT-FreeSpace",
  "name": "MGT FreeSpace",
  "description": "An alert to notify admins that disk space is running low.",
  "autoDisable": true,
  "enabled": true,
  "tags": {
    "HawkularMetrics": "GroupTrigger",
    "DataIds": "FreeSpace",
    "FreeSpace": "name = FreeSpace",
    "SourceBy": "*"
  }
},
"conditions": [
  {
    "type": "THRESHOLD",
    "dataId": "FreeSpace",
    "operator": "LT",
    "threshold": 10
  }
]
}

Pay careful attention to the tagging on the trigger; it’s the tags that make this a Metrics Group Trigger.

  • "HawkularMetrics": "GroupTrigger"

    • This tag identifies the group trigger as an MGT to be processed by the hMetrics alerter.

  • "DataIds": "FreeSpace"

    • This tag declares the dataIds used in the group trigger conditions.

    • In our example we have only one condition using only one dataId, "FreeSpace".

  • "FreeSpace": "name = FreeSpace"

    • This tag provides the metrics tag query that identifies the set of metrics to use for member trigger creation.

    • In this example each metric tagged with {"name": "FreeSpace"} will be used.

    • See Tag Filtering for more details.

  • "SourceBy": "*"

    • This tag is used to ensure member triggers use related metrics. It is described in more detail below.

    • When using a single dataId it will commonly be set to "*" because relating multiple metrics is not applicable.

Tagging

Metrics Group Triggers require tagging on the MGT itself and, in most cases, on the metric definitions as well. This section discusses both the trigger and metric tagging.

Metric Group Trigger Tagging

Each Metrics Group Trigger has several tags to guide its handling:

Tag Name           Tag Value           Notes
"HawkularMetrics"  "GroupTrigger"      Required to be processed as a Metrics Group Trigger
"DataIds"          DataId1..DataIdN    The DataIds used in the trigger conditions, required
DataId1            tag query           Metrics tag query for DataId1 metrics
…                  …                   …
DataIdN            tag query           Metrics tag query for DataIdN metrics
"SourceBy"         TagName1..TagNameN  See SourceBy below

SourceBy

The SourceBy tag specifies the metric tag names used to determine the member trigger population. As an example, consider an MGT with one CompareCondition: HeapUsed > 80% HeapMax. We want to compare HeapUsed to HeapMax but of course each comparison should be done on metrics from the same machine. We don’t want to compare all HeapUsed to all HeapMax for every machine in the data center. We would tag our MGT something like this:

"HawkularMetrics" : "GroupTrigger"
"DataIds"         : "HeapUsed,HeapMax"
"HeapUsed"        : name = "HeapUsed"
"HeapMax"         : name = "HeapMax"
"SourceBy"        : Machine

This would result in one member trigger per machine, for machine names common to HeapUsed and HeapMax metrics. If we had the following metrics in the database, tagged as specified:

|machine0|HeapUsed    {name=HeapUsed, Machine=machine0}
|machine0|HeapMax     {name=HeapMax , Machine=machine0}
|machine1|HeapUsed    {name=HeapUsed, Machine=machine1}
|machine1|HeapMax     {name=HeapMax , Machine=machine1}
|machine2|HeapUsed    {name=HeapUsed}
|machine2|HeapMax     {name=HeapMax}
|machine3|HeapUsed    {name=HeapUsed, Machine=machine3}
|machine4|HeapMax     {name=HeapMax , Machine=machine4}

It would result in two member triggers, one each testing:

|machine0|HeapUsed > 80% |machine0|HeapMax
|machine1|HeapUsed > 80% |machine1|HeapMax

We would not have a 3rd member trigger because machine2 metrics don’t have the necessary tags, machine3 does not have the HeapMax metric, and machine4 does not have the HeapUsed metric.
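The selection described above can be modelled in a few lines. This is only a sketch of the behaviour as documented here, not Hawkular's implementation: `select` is a hypothetical stand-in for a simple `name = ...` tag query, and the model assumes at most one metric per dataId per source.

```python
from collections import defaultdict

# Metric name -> tags, copied from the example above.
metrics = {
    "|machine0|HeapUsed": {"name": "HeapUsed", "Machine": "machine0"},
    "|machine0|HeapMax": {"name": "HeapMax", "Machine": "machine0"},
    "|machine1|HeapUsed": {"name": "HeapUsed", "Machine": "machine1"},
    "|machine1|HeapMax": {"name": "HeapMax", "Machine": "machine1"},
    "|machine2|HeapUsed": {"name": "HeapUsed"},
    "|machine2|HeapMax": {"name": "HeapMax"},
    "|machine3|HeapUsed": {"name": "HeapUsed", "Machine": "machine3"},
    "|machine4|HeapMax": {"name": "HeapMax", "Machine": "machine4"},
}


def select(metrics, name):
    """Hypothetical stand-in for a tag query of the form 'name = <name>'."""
    return {m: t for m, t in metrics.items() if t.get("name") == name}


def member_triggers(metrics, data_id_names, source_by):
    """Map each complete source key to {dataId: metric name}."""
    by_source = defaultdict(dict)
    for data_id, name in data_id_names.items():
        for metric, tags in select(metrics, name).items():
            # Metrics lacking any SourceBy tag cannot be related to a source.
            if all(tag in tags for tag in source_by):
                key = tuple(tags[tag] for tag in source_by)
                by_source[key][data_id] = metric
    # Keep only sources that supply a metric for every dataId.
    return {k: v for k, v in by_source.items() if len(v) == len(data_id_names)}


members = member_triggers(
    metrics, {"HeapUsed": "HeapUsed", "HeapMax": "HeapMax"}, ["Machine"]
)
for source, mapping in sorted(members.items()):
    print(source, mapping)
```

Only machine0 and machine1 survive the final filter, matching the two member triggers listed above.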

SourceBy is required, but can be set to '*' to generate member triggers for all value combinations (not recommended when multiple dataIds are involved).
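To see why '*' is discouraged with multiple dataIds, consider the example metrics again. The sketch below assumes that "all value combinations" means every metric matched for one dataId is paired with every metric matched for the other. That reading is an interpretation of the sentence above, not confirmed behaviour.

```python
from itertools import product

# Metrics matched by each dataId's tag query in the earlier example.
heap_used = ["|machine0|HeapUsed", "|machine1|HeapUsed",
             "|machine2|HeapUsed", "|machine3|HeapUsed"]
heap_max = ["|machine0|HeapMax", "|machine1|HeapMax",
            "|machine2|HeapMax", "|machine4|HeapMax"]

# Assumption: SourceBy "*" pairs every HeapUsed metric with every HeapMax metric.
combinations = list(product(heap_used, heap_max))

# Only pairs whose metric names share a machine prefix compare like with like.
same_machine = [(u, m) for u, m in combinations
                if u.split("|")[1] == m.split("|")[1]]

print(len(combinations), "combinations,", len(same_machine), "from the same machine")
```

Under that assumption most of the generated member triggers would compare metrics from unrelated machines.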

Metric Definition Tagging

Although not technically required, effective use of MGTs generally needs some level of tagging on the metric definitions.

DataId Tag Query Tags

Each MGT declares one or more DataIds in its conditions. In a member trigger, each of those DataIds is replaced by an actual metric name. The set of metrics applied to a DataId is determined by the tag query defined for that DataId in its MGT tag. In many cases the convention will be a generic 'name' tag, but the query can be as complex as needed, within what the hMetrics tag query language allows.

SourceBy Tags

When an MGT uses multiple DataIds the metric names replacing them will typically be related, or from the same source. To allow that to happen the metric definitions need to be tagged so that the source information is available. For example, a member trigger likely wants to be applied to data from the same machine. If we have N machines in M datacenters, we would need each metric tagged with datacenter=<someDataCenter> and machine=<someMachine>. Then we can declare SourceBy: datacenter,machine.
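A small sketch of why both tags matter when machine names can repeat across datacenters. The helper `source_key` and the tag values are hypothetical, and the code only models the idea of a composite source. It does not reflect how Hawkular computes it internally.

```python
def source_key(tags, source_by):
    """Return the tuple of SourceBy tag values, or None if any tag is missing."""
    names = [n.strip() for n in source_by.split(",")]
    if any(n not in tags for n in names):
        return None
    return tuple(tags[n] for n in names)


# Hypothetical metric definition tags; the values are made up for illustration.
used_dc1 = {"name": "HeapUsed", "datacenter": "dc1", "machine": "m1"}
max_dc1 = {"name": "HeapMax", "datacenter": "dc1", "machine": "m1"}
max_dc2 = {"name": "HeapMax", "datacenter": "dc2", "machine": "m1"}

# Same datacenter and machine: the two metrics belong to one source.
print(source_key(used_dc1, "datacenter,machine") == source_key(max_dc1, "datacenter,machine"))

# Same machine name in another datacenter: a different source.
print(source_key(used_dc1, "datacenter,machine") == source_key(max_dc2, "datacenter,machine"))

# Sourcing by machine alone cannot tell the two datacenters apart.
print(source_key(used_dc1, "machine") == source_key(max_dc2, "machine"))
```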

Configuration

These system properties can be defined to affect the MGT handling.

| System Property | Default | Description |
|---|---|---|
| hawkular-metrics.alerter.mgt.pool-size | 20 | Thread pool size for handling MGT refresh. Each MGT has its own thread. |
| hawkular-metrics.alerter.mgt.job-period-seconds | 3600 (one hour) | The number of seconds between MGT refreshes. Each refresh determines the current metrics via the tag queries, and then updates the member trigger set accordingly. |
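System properties on a JVM are conventionally passed as `-Dname=value` arguments. The sketch below renders the two properties in that form. The values shown are arbitrary examples, and how the arguments reach the server (startup script, container environment, and so on) depends on the deployment and is not covered here.

```python
# Example values only; the property names are the ones documented above.
mgt_properties = {
    "hawkular-metrics.alerter.mgt.pool-size": 40,
    "hawkular-metrics.alerter.mgt.job-period-seconds": 600,
}


def as_jvm_args(properties):
    """Render properties in the standard JVM -Dname=value form."""
    return [f"-D{name}={value}" for name, value in properties.items()]


print(" ".join(as_jvm_args(mgt_properties)))
```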

Security

TODO

© 2016 | Hawkular is released under Apache License v2.0