Red Hat

Hawkular Alerting for Developers

Hawkular Alerting Developer Guide

Introduction

Hawkular Alerting is a component of the Hawkular management and monitoring project. Its goal is to provide flexible and scalable alerting services in an easily consumable way.

The Hawkular Alerting project lives on GitHub.

Alerting Philosophy

Alerting is useful, necessary, and typically an integral part of operational sanity. Done well, it strikes a balance between human intervention and automation. Done poorly, it is an ineffective nuisance. Hawkular Alerting tries to provide the tools to do things well, but it can just as easily be abused. Alerts bring attention to a problem, or to a developing problem. That problem typically requires human intervention to resolve. As far as possible, an alert should represent high-level symptoms that affect the user experience. The number of generated alerts should be small because a human can only respond to a few situations daily. The same alert should generally not be repeated, or should be repeated only if a response has not been initiated.

Alerts

Alerts are generated when an Alert Trigger fires, based on a set of defined conditions that have been matched, possibly more than once, or that have held true over a period of time. When fired, the trigger can perform actions, typically but not limited to notifications (e-mail, SMS, etc.). Alerts then start moving through the Open, Acknowledged, Resolved lifecycle. There are many options on triggers to help ensure that alerts are not generated too frequently, including ways of automatically disabling and enabling the trigger.

Events

As discussed above, the number of alerts should be small in order to be manageable. But it can be useful to capture interesting happenings in the monitored world. This is called an Event in Hawkular Alerting. An event can be roughly thought of as an alert without a lifecycle. Like alerts, an event can be generated by a trigger, but unlike an alert it can also be injected directly via the API, so it is very easy for clients to insert events as desired. And although trigger-generated events can define actions to be performed, in general an event does not need human intervention. Instead, it is typically something that can contribute to an alert firing, help investigate an alert, or simply help understand system behavior.

It is expected that the number of Events can be very large compared to the number of Alerts. Events, like Alerts, can be flexibly queried.

Triggers

A Trigger defines the conditions that, when satisfied, cause the trigger to fire. Triggers can have one or more conditions and can fire when ANY or ALL of the conditions are met. A firing trigger can generate an Alert or an Event.

Conditions

There are several different kinds of conditions, but they all have one thing in common: each requires some piece of data against which the condition is evaluated. Here are the different kinds of conditions:

  • Threshold

    • X < 10, X >= 20

  • ThresholdRange

    • X inside [10,20), X outside [100,200]

  • Compare

    • X < 80% Y

  • String

    • X starts with "ABC", X matches "A.*B"

  • Rate

    • X > 10 per-minute

  • Availability

    • X is DOWN

  • Event

    • event.id starts 'IDXYZ', event.tag.category == 'Server', event.tag.from ends '.com'

  • Missing

    • No event/data received for X in the last 5 minutes

Most conditions deal with numeric data, but String and Availability data are also supported. A trigger can combine conditions dealing with data of different types and from different sources.
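To illustrate how a few of these condition kinds reduce to simple predicates over incoming data, here is a minimal Java sketch. The class and method names (ConditionSketch, lessThan, insideHalfOpen, lessThanRatioOf) are hypothetical and are not part of the Hawkular Alerting API; the sketch only mirrors the Threshold, ThresholdRange and Compare examples from the list above.

```java
import java.util.function.DoublePredicate;

/** Hypothetical helper; none of these names come from the Hawkular Alerting API. */
final class ConditionSketch {

    /** Threshold, e.g. "X < 10". */
    static DoublePredicate lessThan(double threshold) {
        return x -> x < threshold;
    }

    /** ThresholdRange, e.g. "X inside [10,20)": low inclusive, high exclusive. */
    static DoublePredicate insideHalfOpen(double low, double high) {
        return x -> x >= low && x < high;
    }

    /** Compare, e.g. "X < 80% Y" with ratio = 0.8. */
    static boolean lessThanRatioOf(double x, double ratio, double y) {
        return x < ratio * y;
    }

    public static void main(String[] args) {
        System.out.println("9.5 < 10: " + lessThan(10).test(9.5));
        System.out.println("20 inside [10,20): " + insideHalfOpen(10, 20).test(20));
        System.out.println("70 < 80% of 100: " + lessThanRatioOf(70, 0.8, 100));
    }
}
```

Each predicate needs only the datum (or pair of data) it is evaluated against, which is the property all condition kinds share.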

Actions

The whole purpose of alerting is to be able to immediately respond to a developing or active problem. Hawkular Alerting provides several plugins to take action when alerts are generated. Custom action plugins can be defined as well. The list of provided action plugins keeps growing. Here is a sample:

  • E-mail notification

  • SMS notification

  • SNMP notification

  • PagerDuty integration

  • AeroGear integration

  • File-system notification

  • Webhook notification

Trigger Dampening

It’s often the case that you don’t want a trigger to fire every time a condition set is met. Instead, you want to ensure that the issue is not a spike of activity, or that you don’t flood an on-call engineer with alerts. Hawkular Alerting provides several ways of ensuring triggers fire only as desired. We call this "Trigger Dampening". An example is useful for understanding dampening.

Let’s say we have a trigger with a single condition: responseTime > 1s.

It is important to understand how the reporting interval plays into alerting, and into dampening. Assume responseTime is reported every 15s. That means we get roughly 4 data points every minute, and therefore evaluate the condition around 4 times a minute.

Here are the different trigger dampening types:

Strict

  • N consecutive true evaluations

  • Useful for ignoring spikes in activity or waiting for a prolonged event

In our example this could be, "Fire the trigger only if responseTime > 1s for 6 consecutive evaluations". So, given a 15s reporting interval this means response time would likely have been high for about 90s. But note that if the reporting interval changes the firing time will change. This is used more when the number of evaluations is more important than the time it takes to fire.

Note that the default dampening for triggers is Strict(1), which means that by default a trigger fires every time its condition set evaluates to true.
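The counting behind Strict dampening can be sketched in a few lines of Java. This is an illustrative sketch only, not the Hawkular Alerting implementation: the class name StrictDampeningSketch is hypothetical, and resetting the counter after firing is an assumption made for the example.

```java
/** Hypothetical sketch of Strict(N) dampening; not the Hawkular Alerting implementation. */
final class StrictDampeningSketch {
    private final int required;  // N consecutive true evaluations
    private int consecutiveTrue; // length of the current run of true evaluations

    StrictDampeningSketch(int required) {
        if (required < 1) {
            throw new IllegalArgumentException("required must be >= 1");
        }
        this.required = required;
    }

    /** Feeds one condition-set evaluation; returns true when the trigger should fire. */
    boolean evaluate(boolean conditionSetTrue) {
        if (!conditionSetTrue) {
            consecutiveTrue = 0; // any false evaluation breaks the run
            return false;
        }
        consecutiveTrue++;
        if (consecutiveTrue >= required) {
            consecutiveTrue = 0; // assumption: start counting afresh after firing
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        StrictDampeningSketch dampening = new StrictDampeningSketch(6);
        for (int i = 1; i <= 6; i++) {
            System.out.println("evaluation " + i + " fires: " + dampening.evaluate(true));
        }
    }
}
```

With required set to 1 the sketch fires on every true evaluation, which matches the default Strict(1) behavior described above.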

Relaxed Count

  • N true evaluations out of M total evaluations

  • Useful for ignoring short spikes in activity but catching frequently spiking activity

In our example this could be, "Fire the trigger only if responseTime > 1s for 4 of 8 evaluations". This means the trigger will fire if roughly half the time we are exceeding a 1s response time. Given a 15s reporting interval this means the trigger could fire in 1 to 2 minutes of accumulated evaluations. But note that if the reporting interval changes the firing time will change. This is used more when the number of evaluations is more important than the time it takes to fire.
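One way to picture Relaxed Count is as two counters, one for true evaluations and one for total evaluations. The Java sketch below is an assumption-laden illustration, not the Hawkular Alerting implementation: the class name RelaxedCountSketch is hypothetical, and the choice to reset as soon as N true evaluations can no longer be reached within M evaluations is an assumption of this example.

```java
/** Hypothetical sketch of Relaxed Count (N of M) dampening; not the Hawkular Alerting implementation. */
final class RelaxedCountSketch {
    private final int requiredTrue; // N
    private final int totalEvals;   // M
    private int trueCount;
    private int evalCount;

    RelaxedCountSketch(int requiredTrue, int totalEvals) {
        if (requiredTrue < 1 || totalEvals < requiredTrue) {
            throw new IllegalArgumentException("need 1 <= N <= M");
        }
        this.requiredTrue = requiredTrue;
        this.totalEvals = totalEvals;
    }

    /** Feeds one condition-set evaluation; returns true when the trigger should fire. */
    boolean evaluate(boolean conditionSetTrue) {
        evalCount++;
        if (conditionSetTrue) {
            trueCount++;
        }
        if (trueCount >= requiredTrue) {
            reset();
            return true;
        }
        int stillNeeded = requiredTrue - trueCount;
        int chancesLeft = totalEvals - evalCount;
        if (stillNeeded > chancesLeft) {
            reset(); // assumption: N can no longer be reached within M, so start over
        }
        return false;
    }

    private void reset() {
        trueCount = 0;
        evalCount = 0;
    }

    public static void main(String[] args) {
        RelaxedCountSketch dampening = new RelaxedCountSketch(4, 8);
        boolean[] evaluations = {true, false, true, false, true, false, true};
        for (boolean e : evaluations) {
            System.out.println(e + " -> fires: " + dampening.evaluate(e));
        }
    }
}
```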

Relaxed Time

  • N true evaluations in T time

  • Useful for ignoring short spikes in activity but catching frequently spiking activity

In our example this could be, "Fire the trigger only if responseTime > 1s 4 times in 5 minutes". This means the trigger will fire if we exceed 1s response time multiple times in a 5 minute period. Given a 15s reporting interval this means the trigger could fire in 1 to 5 minutes of accumulated evaluations. But note that if the reporting interval changes the firing time will change. And also note that the trigger will never fire if we don’t receive at least 4 reports in the specified 5 minute period. This is used when you don’t want to exceed a certain period of time before firing.
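Relaxed Time can be pictured as a counter tied to a time window. The Java sketch below takes the evaluation timestamp as a parameter so that it stays deterministic. It is an illustration under stated assumptions, not the Hawkular Alerting implementation: the class name RelaxedTimeSketch is hypothetical, and opening the window at the first true evaluation is an assumption of this example.

```java
/** Hypothetical sketch of Relaxed Time (N true evaluations in T time); not the Hawkular Alerting implementation. */
final class RelaxedTimeSketch {
    private final int requiredTrue;  // N
    private final long periodMillis; // T
    private boolean windowOpen;
    private long windowStart;
    private int trueCount;

    RelaxedTimeSketch(int requiredTrue, long periodMillis) {
        this.requiredTrue = requiredTrue;
        this.periodMillis = periodMillis;
    }

    /** Feeds one evaluation with its timestamp; returns true when the trigger should fire. */
    boolean evaluate(boolean conditionSetTrue, long nowMillis) {
        if (windowOpen && nowMillis - windowStart > periodMillis) {
            reset(); // the window expired before N true evaluations arrived
        }
        if (!conditionSetTrue) {
            return false;
        }
        if (!windowOpen) {
            windowOpen = true; // assumption: the first true evaluation opens the window
            windowStart = nowMillis;
        }
        trueCount++;
        if (trueCount >= requiredTrue) {
            reset();
            return true;
        }
        return false;
    }

    private void reset() {
        windowOpen = false;
        trueCount = 0;
    }

    public static void main(String[] args) {
        RelaxedTimeSketch dampening = new RelaxedTimeSketch(4, 5 * 60_000L);
        for (long t = 0; t <= 45_000L; t += 15_000L) {
            System.out.println("t=" + t + "ms fires: " + dampening.evaluate(true, t));
        }
    }
}
```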

Strict Time

  • Only true evaluations for at least T time

  • Useful for reporting a continued aberration

In our example this could be, "Fire the trigger only if responseTime > 1s for at least 5 minutes". This means the trigger will fire if we exceed 1s response time on every report for a 5 minute period. Given a 15s reporting interval this means the trigger will fire after roughly 20 consecutive true evaluations. Note that if the reporting interval changes the firing time will remain roughly the same. It is important to understand that at least 2 evaluations are required. The first true evaluation starts the clock. Any false evaluation stops the clock. Assuming only true evaluations, the trigger fires on the first true evaluation at or after the specified period. The shorter the reporting interval the closer the firing time will be to the specified period, T.
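The clock behavior described for Strict Time can be sketched in Java as follows. Timestamps are passed in explicitly to keep the example deterministic. The class name StrictTimeSketch is hypothetical and this is not the Hawkular Alerting implementation; it only follows the rules stated above (the first true evaluation starts the clock, any false evaluation stops it, and the trigger fires on the first true evaluation at or after T).

```java
/** Hypothetical sketch of Strict Time dampening; not the Hawkular Alerting implementation. */
final class StrictTimeSketch {
    private final long periodMillis; // T
    private boolean clockRunning;
    private long clockStart;

    StrictTimeSketch(long periodMillis) {
        this.periodMillis = periodMillis;
    }

    /** Feeds one evaluation with its timestamp; returns true when the trigger should fire. */
    boolean evaluate(boolean conditionSetTrue, long nowMillis) {
        if (!conditionSetTrue) {
            clockRunning = false; // any false evaluation stops the clock
            return false;
        }
        if (!clockRunning) {
            clockRunning = true;  // the first true evaluation starts the clock
            clockStart = nowMillis;
            return false;         // so at least two evaluations are needed to fire
        }
        if (nowMillis - clockStart >= periodMillis) {
            clockRunning = false;
            return true;          // first true evaluation at or after T
        }
        return false;
    }

    public static void main(String[] args) {
        StrictTimeSketch dampening = new StrictTimeSketch(5 * 60_000L);
        for (long t = 0; t <= 300_000L; t += 15_000L) {
            if (dampening.evaluate(true, t)) {
                System.out.println("fires at t=" + t + "ms");
            }
        }
    }
}
```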

Strict Timeout

  • Only true evaluations for T time

  • Useful for reporting a continued aberration with a more guaranteed firing time

In our example this could be, "Fire the trigger only if responseTime > 1s for 5 minutes". This means the trigger will fire if we exceed 1s response time on every report for a 5 minute period. Given a 15s reporting interval this means the trigger will fire after roughly 20 consecutive true evaluations. Note that if the reporting interval changes the firing time will remain the same. It is important to understand that only 1 evaluation is required. The first true evaluation starts the clock. Assuming only true evaluations, the trigger fires at T, when a timer expires and fires the trigger. Any false evaluation stops the clock and cancels the timer. This type of dampening has more processing overhead because the trigger evaluation requires an external timer.

AutoDisable

A trigger can be set for AutoDisable. Whereas dampening can limit the firing rate of a trigger, disabling a trigger completely stops the trigger from firing (or being evaluated). A trigger can be manually enabled and disabled, via the REST API, but it can also be disabled automatically. If the trigger has the autoDisable option set to true then after it fires it is disabled, preventing any subsequent alerts until it is manually re-enabled. The default is false.

AutoEnable

A trigger can be set for AutoEnable. If AutoEnable is true then when an alert is resolved, and if all alerts for the trigger are then resolved, the trigger will be enabled if it is currently disabled. This ensures that the trigger will again go into firing mode, without needing to be manually enabled by the user. The default is false.
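The interplay of the two options can be sketched as a small state holder in Java. All names here (AutoToggleSketch, fire, resolveOne) are hypothetical and chosen for illustration; the sketch only models the enabled flag and the count of unresolved alerts, not the real trigger.

```java
/** Hypothetical sketch of autoDisable / autoEnable; not the Hawkular Alerting implementation. */
final class AutoToggleSketch {
    private final boolean autoDisable;
    private final boolean autoEnable;
    private boolean enabled = true;
    private int unresolvedAlerts;

    AutoToggleSketch(boolean autoDisable, boolean autoEnable) {
        this.autoDisable = autoDisable;
        this.autoEnable = autoEnable;
    }

    /** Called when the condition set is satisfied; returns true if an alert is generated. */
    boolean fire() {
        if (!enabled) {
            return false; // a disabled trigger does not fire
        }
        unresolvedAlerts++;
        if (autoDisable) {
            enabled = false; // stays disabled until re-enabled
        }
        return true;
    }

    /** Called when one of the trigger's alerts is resolved. */
    void resolveOne() {
        if (unresolvedAlerts > 0) {
            unresolvedAlerts--;
        }
        if (autoEnable && unresolvedAlerts == 0) {
            enabled = true; // all alerts resolved, so the trigger is enabled again
        }
    }

    boolean isEnabled() {
        return enabled;
    }

    public static void main(String[] args) {
        AutoToggleSketch trigger = new AutoToggleSketch(true, true);
        System.out.println("first fire: " + trigger.fire());
        System.out.println("second fire: " + trigger.fire());
        trigger.resolveOne();
        System.out.println("enabled after resolve: " + trigger.isEnabled());
    }
}
```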

Source

By default both Triggers and Data ignore "source". This means that the dataIds defined on a trigger’s conditions are matched against the dataIds on incoming data (within a tenant) and matching data is evaluated against the conditions. It is possible to qualify triggers and data with a "source" such that a trigger only evaluates data having the same source.

This mechanism is used automatically by Data-Driven Group Triggers but can be used manually as well. If you find that data is better described using a combination source+id, as opposed to just id, then this approach may be appropriate.

Group Triggers

It’s often the case that the same alerting needs to be applied to all instances of the same thing. For example, it may be useful to alert on "System Load > 80%" on 50 different CPUs. It can be cumbersome to manage 50 individual triggers.

A Group Trigger allows you to define a single trigger and then apply it to a group of logically similar things. A group trigger could be used in the example above. Then, a member could be added for each CPU. The member triggers are basically managed copies of the group trigger. Changes at the group level are pushed down to the members. So, to change "80%" to "85%", or to change autoDisable from false to true, only the group trigger must be changed.

Managing DataIds

The group trigger is basically a template; it is not deployed. Only the member triggers are deployed and actively evaluated because only the member triggers are associated with real dataIds on the conditions. The group trigger uses "tokens" for the dataIds and each member, when defined, must provide a map of dataId token replacements.

Using the example above, our group trigger would define a condition using a dataId token, like:

{ type: "threshold",
  dataId: "SystemLoad",
  operator: "GT",
  threshold: "80.0"
}

When adding a member for a specific CPU, say CPU-1, we’d map the token to the real dataId, something like:

dataIdMap: {
  "SystemLoad":"CPU-1_SystemLoad"
}

Where "CPU-1_SystemLoad" reflects the actual id associated with system load data sent to alerts for CPU-1.
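The token replacement itself amounts to a map lookup per condition. The following Java sketch shows the idea under stated assumptions: the class and method names (MemberDataIdSketch, resolve) are hypothetical, and failing on a missing mapping is a design choice of this example rather than documented Hawkular Alerting behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of dataId token replacement for a member trigger. */
final class MemberDataIdSketch {

    /** Replaces each group-level dataId token with the member's real dataId. */
    static List<String> resolve(List<String> groupTokens, Map<String, String> dataIdMap) {
        List<String> realDataIds = new ArrayList<>();
        for (String token : groupTokens) {
            String dataId = dataIdMap.get(token);
            if (dataId == null) {
                // assumption: every token used by the group conditions must be mapped
                throw new IllegalArgumentException("No mapping for dataId token: " + token);
            }
            realDataIds.add(dataId);
        }
        return realDataIds;
    }

    public static void main(String[] args) {
        System.out.println(resolve(
                List.of("SystemLoad"),
                Map.of("SystemLoad", "CPU-1_SystemLoad")));
    }
}
```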

When updating conditions at the group level it is necessary to supply dataId mappings for all of the existing members because the dataIds may have changed on the new condition set.

Orphans

There are times when a particular group member may need to be managed individually. For example, if a single CPU is of particular concern it may be useful to change the threshold level on just that member. It is possible to orphan a member trigger and manage it independently, while maintaining its association with the group trigger. It can be unorphaned at any time, and reset to the group settings.

Data-Driven Group Triggers

Group triggers allow a common definition to be applied to logically similar members. For example, a group trigger could be defined for alerting on CPU SystemLoad and a member trigger would be added for every CPU, each a copy of the group trigger but working against the proper dataId(s) given the CPU instance. When a member is added, a map from the group’s [token] dataIds to the member’s [real] dataIds must be provided. And if updating conditions at the group level, a map for each existing member must be provided. This makes sense, and is fine, but the maps can be tedious or difficult to supply.

It’s not uncommon for the member-level dataIds to be a concatenation of the id of the source member (e.g. a resourceId, CPU-1, etc.) and the group-level dataId token (SystemLoad). So you end up with member-level ids like 'CPU-1_SystemLoad' where the "source" is 'CPU-1' and the dataId is 'SystemLoad'.

Data-Driven Group Triggers are able to add member triggers to a group automatically, one for each "source" of the same data. In other words, for a group trigger on CPU SystemLoad, add a member automatically for each source CPU reporting the 'SystemLoad' metric. By reporting data as a combination of source and dataId this should be possible. So, instead of reporting:

Data(id:cpu-1-Load, value:123)

We’d want:

Data(source:cpu-1, id:Load, value:123)

This would then relieve the client from having to add member triggers up front and instead assume that the group will grow as needed, based on the incoming data.

Because dataIds are often defined upstream it is not always possible to supply Hawkular Alerting with data such that the source and id are separated. But if possible this is a powerful approach.
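The idea of growing the group from incoming data can be sketched as a map keyed by source. This Java example is illustrative only: the class name DataDrivenGroupSketch, the method onData and the "member-" id prefix are all hypothetical, and a real member trigger is of course more than a string id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of data-driven member creation; not the Hawkular Alerting implementation. */
final class DataDrivenGroupSketch {
    private final String groupDataId; // e.g. "Load"
    private final Map<String, String> memberIdBySource = new LinkedHashMap<>();

    DataDrivenGroupSketch(String groupDataId) {
        this.groupDataId = groupDataId;
    }

    /**
     * Returns the id of the member trigger that handles the datum, creating the member
     * the first time a source is seen. Returns null if the dataId is not used by the group.
     */
    String onData(String source, String dataId) {
        if (!groupDataId.equals(dataId)) {
            return null;
        }
        return memberIdBySource.computeIfAbsent(source, s -> "member-" + s);
    }

    int memberCount() {
        return memberIdBySource.size();
    }

    public static void main(String[] args) {
        DataDrivenGroupSketch group = new DataDrivenGroupSketch("Load");
        System.out.println(group.onData("cpu-1", "Load"));
        System.out.println(group.onData("cpu-2", "Load"));
        System.out.println("members: " + group.memberCount());
    }
}
```

The client never registers cpu-1 or cpu-2 up front; the members appear as data for each source arrives.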

Behavioral Notes

A couple of notes about data-driven group triggers:

  • Each member trigger is associated with a single source and only considers data from that source.

    • True for single and multi-condition triggers.

  • Condition changes in the group trigger will remove all member triggers.

    • The members will then again be created as the data demands.

  • The Source mechanism can also be used with manually managed triggers, if desired.

Alert Lifecycle

Hawkular Alerting can integrate with other systems to handle Alert Lifecycle, but alerts can also be managed directly within the tool. Hawkular Alerting supports a typical move through a simple lifecycle. An alert starts in OPEN status, optionally moves to ACKNOWLEDGED to indicate the alert has been seen and the issue is being resolved, and is finally set to RESOLVED to indicate the problem has been fixed.

AutoResolve

Triggers require firing conditions and always start in Firing mode. But the trigger can optionally supply autoResolve conditions. If autoResolve=true then after a trigger fires it switches to AutoResolve mode. In AutoResolve mode the trigger no longer looks for problem conditions, but instead looks for evidence that the problem is resolved. A simple example would be a trigger that has a firing condition of Availability DOWN, and an autoResolve condition of Availability UP. This mechanism ensures that only one alert is generated for a problem, and that when the problem has been resolved, the trigger automatically returns to firing mode.

Moreover, if autoResolveAlerts=true then when the AutoResolve conditions are satisfied all of the trigger’s unresolved alerts will automatically be set to RESOLVED.

Like Firing mode, AutoResolve mode can optionally define its own dampening setting.
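The mode switch can be sketched as a two-state machine, using the Availability DOWN / UP example from above. The names (AutoResolveSketch, Mode, onAvailability) are hypothetical and dampening is left out, so this is an illustration of the idea rather than the Hawkular Alerting implementation.

```java
/** Hypothetical sketch of Firing / AutoResolve mode switching; not the Hawkular Alerting implementation. */
final class AutoResolveSketch {
    enum Mode { FIRING, AUTORESOLVE }

    private final boolean autoResolve;
    private Mode mode = Mode.FIRING; // triggers always start in Firing mode

    AutoResolveSketch(boolean autoResolve) {
        this.autoResolve = autoResolve;
    }

    /** Feeds one availability report; returns true if an alert is generated. */
    boolean onAvailability(boolean up) {
        if (mode == Mode.FIRING) {
            if (up) {
                return false;            // firing condition (DOWN) not met
            }
            if (autoResolve) {
                mode = Mode.AUTORESOLVE; // stop looking for the problem, look for the fix
            }
            return true;
        }
        if (up) {
            mode = Mode.FIRING;          // autoResolve condition (UP) met
        }
        return false;                    // no alerts are generated in AutoResolve mode
    }

    Mode mode() {
        return mode;
    }

    public static void main(String[] args) {
        AutoResolveSketch trigger = new AutoResolveSketch(true);
        System.out.println("DOWN -> alert: " + trigger.onAvailability(false));
        System.out.println("DOWN -> alert: " + trigger.onAvailability(false));
        System.out.println("UP   -> alert: " + trigger.onAvailability(true));
        System.out.println("mode: " + trigger.mode());
    }
}
```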

Tags

Tags can have a variety of uses but are commonly used to assist in search. Tags are free-form name-value pairs and can be applied to:

  • Triggers

  • Alerts

  • Events

Tags on triggers are automatically passed on to the Alerts or Events generated by that trigger. This allows the same search criteria used to fetch triggers to also be used to fetch the alerts or events generated by those triggers.

A tag’s name and value must both be non-empty. But tag search allows for matching just the name by specifying value='*' in the search criteria.
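The matching rule for a single search criterion can be sketched in Java as follows. The class and method names (TagMatchSketch, matches) are hypothetical; the real search is performed through the REST API, and this sketch only illustrates the name-only wildcard.

```java
import java.util.Map;

/** Hypothetical sketch of tag matching with the '*' value wildcard. */
final class TagMatchSketch {

    /** Returns true if the tags contain the name and the value matches, where "*" matches any value. */
    static boolean matches(Map<String, String> tags, String name, String value) {
        if (!tags.containsKey(name)) {
            return false;
        }
        return "*".equals(value) || tags.get(name).equals(value);
    }

    public static void main(String[] args) {
        Map<String, String> tags = Map.of("category", "Server");
        System.out.println(matches(tags, "category", "*"));
        System.out.println(matches(tags, "category", "Database"));
    }
}
```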

REST API

Hawkular Alerting supports a robust REST API for managing Triggers, Alerts and Events. For more on how to generate API documentation, see the README.adoc at Hawkular-Alerts @ GitHub.

External Alert Integration

There are times when an external system will already be looking for and detecting potential issues in its environment. It is possible for these detection-only systems to leverage the power of Hawkular Alerting’s trigger and action infrastructure. For example, let’s say there is already a sensor in place looking for overheating situations. When it detects something overheating it can take some action. In this case we are not sending a stream of heat readings to alerting and having it evaluate against a threshold set on a trigger condition. Instead, the threshold and evaluation are all built into the sensor. To integrate with Hawkular Alerting we can use an "External Condition".

External Conditions

External integration begins with standard triggers. In this way we immediately get everything that triggers offer: actions, dampening, lifecycle, auto-resolve, etc. The difference is that instead of the typical condition types (Threshold, Availability, etc.), we can use an ExternalCondition. An external condition is like other conditions in that it has a 'dataId' with which it matches data sent into Hawkular Alerting. It also has 'systemId' and 'expression' fields. The systemId is used to identify the external system for which the condition is relevant. In our example, perhaps "HeatSensors". The expression field is used as needed. In our example it may not be needed or it could be a description like, "sensor detected high temperature". In other examples it could be used to store a complex expression that will be evaluated by the external system.

The main thing about external conditions is that they always evaluate to true. It is assumed that when a datum comes in with a dataId assigned to an external condition, the condition immediately evaluates to true. A trigger with a single external condition (and default dampening) would fire on every datum sent in for its condition. This is because it is assumed the external system already did the work of determining there was an issue.

Note that the string data sent in can have any value the external alerting system wants. In our example it may be a sensorId and temperature, like "Sensor 5368, temperature 212F".
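The "always true" rule can be sketched in Java as a condition that checks only the dataId and never inspects the value. The class ExternalConditionSketch and the dataId "heat-alarm" are hypothetical, made up for this example; only the three field names (systemId, dataId, expression) come from the description above.

```java
/** Hypothetical sketch of an external condition; not the Hawkular Alerting class. */
final class ExternalConditionSketch {
    final String systemId;
    final String dataId;
    final String expression;

    ExternalConditionSketch(String systemId, String dataId, String expression) {
        this.systemId = systemId;
        this.dataId = dataId;
        this.expression = expression;
    }

    /** Any datum carrying this condition's dataId satisfies it; the value is not inspected. */
    boolean match(String incomingDataId, String value) {
        return dataId.equals(incomingDataId);
    }

    public static void main(String[] args) {
        ExternalConditionSketch condition = new ExternalConditionSketch(
                "HeatSensors", "heat-alarm", "sensor detected high temperature");
        System.out.println(condition.match("heat-alarm", "Sensor 5368, temperature 212F"));
    }
}
```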

Actions Plugins

Plugins are responsible for executing actions when an alert, or possibly an event, happens.

Actions can be a notification task or a complex process.

Hawkular Alerting provides a plugin architecture to extend and add new behaviours.

Create a new plugin

We can add a new plugin to Hawkular in several steps:

  • Create a new project under hawkular-alerts-actions-plugins. You can use an existing one as a template, e.g. hawkular-alerts-actions-generic.

  • Add an implementation of the org.hawkular.alerts.actions.api.ActionPluginListener interface.

  • Add a plugin name to the implementation with the org.hawkular.alerts.actions.api.ActionPlugin annotation.

For example:

@ActionPlugin(name = "file")
public class FilePlugin implements ActionPluginListener {
    ...
}

ActionPluginListener interface

This interface has two responsibilities:

  • Define which properties and default values are supported by a plugin

...
    /**
     * The alerts engine registers the plugins available with their properties.
     * This method is invoked at plugin registration time.
     *
     * @return a list of properties available on this plugin
     */
    Set<String> getProperties();

    /**
     * The alerts engine registers the plugins available with their default values.
     * This method is invoked at plugin registration time.
     * Default values can be modified by the alerts engine.
     *
     *
     * @return a list of default values for properties available on this plugin
     */
    Map<String, String> getDefaultProperties();
...

  • Process an incoming action message wrapped as an org.hawkular.alerts.actions.api.ActionMessage

...
    /**
     * This method is invoked by the ActionService to process a new action generated by the engine.
     *
     * @param msg message received to be processed by the plugin
     * @throws Exception any problem
     */
    void process(ActionMessage msg) throws Exception;
...
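To show how the three methods fit together, here is a self-contained Java sketch of a plugin that simply records the messages it receives. So that it compiles on its own, it declares minimal local stand-ins for ActionPluginListener and ActionMessage; the real types live in org.hawkular.alerts.actions.api, and a real plugin would also carry the ActionPlugin annotation, which is omitted here. The class RecordingPlugin, the "path" property and its default value are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical plugin that records messages in memory instead of acting on them. */
final class RecordingPlugin implements ActionPluginListener {
    private final Map<String, String> defaults = new HashMap<>();
    final List<ActionMessage> received = new ArrayList<>();

    RecordingPlugin() {
        defaults.put("path", "/tmp/alerts"); // hypothetical property and default value
    }

    @Override
    public Set<String> getProperties() {
        return new HashSet<>(defaults.keySet());
    }

    @Override
    public Map<String, String> getDefaultProperties() {
        return new HashMap<>(defaults); // defensive copy
    }

    @Override
    public void process(ActionMessage msg) throws Exception {
        if (msg == null) {
            throw new IllegalArgumentException("msg must not be null");
        }
        received.add(msg);
    }

    public static void main(String[] args) throws Exception {
        RecordingPlugin plugin = new RecordingPlugin();
        plugin.process(new ActionMessage() { });
        System.out.println("properties: " + plugin.getProperties());
        System.out.println("messages received: " + plugin.received.size());
    }
}

// Local stand-ins so the sketch compiles on its own. They mirror only the method
// signatures quoted above; the real ActionMessage carries the action payload.
interface ActionMessage { }

interface ActionPluginListener {
    Set<String> getProperties();
    Map<String, String> getDefaultProperties();
    void process(ActionMessage msg) throws Exception;
}
```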

ActionMessage interface

This interface wraps the action sent by the engine, together with the effective properties the plugin should use to process it.

package org.hawkular.alerts.actions.api;

import java.util.Map;

import org.hawkular.alerts.api.model.action.Action;

import com.fasterxml.jackson.annotation.JsonInclude;

/**
 * A message sent to the plugin from the alerts engine
 * It has the event payload as well as action properties
 *
 * @author Lucas Ponce
 */
public interface ActionMessage {

    @JsonInclude
    Action getAction();
}

Instances of the class org.hawkular.alerts.api.model.action.Action are generated by the engine and carry the event detail as part of their payload.

/**
 * A base class for action representation from the perspective of the alerts engine.
 * An action is the abstract concept of a consequence of an event.
 * A Trigger definition can be linked with a list of actions.
 *
 * Alert engine only needs to know an action id and message/payload.
 * Action payload can optionally have an event as payload.
 *
 * Action plugins will be responsible to process the action according its own plugin configuration.
 *
 * @author Jay Shaughnessy
 * @author Lucas Ponce
 */
public class Action {

    @JsonInclude
    private String tenantId;

    @JsonInclude
    private String actionPlugin;

    @JsonInclude
    private String actionId;

    @JsonInclude(Include.NON_NULL)
    private String eventId;
...
}


© 2016 | Hawkular is released under Apache License v2.0