
Mitigating Alarm Storms in GroundWork Monitor


July 15, 2020

Mitigating Alarm Storms using GroundWork Monitor

GroundWork Monitor offers Parent Child configurations for distributed monitoring, enabling the monitoring of a subset of an infrastructure where Child servers report state and performance metrics to a central, or “Parent”, GroundWork server.

This post is not about the Parent Child architecture configuration, but about the other kind of Parent Child: the relationships and inherent dependencies that can be configured to control the behavior of hosts and services based on the status of one or more other hosts and services.

Alarm Storms

Getting too many alerts can desensitize system administrators and cause issues to go unnoticed, even though alerts for the event were received. Too many alerts can also increase your time to resolve issues, because they do not clearly identify the point of failure.

Two of the most common examples of causes for alarm storms are:

  • Loss of communication to a network device. Whether due to a firewall rule change or similar network failure, this type of event can cause every monitored host and service to lose network communication and in turn generate alerts for the faux failed state.
  • Host or application failure. A database is a great example of this scenario. Basic monitoring for an Oracle database will include checks for overall database availability, size of tablespace, locked objects, and the percentage of max processes in use. The underlying operating system will generally also have basic checks in place for disk, CPU, memory, uptime, and ping. In this scenario, if the server were to power off, we could receive nine alerts for a single system event, with no clear indication of what exactly the problem is.

Alarm storms such as these can lead systems administrators to become complacent about ticket floods, and can result in delayed response times.

Cutting Through the Noise

The scenarios which cause alarm storms can be easily mitigated by implementing GroundWork Monitor features which are easy to configure and maintain:

  • Parent/Child: For physical dependencies such as a network device to a group of hosts.
  • Host Dependencies: For logical dependencies such as an authentication server which relies on an external database on a different host to properly function.
  • Service Dependencies: For services whose failure causes other services to fail as well.

Using Parent/Child relationships, you can suppress alarms for all child hosts in the directive. Should the parent system fail, all child hosts of that parent will be set to an UNREACHABLE state, and their services to an UNKNOWN state.

With host and service dependencies configured, you can choose to disable notifications and/or service checks should a master service or host fail.

Oracle Database Scenario

For example, using our Oracle database scenario, we can define the ping check as the master service and the rest of the operating system and application checks as dependents. By doing so, should the host go down, we will receive a single, actionable alert that the host failed to ping, and be able to begin working the issue rather than the tickets.
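The effect of this arrangement on alert volume can be sketched in a few lines of Python. This is only an illustrative model, not GroundWork's implementation; the check names and the `alerts_for_outage` helper are hypothetical, chosen to mirror the nine checks listed earlier.

```python
# Hypothetical check names for the Oracle host described above.
MASTER = "ping"
DEPENDENTS = [
    "db_availability", "tablespace_size", "locked_objects", "max_processes",
    "disk", "cpu", "memory", "uptime",
]


def alerts_for_outage(failed, master=None, dependents=()):
    """Return the checks that would notify, given the set of failed checks.

    If `master` is among the failed checks, notifications from its
    dependents are suppressed; otherwise every failed check notifies.
    """
    suppressed = set(dependents) if master in failed else set()
    return sorted(check for check in failed if check not in suppressed)


# The host powers off, so every check fails at once.
all_failed = {MASTER, *DEPENDENTS}

print(len(alerts_for_outage(all_failed)))                 # 9 (no dependencies)
print(alerts_for_outage(all_failed, MASTER, DEPENDENTS))  # ['ping']
```

Note that when only a dependent fails (for example, the disk check on a host that still pings), the alert is not suppressed, which is the behavior you want.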

Parent/Child Relationships Configuration

  1. In GroundWork Monitor, navigate to Configuration > Nagios Monitoring > Hosts.
  2. Expand the Parent Child option and click New.
  3. Select the Parent host from the drop-down list.
  4. Select the Child hosts from the right side, and click the Add button.
  5. Click Save, then Commit the configuration change (Configuration > Nagios Monitoring > Control > Commit).

For example, in this image we are configuring the database servers as children, and the network device which provides connectivity as the parent. Should the network device (Parent) fail, the database servers (Children) will go into an UNREACHABLE state, and the services to an UNKNOWN state, suppressing pointless alarms and checks until the parent device has recovered from failure.
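As a rough illustration of the logic described above, the following Python sketch classifies hosts as UP, DOWN, or UNREACHABLE from a parent map. It is a simplified model that assumes one parent per host and no cycles; the host names and the `classify_hosts` function are hypothetical and are not part of GroundWork Monitor.

```python
def classify_hosts(parents, failed):
    """Classify each host as UP, DOWN, or UNREACHABLE.

    parents: dict mapping host -> parent host (None for a top-level host).
    failed:  set of hosts whose own check is currently failing.
    A failing host is UNREACHABLE when its parent is not UP (it is hidden
    behind the outage); otherwise it is DOWN. Assumes no cycles.
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host not in failed:
            states[host] = "UP"
        else:
            parent = parents.get(host)
            if parent is not None and state_of(parent) != "UP":
                states[host] = "UNREACHABLE"
            else:
                states[host] = "DOWN"
        return states[host]

    for host in parents:
        state_of(host)
    return states


# Hypothetical topology: two database servers behind one network device.
parents = {"core-switch": None, "oradb01": "core-switch", "oradb02": "core-switch"}

# The switch fails, so checks against everything behind it fail too.
states = classify_hosts(parents, failed={"core-switch", "oradb01", "oradb02"})
print(states)

# Only hosts that are genuinely DOWN need to raise an alarm.
print([host for host, state in states.items() if state == "DOWN"])
```

In this model only the network device is reported as DOWN, so it is the single point an administrator needs to investigate.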


Host Dependency Configuration

  1. Navigate to Configuration > Nagios Monitoring > Hosts.
  2. Expand the Host Dependencies option and click New.
  3. On this page we can configure:
    • The Dependent host.
    • The Master host.
    • Whether or not to Inherit dependencies from the master. This allows you to create dependencies upon dependencies.
    • Execution failure criteria (Up, Down, Unreachable, Pending, None). For each state selected, while the master host is in that state, checks will not execute at all for the dependent hosts until the master host is in a state which is not selected.
    • Notification failure criteria (Up, Down, Unreachable, Pending, None). For each state selected, while the master host is in that state, notifications will not be processed until the master host is in a state which is not selected.
  4. Once you are satisfied with your selections, click Add, then Commit your changes.

The image below is an example within the GroundWork UI of a monitored SAP HANA instance that is connected via a site-to-site VPN. If the VPN goes down, we also lose connectivity to the SAP HANA instance. By using this configuration, we will not get false-positive alerts for connectivity issues caused by the VPN connection, and can more quickly identify that the VPN connection is the point of failure.
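The way the two failure criteria behave can be modeled with a small Python sketch. This is an illustration of the rule described in the steps above, not GroundWork code; the `HostDependency` class and the host names `vpn-gateway` and `sap-hana-01` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HostDependency:
    """Illustrative model of a host dependency and its failure criteria."""
    master: str
    dependent: str
    # Master states in which checks of the dependent host are not executed.
    execution_failure: set = field(default_factory=set)
    # Master states in which notifications for the dependent are not sent.
    notification_failure: set = field(default_factory=set)

    def checks_suppressed(self, master_state):
        return master_state in self.execution_failure

    def notifications_suppressed(self, master_state):
        return master_state in self.notification_failure


# Hypothetical host names for the VPN and SAP HANA example.
dep = HostDependency(
    master="vpn-gateway",
    dependent="sap-hana-01",
    execution_failure={"DOWN", "UNREACHABLE"},
    notification_failure={"DOWN", "UNREACHABLE"},
)

for master_state in ("UP", "DOWN"):
    print(master_state,
          dep.checks_suppressed(master_state),
          dep.notifications_suppressed(master_state))
```

Selecting a state for notifications only, and leaving execution criteria empty, keeps the dependent checks running (so status stays current) while still silencing the alerts.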

Service Dependency Configuration

  1. Navigate to Configuration > Nagios Monitoring > Services.
  2. Expand the Service Dependencies option and click New.
  3. On this page we can configure:
    • The Service dependency template name.
    • The master Service name.
    • Execution failure criteria (OK, Warning, Critical, Unknown, None). For each state selected, while the master service is in that state, checks will not execute at all for the dependent services until the master service is in a state which is not selected.
    • Notification failure criteria (OK, Warning, Critical, Unknown, None). For each state selected, while the master service is in that state, notifications will not be processed until the master service is in a state which is not selected.
  4. Once you are satisfied with your selections, click Add, then Commit your changes. This will create a template that can be applied to dependent services.

    Below is an example of a configuration for a master service of hdbdaemon, a service check that monitors the SAP HANA system to ensure that the process named hdbdaemon is running correctly on the host. This process controls all of the other processes the database needs in order to function, so when it is not running, the processes it controls are likely to fail along with it.

    Now that the template is created for the master service, it can be applied to an existing service to make it dependent on the master:
  5. Navigate to Configuration > Nagios Monitoring > Services.
  6. Expand Services and click the service that you want to make dependent.
  7. Click the Service Dependencies tab.
  8. In the Dependency drop-down, select your new dependency template name.
  9. For Master service host, select the host that runs the master service. This will usually be the same host, but you can also set the master service to be on a different host if you need to (usually for distributed or HA applications).
  10. Click Add Dependency, and Commit your changes.

    For this SAP HANA system specifically, we created a dependency on the master service hdbdaemon for the dependent services hdbindexserver, hdbnameserver, hdbpreprocessor, hdbwebdispatcher, and hdbxsengine. This way, when the daemon goes down, we get a single alert for the actual issue, instead of six alerts from a single event.
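For readers who think in terms of Nagios object definitions, the sketch below renders what the equivalent `servicedependency` objects could look like for these five services. It assumes standard Nagios Core object syntax; whether GroundWork writes exactly these definitions on commit is an assumption, and the host name `sap-hana-01`, the `service_dependency` helper, and the chosen criteria (`c,u` for Critical and Unknown) are hypothetical.

```python
def service_dependency(host, master, dependent,
                       exec_criteria="c,u", notif_criteria="c,u"):
    """Render one Nagios-style servicedependency object as text.

    Criteria use Nagios state letters: o=OK, w=Warning, c=Critical,
    u=Unknown, n=None.
    """
    lines = [
        "define servicedependency {",
        f"    host_name                      {host}",
        f"    service_description            {master}",
        f"    dependent_host_name            {host}",
        f"    dependent_service_description  {dependent}",
        f"    execution_failure_criteria     {exec_criteria}",
        f"    notification_failure_criteria  {notif_criteria}",
        "}",
    ]
    return "\n".join(lines)


# The five dependent services named in the SAP HANA example.
DEPENDENTS = ["hdbindexserver", "hdbnameserver", "hdbpreprocessor",
              "hdbwebdispatcher", "hdbxsengine"]

config = "\n\n".join(
    service_dependency("sap-hana-01", "hdbdaemon", dependent)
    for dependent in DEPENDENTS
)
print(config)
```

Generating the text this way is only meant to make the relationship explicit; in GroundWork Monitor the template and the Service Dependencies tab handle the same thing through the UI.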

In Summary

Parent/Child relationships and Dependencies are very easy to configure, and they allow IT administrators to better identify actual causes of failure and respond in a more timely manner.

For more information on these features, visit GroundWork Support.

GroundWork Open Source

www.gwos.com
