
Alerting Backend


Summary:

The alerting backend is responsible for deploying StatefulSet(s) of Alertmanager clusters, henceforth called Alerting Cluster(s). The Alerting Cluster(s) handle the routing and dispatching aspects of Opni Alerting, but are NOT responsible for making observations on data.

In more detail, the Alerting Cluster(s) handle:

  • Rate limiting alerts
  • Aggregating alerts
  • De-duplicating alerts
  • Dispatching alerts, on a best effort basis
  • Managing a subset of Alert State(s): Firing, Silenced


Architecture:

[Architecture diagram: Alerting Backend]

Description

The Opni controller manages the deployment of the Alerting Backend.

Controller Logic

The reconciler reads the core.opni.io Gateway CRD in order to propagate changes to the Alerting Cluster.

The alerting controller reconciler uses the Gateway CRD's alerting spec to manage the Alerting Cluster (a sketch of such a spec follows this list):

  • a flag determining whether to deploy the Alerting Cluster or not
  • how to set up the routing config map & its contents
  • how to scale the Alerting Cluster via:
    • a replica count flag
    • Kubernetes resource limits applied to each instance in the Alerting Cluster
    • a subset of underlying AlertManager cluster configurations useful for scaling the performance of the cluster up and down, such as cluster-gossip-timeout
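
A minimal sketch of what such an alerting spec could look like in Go; the field names here are illustrative assumptions and do not reproduce the actual core.opni.io schema.

```go
// Hypothetical shape of the alerting options the reconciler reads from the
// Gateway CRD. Field names are placeholders, not the real CRD schema.
package alerting

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type AlertingSpec struct {
	// Enabled toggles whether the Alerting Cluster is deployed at all.
	Enabled bool `json:"enabled,omitempty"`
	// Replicas controls how many Alertmanager instances the cluster runs.
	Replicas *int32 `json:"replicas,omitempty"`
	// Resources are the Kubernetes resource limits applied to each instance.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
	// ClusterGossipTimeout is one of the underlying Alertmanager cluster
	// settings surfaced for tuning cluster performance up and down.
	ClusterGossipTimeout metav1.Duration `json:"clusterGossipTimeout,omitempty"`
}
```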

Alerting Cluster Components

The alerting kubernetes controller is responsible for deploying & managing:

  • A worker Alertmanager cluster, as a StatefulSet
  • A controller Alertmanager cluster, whose members are explicit cluster leader instances that the worker cluster instances must join
  • PVCs & volumes that persist:
    • the AlertManager instances' stateful information (nflogs, silences, ...)
    • the AlertManager configurations

Each active Opni Alerting deployment contains at least one controller cluster AlertManager instance (the minimal standalone case), while an HA deployment contains a variable number of both controller cluster instances and worker cluster instances.
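
As an illustration of the components above, the following Go sketch builds a worker StatefulSet with a volume claim template for the stateful data; all names, the image tag, and paths are placeholders rather than the controller's real output.

```go
// A minimal sketch of a worker StatefulSet, assuming placeholder names.
package alerting

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func workerStatefulSet(replicas int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "opni-alerting-worker"} // placeholder label
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "opni-alerting-worker"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "opni-alerting-worker", // headless service for stable pod DNS
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "alertmanager",
						Image: "prom/alertmanager:v0.25.0", // placeholder image
						Args: []string{
							"--config.file=/etc/alertmanager/alertmanager.yaml",
							"--storage.path=/var/lib/alertmanager", // nflog & silences live here
						},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "data",
							MountPath: "/var/lib/alertmanager",
						}},
					}},
				},
			},
			// The PVC template is what persists each instance's stateful
			// information across pod restarts (storage request omitted for brevity).
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				},
			}},
		},
	}
}
```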

The Alertmanager instances deployed by Opni Alerting's controller are modified to contain an embedded server that runs in the same process as the base AlertManager instance. This embedded server is responsible for injecting default webhooks for use by AlertManager.
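
The sketch below illustrates the general idea of an in-process webhook target that generated default receivers could point at; the port, path, and payload handling are assumptions, not Opni's actual endpoints.

```go
// Rough illustration of an embedded webhook server running in the same
// process as Alertmanager. Port and path are placeholders.
package alerting

import (
	"encoding/json"
	"log"
	"net/http"
)

// webhookPayload mirrors the parts of Alertmanager's webhook notification
// body that this sketch cares about.
type webhookPayload struct {
	Status string `json:"status"`
	Alerts []struct {
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

func runEmbeddedServer() {
	mux := http.NewServeMux()
	// A default webhook target that the generated Alertmanager config can route to.
	mux.HandleFunc("/default-webhook", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("received %d %s alert(s)", len(p.Alerts), p.Status)
	})
	// Run alongside the Alertmanager goroutines in the same process.
	go func() {
		if err := http.ListenAndServe("127.0.0.1:3000", mux); err != nil {
			log.Fatal(err)
		}
	}()
}
```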

A syncer server also accompanies each AlertManager instance, ensuring that updates pushed from the gateway reach the AlertManager instance both as a persisted configuration file and in its runtime configuration.
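
A condensed sketch of what one sync step could look like: persist the pushed configuration, then ask the local Alertmanager to reload it via its /-/reload endpoint. The file path and local address are placeholders.

```go
// Sketch of a single syncer step, assuming placeholder paths and ports.
package alerting

import (
	"fmt"
	"net/http"
	"os"
)

func applyConfig(raw []byte) error {
	// Persist the configuration so it survives pod restarts (the file lives
	// on the StatefulSet's volume).
	if err := os.WriteFile("/etc/alertmanager/alertmanager.yaml", raw, 0o644); err != nil {
		return err
	}
	// Alertmanager re-reads its configuration file when POSTed to /-/reload,
	// which updates the runtime configuration as well.
	resp, err := http.Post("http://127.0.0.1:9093/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}
```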

Cluster Driver Dataflow

WIP

Responsibilities

  • dynamically deploy & scale the Alerting backend in response to API requests

Restrictions & Limitations

Uses Alertmanager as a backend:

  • The runtime state is stateful data, so the cluster must be deployed using StatefulSets
  • Runtime configuration is read statically from an AlertManager configuration file, which must be mounted into each pod (a sketch of the mount follows this list)
  • Optimization of cluster performance is mostly tied to this static configuration
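
To illustrate the mount constraint above, this sketch projects the routing configuration into a pod as a ConfigMap volume; the object names and mount path are placeholders.

```go
// Sketch of mounting the routing config map into an Alertmanager pod,
// assuming placeholder names.
package alerting

import corev1 "k8s.io/api/core/v1"

func routingConfigVolume() (corev1.Volume, corev1.VolumeMount) {
	vol := corev1.Volume{
		Name: "alertmanager-config",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "opni-alerting-routing"},
			},
		},
	}
	// Mounted read-only at the directory passed to Alertmanager via --config.file.
	mount := corev1.VolumeMount{
		Name:      "alertmanager-config",
		MountPath: "/etc/alertmanager",
		ReadOnly:  true,
	}
	return vol, mount
}
```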

Scale and performance:

A description of how the system will be scaled and the expected performance characteristics, including any performance metrics that will be used to measure success.

Security:

A description of the security considerations for the system

High availability:

A description of how the system will be designed for high availability, including any redundancy or failover mechanisms that will be implemented.

  • The deployment can be configured for HA -- the controller and worker instances join into a single AlertManager cluster
  • A controller cluster whose pods explicitly advertise a join address is a redundancy mechanism that allows the worker set to scale more heavily
  • Memberlist configurations for high availability & scalability are exposed to the end user via the ops Server API (a sketch of the flag mapping follows this list)
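
The sketch below shows one plausible way the exposed memberlist settings and the controller join addresses could be translated into Alertmanager's --cluster.* flags; the service naming convention and values are assumptions.

```go
// Sketch of worker-side cluster flags, assuming a headless controller
// service and placeholder settings.
package alerting

import "fmt"

type gossipSettings struct {
	GossipInterval   string // e.g. "200ms"
	PushPullInterval string // e.g. "1m"
	SettleTimeout    string // e.g. "10s"
}

func clusterArgs(controllerSvc string, controllerReplicas int, g gossipSettings) []string {
	args := []string{
		"--cluster.listen-address=0.0.0.0:9094",
		fmt.Sprintf("--cluster.gossip-interval=%s", g.GossipInterval),
		fmt.Sprintf("--cluster.pushpull-interval=%s", g.PushPullInterval),
		fmt.Sprintf("--cluster.settle-timeout=%s", g.SettleTimeout),
	}
	// Each controller pod advertises a stable DNS name through its headless
	// service; workers join by listing every controller instance as a peer.
	for i := 0; i < controllerReplicas; i++ {
		peer := fmt.Sprintf("%s-%d.%s:9094", controllerSvc, i, controllerSvc)
		args = append(args, "--cluster.peer="+peer)
	}
	return args
}
```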

Testing:

Testplan

  • Kubernetes driver is tested manually
  • Local driver is tested via integration tests (covers everything but the kubernetes deployment)