
Alerting Backend


Summary:

The alerting backend is responsible for deploying StatefulSet(s) of Alertmanager clusters, henceforth called Alerting Cluster(s). The Alerting Cluster(s) handle the routing and dispatching aspects of Opni Alerting, but are NOT responsible for making observations on data.

In more detail, the Alerting Cluster(s) handle:

  • Rate limiting alerts
  • Aggregating alerts
  • De-duplicating alerts
  • Dispatching alerts, on a best effort basis
  • Managing a subset of Alert State(s): Firing, Silenced


Architecture:

[Architecture diagram: Alerting Backend]

Description

The Opni controller manages the deployment of the Alerting Backend.

Controller Logic

The reconciler reads the core.opni.io Gateway CRD in order to propagate changes to the Alerting Cluster.

The alerting controller reconciler uses the Gateway CRD's alerting spec to manage the Alerting Cluster (a sketch of such a spec follows this list):

  • a flag determining whether to deploy the Alerting Cluster or not
  • how to set up the routing config map & its contents
  • how to scale the Alerting Cluster via:
    • a replica count flag
    • Kubernetes resource limits applied to each instance in the Alerting Cluster
    • a subset of underlying AlertManager cluster configurations useful for scaling the performance of the cluster up and down, such as cluster-gossip-timeout
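
A minimal sketch of what such an alerting spec could look like in Go; the field names here are illustrative assumptions and do not reproduce the actual core.opni.io schema.

```go
// Hypothetical shape of the alerting options the reconciler reads from the
// Gateway CRD. Field names are placeholders, not the real CRD schema.
package alerting

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type AlertingSpec struct {
	// Enabled toggles whether the Alerting Cluster is deployed at all.
	Enabled bool `json:"enabled,omitempty"`
	// Replicas controls how many Alertmanager instances the cluster runs.
	Replicas *int32 `json:"replicas,omitempty"`
	// Resources are the Kubernetes resource limits applied to each instance.
	Resources corev1.ResourceRequirements `json:"resources,omitempty"`
	// ClusterGossipTimeout is one of the underlying Alertmanager cluster
	// settings surfaced for tuning cluster performance up and down.
	ClusterGossipTimeout metav1.Duration `json:"clusterGossipTimeout,omitempty"`
}
```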

Alerting Cluster Components

The alerting kubernetes controller is responsible for deploying & managing:

  • A worker Alertmanager cluster, as a StatefulSet
  • A controller Alertmanager cluster, whose members are explicit cluster leader instances that the worker cluster instances must join
  • PVCs & volumes that persist:
    • the AlertManager instances' stateful information (nflogs, silences, ...)
    • the AlertManager configurations

Each active Opni Alerting deployment contains at least one controller cluster AlertManager instance (the minimal standalone case), while an HA deployment contains a variable number of both controller cluster instances and worker cluster instances.
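
As an illustration of the components above, the following Go sketch builds a worker StatefulSet with a volume claim template for the stateful data; all names, the image tag, and paths are placeholders rather than the controller's real output.

```go
// A minimal sketch of a worker StatefulSet, assuming placeholder names.
package alerting

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func workerStatefulSet(replicas int32) *appsv1.StatefulSet {
	labels := map[string]string{"app": "opni-alerting-worker"} // placeholder label
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "opni-alerting-worker"},
		Spec: appsv1.StatefulSetSpec{
			Replicas:    &replicas,
			ServiceName: "opni-alerting-worker", // headless service for stable pod DNS
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "alertmanager",
						Image: "prom/alertmanager:v0.25.0", // placeholder image
						Args: []string{
							"--config.file=/etc/alertmanager/alertmanager.yaml",
							"--storage.path=/var/lib/alertmanager", // nflog & silences live here
						},
						VolumeMounts: []corev1.VolumeMount{{
							Name:      "data",
							MountPath: "/var/lib/alertmanager",
						}},
					}},
				},
			},
			// The PVC template is what persists each instance's stateful
			// information across pod restarts (storage request omitted for brevity).
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "data"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				},
			}},
		},
	}
}
```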

The Alertmanager instances deployed by Opni Alerting's controller are modified to contain an embedded server that runs in the same process as the base AlertManager instance. This embedded server is responsible for injecting default webhooks for use by AlertManager.
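
The sketch below illustrates the general idea of an in-process webhook target that generated default receivers could point at; the port, path, and payload handling are assumptions, not Opni's actual endpoints.

```go
// Rough illustration of an embedded webhook server running in the same
// process as Alertmanager. Port and path are placeholders.
package alerting

import (
	"encoding/json"
	"log"
	"net/http"
)

// webhookPayload mirrors the parts of Alertmanager's webhook notification
// body that this sketch cares about.
type webhookPayload struct {
	Status string `json:"status"`
	Alerts []struct {
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

func runEmbeddedServer() {
	mux := http.NewServeMux()
	// A default webhook target that the generated Alertmanager config can route to.
	mux.HandleFunc("/default-webhook", func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		log.Printf("received %d %s alert(s)", len(p.Alerts), p.Status)
	})
	// Run alongside the Alertmanager goroutines in the same process.
	go func() {
		if err := http.ListenAndServe("127.0.0.1:3000", mux); err != nil {
			log.Fatal(err)
		}
	}()
}
```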

A syncer server also accompanies each AlertManager instance, ensuring that updates pushed from the gateway reach the AlertManager instance both as a persisted configuration file and in its runtime configuration.
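
A condensed sketch of what one sync step could look like: persist the pushed configuration, then ask the local Alertmanager to reload it via its /-/reload endpoint. The file path and local address are placeholders.

```go
// Sketch of a single syncer step, assuming placeholder paths and ports.
package alerting

import (
	"fmt"
	"net/http"
	"os"
)

func applyConfig(raw []byte) error {
	// Persist the configuration so it survives pod restarts (the file lives
	// on the StatefulSet's volume).
	if err := os.WriteFile("/etc/alertmanager/alertmanager.yaml", raw, 0o644); err != nil {
		return err
	}
	// Alertmanager re-reads its configuration file when POSTed to /-/reload,
	// which updates the runtime configuration as well.
	resp, err := http.Post("http://127.0.0.1:9093/-/reload", "", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("reload failed: %s", resp.Status)
	}
	return nil
}
```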

Cluster Driver Dataflow

WIP

Responsibilities

  • dynamically deploy & scale the Alerting backend in response to API requests

Restrictions & Limitations

Uses Alertmanager as a backend:

  • The runtime state is stateful data, so the cluster must be deployed using StatefulSets
  • Runtime configuration is read statically from an AlertManager configuration file, which must be mounted into each pod (a sketch of the mount follows this list)
  • Optimization of cluster performance is mostly tied to this static configuration
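
To illustrate the mount constraint above, this sketch projects the routing configuration into a pod as a ConfigMap volume; the object names and mount path are placeholders.

```go
// Sketch of mounting the routing config map into an Alertmanager pod,
// assuming placeholder names.
package alerting

import corev1 "k8s.io/api/core/v1"

func routingConfigVolume() (corev1.Volume, corev1.VolumeMount) {
	vol := corev1.Volume{
		Name: "alertmanager-config",
		VolumeSource: corev1.VolumeSource{
			ConfigMap: &corev1.ConfigMapVolumeSource{
				LocalObjectReference: corev1.LocalObjectReference{Name: "opni-alerting-routing"},
			},
		},
	}
	// Mounted read-only at the directory passed to Alertmanager via --config.file.
	mount := corev1.VolumeMount{
		Name:      "alertmanager-config",
		MountPath: "/etc/alertmanager",
		ReadOnly:  true,
	}
	return vol, mount
}
```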

Scale and performance:

A description of how the system will be scaled and the expected performance characteristics, including any performance metrics that will be used to measure success.

Security:

A description of the security considerations for the system

High availability:

A description of how the system will be designed for high availability, including any redundancy or failover mechanisms that will be implemented.

  • The deployment can be configured for HA -- the controller and worker instances join into a single AlertManager cluster
  • A controller cluster whose pods explicitly advertise a join address is a redundancy mechanism that allows the worker set to scale more heavily
  • Memberlist configurations for high availability & scalability are exposed to the end user via the ops Server API (a sketch of the flag mapping follows this list)
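
The sketch below shows one plausible way the exposed memberlist settings and the controller join addresses could be translated into Alertmanager's --cluster.* flags; the service naming convention and values are assumptions.

```go
// Sketch of worker-side cluster flags, assuming a headless controller
// service and placeholder settings.
package alerting

import "fmt"

type gossipSettings struct {
	GossipInterval   string // e.g. "200ms"
	PushPullInterval string // e.g. "1m"
	SettleTimeout    string // e.g. "10s"
}

func clusterArgs(controllerSvc string, controllerReplicas int, g gossipSettings) []string {
	args := []string{
		"--cluster.listen-address=0.0.0.0:9094",
		fmt.Sprintf("--cluster.gossip-interval=%s", g.GossipInterval),
		fmt.Sprintf("--cluster.pushpull-interval=%s", g.PushPullInterval),
		fmt.Sprintf("--cluster.settle-timeout=%s", g.SettleTimeout),
	}
	// Each controller pod advertises a stable DNS name through its headless
	// service; workers join by listing every controller instance as a peer.
	for i := 0; i < controllerReplicas; i++ {
		peer := fmt.Sprintf("%s-%d.%s:9094", controllerSvc, i, controllerSvc)
		args = append(args, "--cluster.peer="+peer)
	}
	return args
}
```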

Testing:

Testplan

  • Kubernetes driver is tested manually
  • Local driver is tested via integration tests (covers everything but the kubernetes deployment)