
#SIMOORG - HIGH LEVEL DESIGN

This document describes the high level design of Simoorg, LinkedIn's failure inducing framework. The rationale behind developing Simoorg is to have a simple yet powerful and extensible failure inducing framework. Simoorg is written in Python, LinkedIn's lingua franca for solving operational challenges.

##Key points

Key points of Simoorg are:

  • Scheduling and inducing failures against your application cluster of choice with a mechanism to revert and bring the cluster back to a clean state.
  • Providing customizable failure and revert scenarios shipped out of the box and ready to be plugged in and used immediately.
  • Comprehensive logging to help SREs and developers get valuable insights into how their application of choice reacts to failures.
  • Support for heterogeneous infrastructure through flexible execution handlers. New execution handlers are easy to plug in with minimal effort.

##From a bird's eye view

Simoorg’s main job is to induce and revert failures against a service of your choice. The failures are induced based on the scheduler plugin type you wish to use. Simoorg comes with a non-deterministic scheduler configured, which generates failures at random times. Although the failures are generated at random times, you can still set a few limits such as the total run duration and the minimum/maximum gap between failures. Each failure is followed by a revert, ensuring that the cluster we operate against is brought back to a clean state. Simoorg logs important metrics such as the failure name, the impact and the time of the impact to help SREs and developers reason about the fault tolerance of their application of choice.

Simoorg comes with a predefined set of failures available right out of the box, such as graceful application bounce, ungraceful application shutdown, disk IO overuse and network latency. Additional failure definitions can be added by anyone who needs customized failure scenarios. To make sure Simoorg does not kill the entire cluster, constraints can be set to confine the damage within certain limits. Once those limits are reached, no additional failures are inflicted on the cluster.

Simoorg is designed to work in heterogeneous environments by leveraging pluggable handlers. Handlers make it possible to induce failures against a wide range of operating systems and platforms. At present the only supported handler is ssh. Additional handlers are fairly easy to add as the need arises.

##Design details

In the subsequent paragraphs we will cover important components and talk about how they interact with each other. Here is the list of integral parts of Simoorg:

  • Moirai
  • Atropos
  • Scheduler
  • Handler
  • Journal
  • Logger
  • HealthCheck
  • Topology
  • Api Server

###Moirai

Moirai is a single threaded process that monitors and manages individual Atropos instances using standard UNIX IPC mechanisms and Python queues. It also provides entry points for the Api Server to retrieve information about the various services being tested. Moirai takes the configs directory path as an input argument and bootstraps the framework by reading the configuration files in that directory. The configs directory contains:

  • Fate Books
  • Api configuration
  • Topology configuration
  • Handler Configuration

The structure of the config directory is discussed later on in this document. Once Moirai has read its configuration, it spawns an individual observer called Atropos for each service defined in the configs. Each Atropos instance acts as an observer for the service it is responsible for and dispatches actions following a plan generated by the Scheduler.

Each Atropos instance can communicate specific information back to Moirai with the help of two multiprocessing queues: a service information queue and an event queue. The service information queue captures information about Atropos such as the service name, the topology of that service and the plan which the current Atropos will be following. The event queue captures the status of each failure and revert action performed by Atropos. The high level design is reflected in the following diagram (High level Design).
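For illustration only, the parent/child wiring could look roughly like the sketch below. The queue message formats and function names are assumptions made for this example, not Simoorg's actual classes.

# A minimal sketch of the Moirai/Atropos queue wiring; the message
# formats used here are assumptions, not Simoorg's exact schema.
from multiprocessing import Process, Queue

def atropos_worker(service_name, service_info_queue, event_queue):
    # Announce this observer's service, topology and plan to Moirai.
    service_info_queue.put({'service': service_name,
                            'topology': ['node1', 'node2'],
                            'plan': [(1460000000, 'graceful_bounce')]})
    # Report the outcome of each failure/revert action.
    event_queue.put(('graceful_bounce', 1460000000, 'node1', True))

if __name__ == '__main__':
    service_info_queue, event_queue = Queue(), Queue()
    # Moirai spawns one Atropos process per service defined in the configs.
    worker = Process(target=atropos_worker,
                     args=('kafka', service_info_queue, event_queue))
    worker.start()
    print(service_info_queue.get())   # service name, topology and plan
    print(event_queue.get())          # status of a failure/revert event
    worker.join()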

###Atropos

Upon initialization, each Atropos instance reads one Fate Book ([link](/docs/configs.rst)) and, depending on the destiny defined in the Fate Book, sleeps until the requirements are met. Once the requirements are met, Atropos induces a random failure, waits for the specified interval and performs a revert to bring the cluster back to a clean state. There are two requirements to be met before inducing a failure:

  • The dispatch time has arrived.
  • The number of currently impacted cluster nodes is not higher than the defined impact limit (more on impact limits in the Journal section).

Each Atropos instance has its own instance of a Scheduler, which is in charge of generating a failure plan and keeping track of time. Different schedulers can be configured; however, Simoorg currently ships only with a non-deterministic scheduler.

Apart from the Scheduler, each Atropos instance has its own instance of a Handler, Logger and Journal. The high level diagram reflecting Atropos and its components is as follows (Atropos Components).
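As a rough illustration of that flow, the dispatch loop below sketches how these pieces fit together. The helper objects and method names (scheduler, journal, handler, healthcheck) are invented for this example and do not reflect Simoorg's exact internal API.

# A simplified sketch of the Atropos dispatch loop; all helper objects
# and method names are illustrative assumptions.
import time

def run_plan(scheduler, journal, handler, healthcheck=None):
    for dispatch_time, failure in scheduler.get_plan():
        # Requirement 1: wait until the dispatch time has arrived.
        time.sleep(max(0, dispatch_time - time.time()))
        # Requirement 2: stay within the configured impact limit.
        if not journal.can_induce():
            continue
        # Optional healthcheck gate before inducing the failure.
        if healthcheck and not healthcheck.is_healthy():
            continue
        node = handler.induce(failure)       # induce the failure on a node
        journal.record(node)
        time.sleep(failure['wait_seconds'])  # let the failure take effect
        handler.revert(failure, node)        # bring the node back to a clean state
        journal.clear(node)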

###Scheduler

A Scheduler generates a failure plan and keeps track of time. Currently Simoorg ships only with a non-deterministic scheduler, which randomly generates dispatch times and associates them with random failures. We refer to this sequence of timestamps and failures internally as a Plan. Once generated, the Plan is passed to Atropos.
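A minimal sketch of how such a Plan could be generated is shown below. The constraint names (total_run_duration, min_gap, max_gap) and the plan format are assumptions for this example, not the exact Fate Book schema.

# A minimal sketch of non-deterministic plan generation; constraint names
# and the (timestamp, failure) format are assumptions.
import random
import time

def generate_plan(failures, total_run_duration, min_gap, max_gap):
    """Return a list of (dispatch_time, failure_name) pairs."""
    plan, now = [], time.time()
    offset = random.uniform(min_gap, max_gap)
    while offset < total_run_duration:
        plan.append((now + offset, random.choice(failures)))
        offset += random.uniform(min_gap, max_gap)
    return plan

print(generate_plan(['graceful_bounce', 'ungraceful_shutdown'],
                    total_run_duration=3600, min_gap=300, max_gap=900))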

###Handler

Each failure definition should have a handler associated with it. A Handler is referred to by its name within a failure definition and is responsible for inducing and reverting failures. The table below lists the supported handlers and the handlers planned to be available in the future:

| Handler name | Description | State | Plans |
| ------------ | ----------- | ----- | ----- |
| ssh | Runs actions over ssh using a shell script mechanism | supported | |
| salt | Runs actions over salt by leveraging salt modules or salt states | not supported | TBD |
| AWS | AWS API calls | not supported | TBD |
| Rackspace | Rackspace API | not supported | TBD |
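Conceptually, a handler only needs to know how to run a failure script and its revert counterpart on a target node. The sketch below illustrates the idea for the ssh case; the class and method names are assumptions, not the actual Simoorg handler API.

# An illustrative ssh-style handler; class and method names are assumptions.
import subprocess

class SshHandler(object):
    """Runs failure/revert scripts on a remote node over ssh."""

    def __init__(self, user):
        self.user = user

    def execute(self, node, script, args=None):
        cmd = ['ssh', '{0}@{1}'.format(self.user, node), script] + (args or [])
        # A zero exit status means the action succeeded.
        return subprocess.call(cmd) == 0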

###Journal

Each Observer has a separate Journal instance. The Journal is responsible for:

  • Keeping track of the internal state of Atropos such as: current impact and impact limits
  • Persisting the current state of Atropos to support session resumption
  • Resuming state after a crash
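A rough sketch of this bookkeeping is shown below, assuming a plain JSON file as the persistence layer; the real Journal format and method names may differ.

# A rough sketch of Journal-style bookkeeping; the JSON layout and
# method names are assumptions.
import json
import os

class Journal(object):
    def __init__(self, path, impact_limit):
        self.path, self.impact_limit = path, impact_limit
        self.impacted = set()
        if os.path.exists(path):               # resume state after a crash
            with open(path) as fh:
                self.impacted = set(json.load(fh))

    def can_induce(self):
        return len(self.impacted) < self.impact_limit

    def record(self, node):
        self.impacted.add(node)
        self._persist()

    def clear(self, node):
        self.impacted.discard(node)
        self._persist()

    def _persist(self):
        with open(self.path, 'w') as fh:
            json.dump(sorted(self.impacted), fh)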

###Logger

Each Atropos has a separate Logger instance. The Logger is used to log and store arbitrary messages emitted at various points of Plan execution.

###HealthCheck

Healthcheck is an optional component that allows you to control the damage inflicted against your service. If enabled, Atropos kicks off the healthcheck logic defined in the Fate Book before inducing a failure. The Healthcheck component needs to return success in order for the failure to run; otherwise the Scheduler skips the current failure cycle. This ensures that we are not aggravating any existing issues and lets the cluster fire self-healing routines and recover. If a healthcheck is not defined, failures are induced as scheduled, on the assumption that the cluster was able to recover.

The default implementation of the Healthcheck component allows any executable containing the check logic to be plugged in. A sample healthcheck would look like this:

#!/bin/bash
# CHECK_SUCCESSFUL stands in for your own health check command or condition

if [ CHECK_SUCCESSFUL ]
then
   exit 0
else
   exit 1
fi

The best practice is to leverage your current monitoring system to identify the health of the system.
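Because the healthcheck contract is simply the exit status of an executable, the caller only has to run the script and inspect the return code. A minimal sketch of that call, using a hypothetical script path, is:

# A minimal sketch of invoking a healthcheck executable; the script path
# is a hypothetical placeholder.
import subprocess

def is_healthy(script_path, args=None):
    """Return True when the healthcheck exits with status 0."""
    return subprocess.call([script_path] + (args or [])) == 0

# Example: is_healthy('/usr/local/bin/my_healthcheck.sh')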

We also ship a simple Kafka HealthCheck out of the box. This plugin considers a cluster to be healthy if the under-replicated partition count is zero for all the nodes in the cluster. The plugin also depends on the Kafka topology config file to get information about the cluster.

###Topology

The Topology component is responsible for identifying and keeping the list of nodes that constitute your service. In most cases this is just the list of servers present in your cluster. The Topology component is also responsible for choosing a random node from the list and handing it over to Atropos. We ship static topology and Kafka topology plugins with our source code.

A static topology is a list of nodes defined in YAML format as follows:

#
# Static topology definition
#
topology:
  nodes: ['node1', 'node2', 'node3', 'node4']

Another example is the Kafka topology, a custom Topology component that was developed to support dynamic Kafka node elections. The Topology plugin for Kafka returns the list of the different types of brokers on which failures can be induced. The plugin selects random brokers based on the Kafka host resolution specification in the config file. Each host resolution specification should belong to one of the following types:

  • CONTROLLER - Where the node is the controller of the kafka cluster
  • RANDOM_BROKER - Where the node is a broker for a specific topic and a specific partition (if you skip the topic or partition it randomly selects a topic or partition)
  • RANDOM_LEADER - Where the node is a leader for a random topic and a random partition
  • LEADER - Where the node is a leader for a specific topic and a specific partition (if you skip the partition it randomly selects a partition)
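The general topology contract is small: load the node list and hand a randomly chosen node over to Atropos. The sketch below illustrates it for the static topology case; the class and method names are assumptions rather than Simoorg's exact plugin API.

# An illustrative static topology plugin; class and method names are assumptions.
import random
import yaml   # PyYAML

class StaticTopology(object):
    def __init__(self, config_path):
        with open(config_path) as fh:
            self.nodes = yaml.safe_load(fh)['topology']['nodes']

    def get_random_node(self):
        """Pick a random node from the cluster node list."""
        return random.choice(self.nodes)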

###Api Server

Simoorg provides a simple API interface based on Flask. The API server communicates with the Moirai process through Linux FIFOs, so the Api Server must be started on the same host as the Moirai process. The API endpoints currently supported by our systems are:

  • GET /list - Gives the list of services against which Simoorg is running (i.e. the list of service-name values provided in the fate books)
  • GET /servers - Gives the list of components (for example servers) that constitute a service (Output of topology plugin for that service)
  • GET /plan - Gives the plan currently followed for a specific service (Output of scheduler plugin for that service)
  • GET /events - Gives the status of each failure and revert event. Each event consists of a list containing the following elements: failure_name, trigger_time, node_name and trigger_status. For failure events, failure_name is the same as the name of the failure listed in the fate book, while for revert events it is the name of the failure + '-revert'. trigger_status is a boolean value representing whether the event was a success.
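For illustration, the endpoints can be queried with any HTTP client. The sketch below uses the Python requests library; the host, port and the way the target service is passed (here a query parameter) are assumptions made for this example.

# An illustrative API client; host, port and parameter names are assumptions.
import requests

base = 'http://localhost:8000'   # wherever the Flask app is running
services = requests.get(base + '/list').json()
for service in services:
    plan = requests.get(base + '/plan', params={'service': service}).json()
    events = requests.get(base + '/events', params={'service': service}).json()
    print(service, len(plan), 'scheduled failures,', len(events), 'events so far')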

The API service currently only supports retrieving data about currently running instances of Simoorg and does not provide historical data (i.e. if the original Simoorg instance crashes, we can no longer fetch any data from the API).