Merge pull request #4 from Quentin-M/func_testing
Introduce functional testing
Quentin-M authored Apr 6, 2018
2 parents f763533 + 2af8a70 commit 2f36dc5
Showing 7,460 changed files with 2,913 additions and 3,808,524 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
4 changes: 4 additions & 0 deletions .dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
.idea/
docs/
terraform/
vendor/
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
.idea/
.terraform
*.tfstate
*.tfstate.backup
*.tfvars
vendor/
35 changes: 32 additions & 3 deletions Dockerfile
@@ -1,16 +1,45 @@
FROM golang:1.9-alpine AS build-env

WORKDIR /go/src/github.com/quentin-m/etcd-cloud-operator
COPY . .
RUN go-wrapper install github.com/quentin-m/etcd-cloud-operator/cmd/operator

# Install & Cache dependencies
RUN apk add --no-cache git jq && \
go get github.com/Masterminds/glide && \
go get github.com/creack/yaml2json

RUN apk add --update openssl && \
wget https://github.com/coreos/etcd/releases/download/v3.3.0-rc.1/etcd-v3.3.0-rc.1-linux-amd64.tar.gz -O /tmp/etcd.tar.gz && \
wget https://github.com/coreos/etcd/releases/download/v3.3.3/etcd-v3.3.3-linux-amd64.tar.gz -O /tmp/etcd.tar.gz && \
mkdir /etcd && \
tar xzvf /tmp/etcd.tar.gz -C /etcd --strip-components=1 && \
rm /tmp/etcd.tar.gz

ADD glide.* ./
RUN glide install --strip-vendor && yaml2json < glide.lock | \
jq -r -c '.imports[], .testImports[] | {name: .name, subpackages: (.subpackages + [""])}' | \
jq -r -c '.name as $name | .subpackages[] | [$name, .] | join("/")' | sed 's|/$||' | \
while read pkg; do \
echo "${pkg}..."; \
go install ./vendor/${pkg} 2> /dev/null; \
done

# Fetch etcdctl
COPY . .
RUN go-wrapper install github.com/quentin-m/etcd-cloud-operator/cmd/operator
RUN go-wrapper install github.com/quentin-m/etcd-cloud-operator/cmd/tester

# Install ECO
COPY . .
RUN go-wrapper install github.com/quentin-m/etcd-cloud-operator/cmd/operator
RUN go-wrapper install github.com/quentin-m/etcd-cloud-operator/cmd/tester

# Copy ECO and etcdctl into an Alpine Linux container image.
FROM alpine

COPY --from=build-env /go/bin/operator /operator
COPY --from=build-env /go/bin/tester /tester
COPY --from=build-env /etcd/etcdctl /usr/local/bin/etcdctl

RUN apk add --no-cache ca-certificates

ENTRYPOINT ["/operator"]
CMD ["-config", "/etc/eco/eco.yaml"]
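In the exec form of `ENTRYPOINT`/`CMD`, each array element reaches the process as exactly one argv entry, so it matters whether a flag and its value are written as one element or two. The following is a minimal sketch, not the operator's actual code: `parseConfigFlag` is a hypothetical helper that assumes a single `-config` flag defined with Go's standard `flag` package.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseConfigFlag is a hypothetical helper: it parses args the way a binary
// defining a single -config flag with the standard flag package would.
func parseConfigFlag(args []string) (string, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text out of the example output
	path := fs.String("config", "", "Load configuration from the specified file.")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *path, nil
}

func main() {
	// Two array elements: the flag name and its value are separate tokens.
	path, err := parseConfigFlag([]string{"-config", "/etc/eco/eco.yaml"})
	fmt.Println(path, err)

	// One array element: the whole string is read as a flag name, which is
	// not defined, so parsing fails.
	_, err = parseConfigFlag([]string{"-config /etc/eco/eco.yaml"})
	fmt.Println(err != nil)
}
```

The `-config=/etc/eco/eco.yaml` spelling would also work as a single element, since the `flag` package splits on `=`.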
59 changes: 28 additions & 31 deletions README.md
@@ -1,36 +1,40 @@
# etcd-cloud-operator

Inspired by the [etcd-operator] designed for Kubernetes, the etcd-cloud-operator
manages etcd clusters deployed on cloud providers and helps human operators keep
the data store running safely, even in the event of availability-zone wide
failures.
Maintained by a former CoreOS engineer and inspired by the [etcd-operator]
designed for Kubernetes, the etcd-cloud-operator automatically bootstraps,
monitors, snapshots and recovers etcd clusters on cloud providers.

Used in place of the etcd binary and with minimal configuration, the operator
handles the configuration and lifecycle of etcd, based on data gathered from
handles the configuration and lifecycle of etcd, based on data gathered from
the cloud provider and the status of the etcd cluster itself.

The operator makes the assumption that it can trust the cloud provider's
auto-scaling group feature to provide accurate information regarding the number
of launched instances, and to automatically kill/re-provision crashed ones (e.g.
when a rack switch went down or simply when the service health check has been
failing for an extended amount of time).
In other words, the operator is meant to help human operators sleep
at night, while their mysterious etcd data store keeps running safely, even
in the event of process, instance, network, or even availability-zone wide
failures.

## Features

- *Failure recovery*: Upon failure of a minority of the etcd members, the
operator will automatically attempt to restart (rejoin if necessary) the member
it manages, thus recovering from the failure.
- *Disaster recovery*: In the event of a failure of the majority of the members,
resulting in the loss of quorum, the operator may try (if enabled) to seed a
new cluster from a backup, once the expected amount of instances are present
and the failed etcd cluster has been shot in the head (after forced backup of
its remaining healthy members).
- *Snapshots*: The operator realizes backups of each etcd member periodically,
to enable automated disaster recovery or manual recovery in case of force
majeure.
- *Resize*: By abstracting the cluster management, resizing the cluster becomes
- *Resize*: By abstracting cluster management, resizing the cluster becomes
straightforward as the underlying auto-scaling group can simply be scaled as
desired.

- *Snapshots*: Periodically, snapshots of the entire key-value space are
captured from each of the etcd members and uploaded to encrypted external
storage, allowing the etcd (or human) operator to restore the store at a later
time, in any etcd cluster or instance.

- *Failure recovery*: Upon failure of a minority of the etcd members, the
managed members automatically restart and rejoin the cluster without
breaking quorum or causing visible downtime: first by simply trying to rejoin
with their existing data set, otherwise by joining as a new member with a
clean state, or by replacing the entire instance if necessary.

- *Disaster recovery*: In the event of a quorum loss caused by the
simultaneous failure of a majority of the members, the operator coordinates
to snapshot any live members and cleanly stop them, before seeding a new cluster
from the latest data revision available once the expected number of instances
are ready to start again.
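
The split between failure recovery (a minority of members fails) and disaster recovery (a majority fails) follows from the usual majority-quorum rule. The sketch below only illustrates that arithmetic; `quorum` and `hasQuorum` are hypothetical helper names and do not come from the operator's code.

```go
package main

import "fmt"

// quorum returns the strict majority required for a cluster of the given
// size to keep accepting writes.
func quorum(size int) int {
	return size/2 + 1
}

// hasQuorum reports whether enough members are healthy to form a majority.
func hasQuorum(size, healthy int) bool {
	return healthy >= quorum(size)
}

func main() {
	for _, size := range []int{1, 3, 5} {
		fmt.Printf("size=%d quorum=%d tolerated failures=%d\n",
			size, quorum(size), size-quorum(size))
	}
	// In a 3-member cluster, losing one member keeps quorum (failure
	// recovery applies); losing two does not (disaster recovery applies).
	fmt.Println(hasQuorum(3, 2), hasQuorum(3, 1))
}
```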

The operator and etcd cluster can be easily configured using a [YAML file]. The
configuration notably includes clients/peers TLS encryption/authentication, with
@@ -42,15 +46,8 @@ is desired but authentication is not.
Running a managed etcd cluster using the operator is simply a matter of running
the operator binary in a supported auto-scaling group (as of today, AWS only).

A Terraform [module] is available to easily try the operator out or integrate it
into your infrastructure.

## Additional areas of interest

- Exposing Prometheus data about the cluster's health and resource usage,
including the availability zones spread where etcd is deployed.
- Document use-cases, user-stories and statistics regarding failures.
- Adding support for major cloud-providers, such as Azure and GKE.
A Terraform [module] is available to easily bring up production-grade etcd clusters
managed by the operator, and integrate them into your infrastructure.

[etcd-operator]: https://github.com/coreos/etcd-operator
[YAML file]: config.example.yaml
4 changes: 1 addition & 3 deletions cmd/operator/config.go
@@ -36,9 +36,7 @@ type config struct {
func defaultConfig() config {
return config{
ECO: operator.Config{
CheckInterval: 30 * time.Second,
AutoDisasterRecovery: false,
UnseenInstanceTTL: 60 * time.Second,
UnhealthyMemberTTL: 2 * time.Minute,
Etcd: etcd.EtcdConfiguration{
DataDir: "/var/lib/etcd",
PeerTransportSecurity: etcd.SecurityConfig{
13 changes: 9 additions & 4 deletions cmd/operator/operator.go
@@ -17,13 +17,17 @@ package main

import (
"flag"
"io/ioutil"
"os"
"strings"


etcdcl "github.com/coreos/etcd/clientv3"
"github.com/coreos/pkg/capnslog"
"github.com/quentin-m/etcd-cloud-operator/pkg/operator"
log "github.com/sirupsen/logrus"

"google.golang.org/grpc/grpclog"

"github.com/quentin-m/etcd-cloud-operator/pkg/operator"

// Register providers.
_ "github.com/quentin-m/etcd-cloud-operator/pkg/providers/asg/aws"
_ "github.com/quentin-m/etcd-cloud-operator/pkg/providers/asg/docker"
@@ -44,6 +48,7 @@ func main() {
log.SetLevel(logLevel)
log.SetFormatter(&log.TextFormatter{FullTimestamp: true})
capnslog.MustRepoLogger("github.com/coreos/etcd").SetLogLevel(map[string]capnslog.LogLevel{"etcdserver/api/v3rpc": capnslog.CRITICAL})
etcdcl.SetLogger(grpclog.NewLoggerV2(ioutil.Discard, ioutil.Discard, os.Stderr))

// Read configuration.
config, err := loadConfig(*flagConfigPath)
@@ -52,5 +57,5 @@ func main() {
}

// Run.
operator.Run(config.ECO)
operator.New(config.ECO).Run()
}
58 changes: 58 additions & 0 deletions cmd/tester/config.go
@@ -0,0 +1,58 @@
// Copyright 2017 Quentin Machu & eco authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package main

import (
"io/ioutil"
"os"

log "github.com/sirupsen/logrus"
"gopkg.in/yaml.v2"

"github.com/quentin-m/etcd-cloud-operator/pkg/tester"
)

// config represents a YAML configuration file that namespaces all ECO tester configuration under the
// top-level "eco-tester" key.
type config struct {
ECOTester tester.Config `yaml:"eco-tester"`
}

// loadConfig is a shortcut to open a file, read it, and generate a
// config.
//
// It supports relative and absolute paths and expands environment variables
// in the path. Given "", it returns an error, as there is no default config.
func loadConfig(path string) (config, error) {
config := config{}

f, err := os.Open(os.ExpandEnv(path))
if err != nil {
return config, err
}
defer f.Close()

d, err := ioutil.ReadAll(f)
if err != nil {
return config, err
}

err = yaml.Unmarshal(d, &config)
if err != nil {
return config, err
}

log.Infof("loaded configuration file %v", path)
return config, err
}
51 changes: 51 additions & 0 deletions cmd/tester/tester.go
@@ -0,0 +1,51 @@
// Copyright 2017 Quentin Machu & eco authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Package main implements basic logic to start the etcd-cloud-operator tester.
package main

import (
"flag"
"os"
"strings"

"github.com/coreos/pkg/capnslog"
log "github.com/sirupsen/logrus"

"github.com/quentin-m/etcd-cloud-operator/pkg/tester"
)

func main() {
// Parse command-line arguments.
flag.CommandLine = flag.NewFlagSet(os.Args[0], flag.ExitOnError)
flagConfigPath := flag.String("config", "", "Load configuration from the specified file.")
flagLogLevel := flag.String("log-level", "info", "Define the logging level.")
flag.Parse()

// Initialize logging system.
logLevel, err := log.ParseLevel(strings.ToUpper(*flagLogLevel))
log.SetOutput(os.Stdout)
log.SetLevel(logLevel)
log.SetFormatter(&log.TextFormatter{FullTimestamp: true})
capnslog.MustRepoLogger("github.com/coreos/etcd").SetLogLevel(map[string]capnslog.LogLevel{"etcdserver/api/v3rpc": capnslog.CRITICAL})

// Read configuration.
config, err := loadConfig(*flagConfigPath)
if err != nil {
log.WithError(err).Fatal("failed to load configuration")
}

// Run.
tester.Run(config.ECOTester)
}
7 changes: 3 additions & 4 deletions config.example.yaml
@@ -1,9 +1,8 @@
eco:
# The interval between each cluster verification by the operator.
check-interval: 30s
# The time after which, the member of an instance unseen for that duration,
# will be removed from the cluster.
unseen-instance-ttl: 60s
check-interval: 15s
# The time after which, an unhealthy member will be removed from the cluster.
unhealthy-member-ttl: 30s
# Defines whether the operator will attempt to seed a new cluster from a
# snapshot after the managed cluster has lost quorum.
auto-disaster-recovery: true
47 changes: 47 additions & 0 deletions docs/testing/README.md
@@ -0,0 +1,47 @@
## Functional Testing

The functional test suite verifies that an etcd cluster, managed by the
etcd-cloud-operator, is able to cope with severe failures injected while the cluster
is under high pressure. This is similar to upstream's [functional test suite].

There are certainly smarter ways to automate and run this test suite, but development
time was limited, so the focus was on getting it working quickly.

### Running the tests

- Make sure the ECO deployment has a sufficient backend quota (e.g. `8589934592`).
- Open port 22 on the security group for ECO instances

- Edit the present config.yaml to match the ECO deployment
- Create an Ubuntu EC2 instance in the same VPC as the ECO deployment, with ports 22/3000 opened
- Open four shells and execute the following commands:

```
fswatch -o ./ | while read num; do rsync -avz ./ ubuntu@<ubuntu instance's address>; done
```

```
ssh -A ubuntu@<ubuntu instance's address>
sudo -E su
apt update && apt install docker.io
curl -L https://github.com/docker/compose/releases/download/1.18.0/docker-compose-`uname -s`-`uname -m` -o /usr/bin/docker-compose && chmod +x /usr/bin/docker-compose
cd /home/ubuntu/docs/testing && docker-compose up
```

```
ssh -A ubuntu@<ubuntu instance's address>
sudo -E su
docker exec -it $(docker ps|grep tester|awk '{print $1}') bash
cd /go/src/github.com/quentin-m/etcd-cloud-operator/docs/testing/
go install -v github.com/quentin-m/etcd-cloud-operator/cmd/tester && tester -config=config.yaml -log-level=debug
```

```
ssh -A ubuntu@<ubuntu instance's address>
watch e --endpoints=<1st instance's ip>:2379,<2nd instance's ip>:2379,<3rd instance's ip>:2379 --dial-timeout=1s --command-timeout=1s endpoint status -w table
```

- Login to `http://<ubuntu instance's address>:3000` with `admin:password`
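
The example backend quota in the first step, `8589934592`, is 8 GiB expressed in bytes. The short sketch below only checks that arithmetic; `backendQuotaBytes` is a hypothetical helper, and it assumes the quota is given as a plain byte count.

```go
package main

import "fmt"

// backendQuotaBytes is a hypothetical helper converting GiB to bytes.
func backendQuotaBytes(gib int64) int64 {
	return gib << 30 // 1 GiB = 2^30 bytes
}

func main() {
	fmt.Println(backendQuotaBytes(8))
}
```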

[functional test suite]: https://github.com/coreos/etcd/tree/master/tools/functional-tester
8 changes: 8 additions & 0 deletions docs/testing/config.yaml
@@ -0,0 +1,8 @@
eco-tester:
cluster:
address: <cluster's load balancer address>
size: 3
tls:
cert-file: client.crt
key-file: client.key
trusted-ca-file: ca.crt