Cloudflare account, domains, API token
For now, I'll deploy from my local computer - I just don't trust myself to set up CI properly enough that it won't leak credentials and fuck up production systems!
- (not yet implemented) run terraform to apply anything, generate secrets and dump them into gitignored files
- run `make deploy` to rsync this entire goddamn folder to the VPS over SSH (requires me to set up rsync first, or just scp or whatever it's called)
- run `make up` to reload all the goddamn things (note that docker-compose only reloads containers that have changed - either the image or the actual docker-compose config)
Checklist for web-facing services (see the sketch after these checklists):
- Traefik labels (incl. entrypoint)
- Authelia middleware label
- Flame label
- Restart policy
- Networks
Checklist for internal services:
- Restart policy
- Networks
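For reference, here's a minimal sketch of what those checklist items tend to look like on a compose service; the router, middleware, network, and Flame label names below are assumptions based on a typical Traefik/Authelia/Flame setup, not necessarily the exact values used in this repo.

```yaml
services:
  whoami:                                    # example web-facing service
    image: traefik/whoami
    restart: unless-stopped                  # restart policy
    networks:
      - public                               # network shared with Traefik
    labels:
      - traefik.enable=true
      - traefik.http.routers.whoami.rule=Host(`whoami.example.com`)
      - traefik.http.routers.whoami.entrypoints=websecure          # entrypoint
      - traefik.http.routers.whoami.middlewares=authelia@docker    # Authelia middleware
      - flame.type=application               # Flame dashboard labels
      - flame.name=whoami
      - flame.url=https://whoami.example.com

  worker:                                    # example internal service
    image: alpine:3                          # placeholder image
    command: ["sleep", "infinity"]
    restart: unless-stopped                  # restart policy
    networks:
      - internal                             # internal-only network

networks:
  public: {}
  internal: {}
```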
Use `~/.ssh/config` to configure a specific host for the Makefile to connect to, named `vultr`, with your server's HostName and User.
Then tell the Makefile to use that SSH host configuration via the `SERVER` variable.
The way state is handled in this stack is that all stateful applications have volumes mounted wherever they need to read/write, and we "aggregate" those volumes (living on the host) to back them up/restore them in one go (i.e. we back up the "top-level" volumes folder containing all of the volume mounts in one go, and we restore that same folder so that the volume mounts are all restored in one go).
For databases, there's an additional layer of complexity: what's on the filesystem may not necessarily reflect the "current state" of the application, as databases commonly "flush" to a WAL without necessarily updating their own state files, trusting the WAL as the thing to restore from when recovering from a shutdown.
Therefore, for the databases, we forcefully "flush" their entire state into something we can back up and restore from (we'll call them "dump files" for now), and make sure to run the "dump state" / "restore state from the dump" steps whenever we back up/restore. In particular, the "dump state" step should run before the backup (so the dump files are included in the backup), and the "restore from the dump" step should run after the folder is restored (so we actually have the dump files to work off of).
Currently, we back up the `/var/lib/docker/volumes` folder (which contains all Docker-managed volumes), as we mount all stateful applications onto named Docker volumes.
The backup/restore of that "top-level" volumes folder is handled by restic, which takes incremental snapshots of the folder as a whole, letting you "time travel" back to past snapshots at a very minimal cost in additional disk space.
The actual backup is done automatically by the `restic-backup` container, which runs a backup on startup (i.e. when you re-deploy that container, it will back up) and on a schedule. The container already contains all of the scripts necessary for "dumping" databases.
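As a hypothetical sketch of that wiring (the image, repository URL, and environment variables below are illustrative assumptions, not the actual configuration of the `restic-backup` container):

```yaml
services:
  mysql:
    image: mysql:8
    volumes:
      - mysql_data:/var/lib/mysql            # state lives in a named Docker volume

  restic-backup:
    image: restic/restic                     # assumed image; the real container bundles the dump/schedule scripts
    volumes:
      - /var/lib/docker/volumes:/data:ro     # the "top-level" volumes folder that gets snapshotted
      # (volumes like mysql_data would be excluded from the restic run itself; see the note below)
    environment:
      RESTIC_REPOSITORY: s3:https://s3.example.com/backups   # illustrative repository
      RESTIC_PASSWORD_FILE: /run/secrets/restic_password

volumes:
  mysql_data: {}
```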
You can also run the `make backup` command, which uses that exact container to back up the same way it normally does during scheduled backups; the only difference is that the command is a "one shot" (it doesn't schedule further backups, exits with status code 0 upon a successful backup, and exits with a nonzero code plus an alert to Slack upon failure).
To restore, we first need to select the snapshot that we want to restore from (which is especially relevant if you "fucked up" something and want to time travel to before you fucked up).
You can either restore from the latest snapshot (`make restore`), or specify a snapshot to restore from. For the latter, figure out which snapshot you want by running `make list-snapshots` to list all of the snapshots you can choose from, copy the ID of the snapshot you want, and pass it in with `make restore SNAPSHOT=$id`.
The restore script automatically handles "re-hydrating" from the database dump files.
One thing of note: the restore should be done on an "empty state". So for databases like mysql, whose "running state" (and not the dump) is stored in its own volume, we explicitly exclude those Docker volumes from the backup, so that such containers are restored from the dump onto an "empty state".
Once the restore is done, you can start the containers back up, given that the state is now restored and safe to be read.
Note: restic and all of its subcommands need to secure an exclusive lock, which they do by touching some files in your backup destination. However, sometimes this doesn't work (especially when you have multiple processes running at the same time), perhaps due to the "object storage" of choice being eventually consistent. In that case, you need to break the lock (after making sure no other restic process is currently running) by running `restic unlock` (this can be run inside any of the restic containers: backup/prune/check).
We try to push as much of the stack onto Docker as possible, so that resources are managed by it and have their lifecycle determined by it. For example, instead of creating the networks outside of the docker-compose stack and injecting them as `external: true`, we let Docker itself create/destroy the networks as the stack is created/deleted.
This also serves as a way to "gc" the resources automatically.
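Concretely, the difference looks like this (the network name is just an example):

```yaml
# What we avoid: referencing a network that has to be created outside the stack.
# networks:
#   public:
#     external: true

# What we do instead: declare the network in the compose file, so Compose
# creates it with the stack and removes it when the stack is torn down.
networks:
  public: {}
```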
Docker networks need special attention due to the way docker-compose works.
Basically, when you ask Docker to create a network named `foo`, it will create a Docker network named `foo`. Simple enough.
However, when you ask docker compose to create a network named `foo` while you're in a folder named `folder_name`, it will create a Docker network named `folder_name_foo`, because a. docker compose defaults its `COMPOSE_PROJECT_NAME` to the folder name, and b. docker compose prefixes the networks it creates/manages with the `COMPOSE_PROJECT_NAME`.
Thus, we manually set `COMPOSE_PROJECT_NAME` in our `.env` (to override the docker-compose default), and tell Traefik to look not at the `public` network, but at the `${COMPOSE_PROJECT_NAME}_public` network instead (as Traefik doesn't know anything about docker compose prefixing network names).
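A sketch of how those two pieces fit together (the project name, Traefik version, and flags are illustrative; `providers.docker.network` is the Traefik option that picks which network it uses to reach containers):

```yaml
# .env
#   COMPOSE_PROJECT_NAME=mystack     # example project name

services:
  traefik:
    image: traefik:v2.10             # version is illustrative
    command:
      - --providers.docker=true
      # Compose will actually create "mystack_public", so Traefik must be told
      # to use the prefixed name rather than plain "public".
      - --providers.docker.network=${COMPOSE_PROJECT_NAME}_public
    networks:
      - public

networks:
  public: {}                         # created as ${COMPOSE_PROJECT_NAME}_public
```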
Just as we push all "resources" onto Docker/Compose to be managed by it, we also try to push as much of the service lifecycle onto the orchestrator as possible, to make our lives easier.
One of the ways we do this is by 1. explicitly adding Docker healthchecks to as many containers as possible, and 2. setting service dependencies on each other.
For example, for a service that uses mysql and redis, we might mark it with the following:
depends_on:
  mysql:
    condition: service_healthy
  redis:
    condition: service_healthy
Docker then uses this information to basically build a DAG of service dependencies, such that:
- when starting everything from scratch (such is the case when you're restoring from a "blank slate"), the service won't be started until mysql and redis are up and running.
- when shutting everything down, the service goes down first before mysql and redis.
- when restarting mysql or redis, the service, too, shuts down, and waits until mysql/redis are up and running before starting back up.
All of this ensures that no matter what we do to the stack (take it all up/down, selectively destroy or recreate containers, etc), the orchestrator will always ensure that service dependencies are met.
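For completeness, here's a sketch of the healthchecks that make `condition: service_healthy` meaningful; the commands and intervals are typical examples, not necessarily what this repo uses.

```yaml
services:
  mysql:
    image: mysql:8
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  app:
    image: some/app                  # placeholder
    depends_on:
      mysql:
        condition: service_healthy
      redis:
        condition: service_healthy
```

Without a healthcheck, `service_healthy` has nothing to key off of, so the dependency would only wait for the container to start, not for the database to actually accept connections.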
To get the benefits of DX tooling (including git hooks), you need to install the node dependencies.
First, install nvm and use the version of node specified in this repository (`nvm use`).
Then it's just a simple matter of running `npm install`, and all of the git hooks will be installed automatically.
First, install pyenv to control the version of python used (mainly for consistency and reproducibility): https://github.com/pyenv/pyenv#installation
(Optionally, install shell integrations: https://github.com/pyenv/pyenv#set-up-your-shell-environment-for-pyenv)
If you don't have the specific version listed in `.python-version`, install the version specified in the repository (in this case, 3.11):
pyenv install 3.11
Then, to use the version of python specified in the repository (automatically), run:
pyenv local # every time you open up this directory, pyenv will automatically switch to the repo's specified python version
Now that we have pyenv set up for consistent python versions, we can install poetry for that specific python version:
pip install poetry # note that you don't have to specify pip3 thanks to pyenv
Then, "activate" the poetry environment using our pyenv-provided python:
poetry env use 3.11 # you may have to specify the full path to the pyenv python: https://python-poetry.org/docs/managing-environments/#switching-between-environments
Finally, we can install everything:
poetry install
Hooray!
Note: by default, poetry installed via the pyenv-provided python will install its dependencies inside the `.venv` folder, allowing your editor (like VS Code) to automatically pick up on the python dependencies when you use them in your code. However, if it doesn't, you may have to set the `virtualenvs.in-project` option to configure poetry to install the dependencies locally: https://python-poetry.org/docs/configuration#virtualenvsin-project (note that this requires destroying and recreating the poetry environment).
Because poetry controls the dependencies (and "regular" python can't see its virtualenv), you need to use poetry to run any python commands.
You can either drop into a shell where all of that is already pre-configured:
poetry shell
Or, alternatively, just run your python commands using poetry:
poetry run python diagram.py
Either approach will let the pyenv-controlled python binary (which means it's going to use the "right" version) pick up on all of the virtualenv dependencies.
Happy hacking!
Passing down command line arguments and bespoke environment variables can only get you so far; to really alleviate the "there are 5000 points I need to configure everywhere" problem, we're using templating as a solution (which further helps cut bespoke command line/environment configuration down to only where it's needed).
Here, we're using a "convention-over-configuration" approach to templating, and consuming the results of said templating.
Any file ending in `.j2` will be rendered down into a file with the same name, except with the `.j2` extension stripped away (e.g. `services/$service/docker-compose.$container.yml.j2` will be rendered down into `services/$service/docker-compose.$container.yml`, and will be picked up by all of the Make commands as part of the docker-compose "fleet").
Since the rendered files 1. are auto-generated (and thus don't belong in git history), and 2. may contain sensitive secrets, we're intentionally choosing not to commit them; you can tell which files will be rendered by the presence of a `.j2` file in the folder you're looking at.
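For example, a template and its rendered counterpart might look like this (the service and variable names are hypothetical):

```yaml
# services/myservice/docker-compose.app.yml.j2  (hypothetical template)
services:
  app:
    image: some/app
    environment:
      # filled in by the render step; the variable name is made up for illustration
      API_TOKEN: "{{ api_token }}"

# After rendering, this becomes services/myservice/docker-compose.app.yml with
# the real value substituted in, which is exactly why the rendered file is
# gitignored rather than committed.
```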
NOTE: It's tricky to change the name of the generated file in a way that's 1. obvious (e.g. if we were to generate `traefik.static.generated.yml` from `traefik.static.yml.j2`, for the sake of always making sure the generated files had a `.generated` part in their file name so it'd be easier to grep all generated files and gitignore them, it would be confusing to try to guess what the generated file name would be when rendering a template), and 2. non-disruptive to the existing flow (e.g. if we were to generate `docker-compose.$service.yml.generated` from `docker-compose.$service.yml.j2`, existing workflows around grepping for `docker-compose.*.yml` would break). So we're simply going to settle for stripping the `.j2` from the template file name. That means there is NO way for us to automatically detect generated files based on their filename! So take care to add each generated file to the `.gitignore` list!
You can manually render the templates by running `make render` (mainly to inspect the generated files and see if they look correct). For deployment, to make the process as seamless as possible, rendering is automatically run as part of `make deploy`, so there's no need to manually render down the templates before deployment for the generated files to reflect the latest values.
Right now, we're testing various pieces of "logic" (i.e. standalone functions that do not have external dependencies), but plan to expand the tests to cover more behaviours, such as e2e testing and integration testing (if I ever get to terraform modules), i.e. actually running the things and checking that the behaviour is what we expect.
For now, simply run `make test` to run all the tests.
You'll note that all tests are marked with either `@pytest.mark.unit` or `@pytest.mark.integration`, for unit tests and integration tests respectively.
This allows us to run only unit tests or only integration tests (`make test-unit` and `make test-integration`) and keep them separate.
This is useful because unit tests, unlike integration tests, test only the specific bits of logic in their purest form; as such, they can be tested in complete isolation, with mocks provided to them so they don't hit the actual filesystem or make live network calls.
Note: you'll also note that when unit testing individual modules, I often mock out the actual I/O calls that would have been made, whether network calls via the `responses` library or filesystem access via the `pyfakefs` library. Being able to fake I/O calls not only reduces the I/O latency in tests, but also allows me to set up bespoke network/filesystem responses for every test without having to set up a pre-canned response (e.g. as with the fixtures/ folder that I use for integration tests) that needs to be shared by every unit test.
In comparison, integration tests test the executables that actually coordinate and run the bits of logic, interfacing with the "real world" (i.e. I/O, external dependencies). This means they can't really be tested in isolation, though we can feed them fixtures (different from mocks) to keep the test results consistent.
This fundamental difference between testing isolated bits of logic vs. "executables" is why it's so useful to test them separately: by their very nature, the integration tests are more likely to fail (due to the I/O involved) and will generally take longer (again, due to the I/O).
To mark the tests, we rely on yet another bit of "convention over configuration": any test without explicit markings is marked as a unit test, and any test with integration in its name (i.e. `test_integration_*`) is marked as an integration test.
You can debug tests by running only one of them, or forcing pytest to let log/print statements through.
You can pass any options to the `make test-*` commands by setting the `OPTIONS` variable. For example:
make test OPTIONS='-s' # to print output
We use Gitleaks for securing this repo, and you'll need to make sure you have it installed locally to scan for secrets on commit. You can install it by running `brew install gitleaks`.
As for why Gitleaks... Trivy's scanner doesn't match much of anything (see https://github.com/aquasecurity/trivy/blob/main/pkg/fanal/secret/builtin-rules.go), and while Trufflehog is awesome, it isn't currently built out for "incremental" scans, such as scanning staged files.
If Trufflehog ever supports scanning one file at a time (or just integrates scanning staged files outright like gitleaks), I will drop gitleaks in a heartbeat. Until then, integrating gitleaks into pre-commit is the only "fast enough" way to do local, incremental scanning.
For CI, we do use the Trufflehog scanner because it scans all commits within a branch just fine.