Follow these steps to add a new self-hosted runner for GitHub Actions. You will need access to the Equinix account used for Vitess's CI testing and admin access to the Vitess repository.
- Spawn a new c3.small instance and name it on the Equinix dashboard
- Use ssh to connect to the server
- Install docker on the server by running the following commands
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
- Create a new user with a home directory for the action runner
sudo useradd -m github-runner
- Add the user to the docker group so that it can use docker as well
sudo usermod -aG docker github-runner
- Switch to the newly created user
su github-runner
- Go to the home directory of the user and follow the steps in "Adding self-hosted runners to a repository"
mkdir github-runner-<num> && cd github-runner-<num>
curl -o actions-runner-linux-x64-2.280.3.tar.gz -L https://github.com/actions/runner/releases/download/v2.280.3/actions-runner-linux-x64-2.280.3.tar.gz
tar xzf ./actions-runner-linux-x64-2.280.3.tar.gz
./config.sh --url https://github.com/vitessio/vitess --token <token> --name github-runner-<num>
- Inside a screen session, execute
./run.sh
- Set up a cron job to remove docker volumes and images every Monday, Wednesday, and Friday at 05:00
crontab -e
- Within the file add a line
0 5 * * 1,3,5 docker system prune -f --volumes --all
- VTOrc, Cluster 14, and some other tests use multiple MySQL instances, all of which are brought up with asynchronous I/O enabled in InnoDB. This sometimes causes the runner to hit the Linux asynchronous I/O limit (aio-max-nr).
To fix this, increase the default limit on the self-hosted runners as follows:
- To set the aio-max-nr value, add the following line to the /etc/sysctl.conf file:
fs.aio-max-nr = 1048576
- To activate the new setting, run the following command:
sysctl -p /etc/sysctl.conf
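Several steps in the list above (building the runner download URL, adding the cron entry, and adding the fs.aio-max-nr setting) can be made repeatable with small helpers. The following is a minimal sketch and not part of the Vitess tooling: the function names runner_tarball_url and ensure_line are hypothetical, and the URL pattern is only inferred from the v2.280.3 download command shown above.

```shell
# Hypothetical helpers; neither function exists in the Vitess repository.

# runner_tarball_url VERSION
# Print the download URL for an actions/runner release, following the same
# pattern as the v2.280.3 URL used in the steps above (assumed to hold for
# other versions as well).
runner_tarball_url() {
  local version="$1"
  printf 'https://github.com/actions/runner/releases/download/v%s/actions-runner-linux-x64-%s.tar.gz\n' \
    "$version" "$version"
}

# ensure_line FILE LINE
# Append LINE to FILE unless an identical line is already present, so that
# re-running the setup does not create duplicate entries.
ensure_line() {
  local file="$1"
  local line="$2"
  touch "$file" || return 1
  # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, running ensure_line /etc/sysctl.conf 'fs.aio-max-nr = 1048576' as root adds the setting only once. For the cron entry, one option is to dump the current crontab to a temporary file with crontab -l, pass that file through ensure_line, and load it back with crontab followed by the file name.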
Most of the code for running the tests is generated by make generate_ci_workflows, which uses the file ci_workflow_gen.go.
To move a unit test from GitHub runners to self-hosted runners, move the test from unitTestDatabases to unitTestSelfHostedDatabases in ci_workflow_gen.go and run make generate_ci_workflows.
To move a cluster test from GitHub runners to self-hosted runners, move the test from clusterList to clusterSelfHostedList in ci_workflow_gen.go and run make generate_ci_workflows.
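After editing the lists in ci_workflow_gen.go, it can help to regenerate the workflows and check which files actually changed. The sketch below assumes it is run from the root of a Vitess checkout and that the generated workflows live under .github/workflows, which is where GitHub Actions reads workflow files from. The function name regen_workflows is hypothetical.

```shell
# regen_workflows: hypothetical helper, not part of the Vitess Makefile.
# Run it from the root of a Vitess checkout after editing ci_workflow_gen.go.
regen_workflows() {
  # Regenerate the workflow files from the lists in ci_workflow_gen.go.
  make generate_ci_workflows || return 1
  # List modified workflow files; a moved test should show up here
  # (assumes the generator writes into .github/workflows).
  git status --short -- .github/workflows
}
```

If the output is empty, the edit to ci_workflow_gen.go probably did not change any generated workflow.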
You will need access to the self-hosted runner machine to be able to connect to it via SSH.
- From the output of the run on GitHub Actions, find the Machine name in the Set up job step
- Find that machine on the Equinix dashboard and connect to it via ssh
- From the output of the Print Volume Used step, find the volume used
- From the output of the Build Docker Image step, find the docker image built for this workflow
- On the machine run
docker run -d -v <volume-name>:/vt/vtdataroot <image-name> /bin/bash -c "sleep 600000000000"
- On the terminal, copy the docker id of the newly created container
- Now execute
docker exec -it <docker-id> /bin/bash
to go into the container and use the /vt/vtdataroot directory to find the output of the run along with the debug files
- Alternatively, execute
docker cp <docker-id>:/vt/vtdataroot ./debugFiles/
to copy the files from the docker container to the server's local file system
- You can browse the files there or go a step further and download them locally via scp.
- Please remember to clean up the folders created and stop the docker container via
docker stop <docker-id>
(a stopped container can then be removed with docker rm <docker-id>).
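The copy-based variant of the steps above can be sketched as one helper that starts the container, copies /vt/vtdataroot out, and cleans up afterwards. This is an illustration under assumptions, not an existing Vitess script: the function name collect_debug_files is hypothetical, and the one-hour sleep is an arbitrary choice that only keeps the container alive long enough for the copy.

```shell
# collect_debug_files VOLUME IMAGE [DEST]
# Hypothetical helper: mounts VOLUME in a temporary container started from
# IMAGE, copies /vt/vtdataroot into DEST (default ./debugFiles), then stops
# and removes the container.
collect_debug_files() {
  local volume="$1"
  local image="$2"
  local dest="${3:-./debugFiles}"
  local id status

  # docker run -d prints the id of the new container on stdout.
  id=$(docker run -d -v "$volume:/vt/vtdataroot" "$image" /bin/bash -c "sleep 3600") || return 1

  mkdir -p "$dest"
  docker cp "$id:/vt/vtdataroot" "$dest/"
  status=$?

  # Stop and remove the helper container even if the copy failed.
  docker stop "$id" >/dev/null
  docker rm "$id" >/dev/null
  return "$status"
}
```

A call would look like collect_debug_files <volume-name> <image-name>, using the volume and image names found in the Print Volume Used and Build Docker Image steps.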
There is currently one self-hosted machine that hosts only a single runner. This allows us to run tests that do not use docker on that runner.
All that needs to be done is to add runs-on: single-self-hosted, remove any code that downloads dependencies (since they are already present on the self-hosted runner), and add a couple of lines to save the vtdataroot output if needed.
PR 9944 is an example that moves one of the tests to a single-self-hosted runner.
NOTE - It is essential to ensure that all the binaries spawned while running the test are stopped, even on failure. Otherwise, they will keep running until someone removes them manually, and they might interfere with future runs as well.
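One common way to guarantee cleanup on failure in a shell step is an EXIT trap. The sketch below is a generic pattern, not taken from the Vitess workflows: with_cleanup is a hypothetical name, and the cleanup function passed to it is something you would write yourself to stop the binaries your test starts.

```shell
# with_cleanup CLEANUP_FN COMMAND [ARGS...]
# Hypothetical wrapper: run COMMAND in a subshell and call CLEANUP_FN when the
# subshell exits, whether COMMAND passed or failed. The exit status of COMMAND
# is preserved so the CI step still reports the failure.
with_cleanup() {
  local cleanup="$1"
  shift
  (
    # The EXIT trap fires when this subshell terminates for any reason
    # short of being killed with SIGKILL.
    trap "$cleanup" EXIT
    "$@"
    exit $?
  )
}
```

For example, with_cleanup stop_test_binaries <test-command> runs the test and then calls stop_test_binaries, a placeholder for a function you define, regardless of the outcome.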
The logs will be stored in the savedRuns directory and can be copied locally via scp.
A cron job is already set up to empty the savedRuns directory every week, so please download the runs before they are deleted.
If the load on the self-hosted runners increases because multiple tests are moved to them, or for some other reason, they sometimes run out of disk space. This causes the runner to stop working altogether.
To fix this issue, follow these steps:
- ssh into the self-hosted runner by finding its address from the Equinix dashboard.
- Clear out the disk by running
docker system prune -f --volumes --all
This is the same command that we run on a cron on the server.
- Switch to the github-runner user
su github-runner
- Resume an existing screen
screen -r
- Start the runner again.
./run.sh
- Verify that the runner has started accepting jobs again. Detach the screen and close the ssh connection.
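Before pruning, it can be useful to confirm that disk space is really the problem. The following sketch reads the usage percentage from df; the function names and the 90 percent threshold are hypothetical choices, not values used by the Vitess runners.

```shell
# disk_usage_percent [PATH]
# Print the used-space percentage (digits only) of the filesystem holding
# PATH, defaulting to /. Assumes the filesystem name contains no spaces.
disk_usage_percent() {
  # df -P prints one POSIX-format line per filesystem; field 5 looks like "42%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# needs_prune PERCENT THRESHOLD
# Succeed when PERCENT is at or above THRESHOLD.
needs_prune() {
  [ "$1" -ge "$2" ]
}

# Example use (90 is an arbitrary threshold):
#   if needs_prune "$(disk_usage_percent /)" 90; then
#     docker system prune -f --volumes --all
#   fi
```

Docker stores its data under /var/lib/docker by default, so passing that path instead of / checks the filesystem that the prune command actually frees.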