Submit batch jobs to Kubernetes. Complements JupyterHub's support for interactive computing on Kubernetes.
This repository contains two packages: `kbatch` is for users, and `kbatch-proxy` is for administrators deploying kbatch for their users.
`kbatch` can be installed with pip:
$ pip install kbatch
`kbatch` typically uses JupyterHub for authentication, so you'll first need an API token. You can generate one by logging in and visiting the token generation page, typically at `<JUPYTERHUB_URL>/hub/token`. Provide this token in place of `<JUPYTERHUB_TOKEN>` below:
$ kbatch configure --kbatch-url="https://url-to-kbatch-server" --token="<JUPYTERHUB_TOKEN>"
This will create a configuration file that specifies the default URL and credentials to use for all `kbatch` operations.
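The exact location and format of this file depend on your platform and kbatch version, but conceptually it stores something like the following (a sketch; the path and field names are assumptions, not confirmed by this README):

# file: ~/.config/kbatch/config.json (path is an assumption; may vary by platform)
{
  "kbatch_url": "https://url-to-kbatch-server",
  "token": "<JUPYTERHUB_TOKEN>"
}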
At a minimum, jobs require:

- A `name` to identify the job.
- A `command` to run, as a list of strings (e.g. `["ls"]` or `["papermill", "my-notebook.ipynb"]`).
- A container `image` to use (consider matching the one used on your Hub, perhaps one from pangeo-docker-images or planetary-computer-containers).
$ kbatch job submit --name=list-files \
--command='["ls", "-lh"] \
--image=alpine
Additionally, you can provide code (either a directory or a single file) to make available on the server for your Job.
$ kbatch job submit --name=test \
--image="mcr.microsoft.com/planetary-computer/python" \
--command='["papermill", "notebook.ipynb"]' \
--file=notebook.ipynb
Rather than providing all those arguments on the command-line, you can create a YAML configuration file.
$ cat config.yaml
# file: config.yaml
name: "my-job"
command:
- sh
- script.sh
image: "mcr.microsoft.com/planetary-computer/python:latest"
code: "script.sh"
$ kbatch job submit -f config.yaml
Get the full help:
$ kbatch job submit --help
Usage: kbatch job submit [OPTIONS]
Submit a job to run on Kubernetes.
Options:
-n, --name TEXT Job name.
--image TEXT Container image to use to execute job.
--command TEXT Command to execute.
--args TEXT Arguments to pass to the command.
-e, --env TEXT JSON mapping of environment variables for the job.
-d, --description TEXT A description of the job, optional.
-c, --code TEXT Local file or directory of source code to make
available to the job.
-f, --file TEXT Configuration file.
--kbatch-url TEXT URL to the kbatch server.
--token TEXT JupyterHub API token.
--help Show this message and exit.
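For example, environment variables can be passed as a JSON mapping via `--env`. A minimal sketch (the job name, image, and variable here are arbitrary):

$ kbatch job submit --name=env-demo \
    --image=alpine \
    --command='["sh", "-c", "echo $GREETING"]' \
    --env='{"GREETING": "hello"}'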
List your jobs:
$ kbatch job list
...
Show the details of a given job:
$ kbatch job show "<job-id>"
Similar to jobs, cronjobs require a `name`, a `command`, and a container `image`. However, cronjobs also require a `schedule`, which follows the cron schedule syntax.
$ kbatch cronjob submit \
--name=list-files \
--image=alpine \
--command='["ls", "-lh"]' \
--schedule='0 22 * * 1-5'
This job will now run at 22:00 every day of the week from Monday through Friday, indefinitely.
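For reference, the five fields of a cron schedule are, in order (standard cron syntax):

# ┌───────────── minute (0)
# │ ┌─────────── hour (22)
# │ │ ┌───────── day of the month (* = any)
# │ │ │ ┌─────── month (* = any)
# │ │ │ │ ┌───── day of the week (1-5 = Monday through Friday)
# │ │ │ │ │
  0 22 * * 1-5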
NOTE: Given that cronjobs run on a schedule indefinitely, the only way to stop them is to manually delete the cronjob. See the user guide docs for more information.
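A hypothetical invocation for deleting a cronjob (the exact subcommand is an assumption; check `kbatch cronjob --help` for what your version provides):

$ kbatch cronjob delete "<cronjob-id>"  # assumes a delete subcommand exists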
To read a job's logs, provide the pod id. Note that this is the pod id, not the job id.
$ kbatch job logs "<pod-id>"
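How you discover the pod id depends on your version of kbatch; it may expose pods directly (for example via a `kbatch pod list` subcommand, if available; check `kbatch --help`), or an administrator can find it with kubectl.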
Your job probably involves local files or scripts. How do those files get from your local machine to the job?
When submitting the job, you can specify the path to the local code files to make available to the job. This can be either a single file (e.g. `script.sh` or `main.py`) or a directory of files (e.g. `my-dir/`). The files will be present before your job starts up.

When your job starts executing, its working directory is `/code`, so you can safely refer to relative paths like `sh script.sh` or `python my-dir/main.py`.
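For example, to ship a whole directory and run a script inside it (a sketch; the job name and image here are arbitrary):

$ kbatch job submit --name=dir-demo \
    --image="mcr.microsoft.com/planetary-computer/python" \
    --code=my-dir/ \
    --command='["python", "my-dir/main.py"]'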
...
- Simplicity of implementation: https://words.yuvi.in/post/kbatch/ by Yuvi Panda captures this well.
- Simplicity of adoption: Users don't need to adapt their script / notebook / unit of work to the job system.
- Integration with JupyterHub: Runs as a JupyterHub service, uses JupyterHub for auth.
- Runs on Kubernetes: mainly for the simplicity of implementation, and also that's my primary use-case.
Together, these rule out some great tools like Argo Workflows, Ploomber, and Elyra. So we write our own (hopefully simple) implementation.
We don't want to directly expose the Kubernetes API to the user. At the same time, we don't want a complicated deployment with its own state to maintain. We balance these competing interests by writing a very simple proxy that sits between the users and the Kubernetes API. This proxy is responsible for
- Authenticating users (typically by checking the `Bearer` token with a call to the JupyterHub API)
- Authorizing the command (essentially, making sure that the API call only touches objects in the user's namespace)
- Submitting the call to the Kubernetes API, returning the results
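As a sketch of the authentication step: validating a token against the JupyterHub REST API amounts to a request like the following, where a 200 response containing the user model means the token is valid (the URL and token are placeholders):

$ curl -H "Authorization: token <JUPYTERHUB_TOKEN>" \
    "https://<JUPYTERHUB_URL>/hub/api/user"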