Tenplex is a state management library for deep learning (DL) systems that enables jobs to change their parallelism dynamically when the GPU allocation changes at runtime.
You can find the Tenplex paper at https://arxiv.org/abs/2312.05181.
Tenplex lets you train a model with multi-dimensional parallelism, i.e. tensor, data, and pipeline parallelism, in a resource-independent way: the resources can change during training without affecting convergence.
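To see why a resource change need not affect convergence, consider this minimal sketch (plain NumPy, not the Tenplex API; all names are illustrative): the sharded training state is re-partitioned for the new parallelism while the logical state stays identical.

```python
# Illustrative sketch only (not the Tenplex API): the same model tensor can
# be re-partitioned when the parallelism changes, so the logical training
# state is identical before and after the resource change.
import numpy as np

weight = np.arange(16.0).reshape(4, 4)  # one model tensor

# Shard for tensor parallelism of degree 4 (split along axis 0).
shards_tp4 = np.split(weight, 4, axis=0)

# The GPU allocation changes: re-partition the same state for degree 2.
restored = np.concatenate(shards_tp4, axis=0)
shards_tp2 = np.split(restored, 2, axis=0)

# The logical tensor is unchanged, so convergence is unaffected.
assert np.array_equal(np.concatenate(shards_tp2, axis=0), weight)
```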
When to use Tenplex?
- Elasticity, e.g., scaling with spot instances
- Redeployment, e.g., after a preemption
- Failure recovery, e.g., after a GPU failure
We implemented the prototype with Megatron-LM, which we use to obtain the parallelization configuration for a given set of resources.
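For illustration, such a resource-to-configuration mapping could look like the hypothetical sketch below (the names and format are ours, not the prototype's interface):

```python
# Hypothetical sketch: map a GPU count to a Megatron-LM style
# parallelization configuration. The actual prototype derives this from
# Megatron-LM; the table below is purely illustrative.
PARALLEL_CONFIGS = {
    16: {"tensor_parallel": 2, "pipeline_parallel": 2, "data_parallel": 4},
    8: {"tensor_parallel": 2, "pipeline_parallel": 2, "data_parallel": 2},
    4: {"tensor_parallel": 2, "pipeline_parallel": 2, "data_parallel": 1},
}

def parallel_config(num_gpus: int) -> dict:
    """Return the parallelization configuration for the given resources."""
    return PARALLEL_CONFIGS[num_gpus]

print(parallel_config(8))
```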
To install Tenplex:

```sh
git clone https://github.com/kungfu-team/tenplex
cd tenplex
make install
```
echo "deb https://europe-west2-apt.pkg.dev/projects/tenplex tenplex main" | sudo tee /etc/apt/sources.list.d/tenplex.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/packages-cloud-google-apt.gpg >/dev/null
sudo apt update
sudo apt install -y mlfs
Examples are in the `benchmark` directory. For instance, to run the dynamic resources benchmark in `benchmark/dynamic_resources`, execute `./run.sh` in that directory.
If you use Tenplex for your research, please cite our paper:
```bibtex
@inproceedings{wagenlander2024tenplex,
  title     = {Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections},
  author    = {Marcel Wagenlander and Guo Li and Bo Zhao and Luo Mai and Peter Pietzuch},
  booktitle = {Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles},
  year      = {2024}
}
```