# Distributed training of multi-arm variational autoencoding networks

This repository contains:

- experiment code for training multi-arm variational autoencoding networks with the fully sharded data parallel (FSDP) strategy in PyTorch
- a tutorial on training with FSDP using some simple models on the MNIST dataset
To set up the conda environment used for this project:

- Clone the repo:

```
git clone https://github.com/AllenInstitute/distributed-vae.git
cd distributed-vae
```

- Install `torch` (>= 2.0, with CUDA support) and `tqdm`. You can either recreate the exact conda environment we used for this project (which likely has more packages than you actually need):

```
conda env create -f environment.yml -n dist-mmidas
```

  or just follow the standard instructions for installing `torch` with CUDA on your machine, and install `tqdm` as well.

- Activate the environment:

```
conda activate dist-mmidas
```
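As a quick sanity check that the environment has what the tutorial needs, you can verify that `torch` and `tqdm` are importable. This is a minimal illustrative sketch, not a script from the repo:

```python
import importlib.util

def missing(packages):
    """Return the subset of `packages` that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

if __name__ == "__main__":
    gaps = missing(["torch", "tqdm"])
    if gaps:
        print("missing packages:", ", ".join(gaps))
    else:
        print("torch and tqdm are both available")
```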
The most important parts of this repository are the two files `fsdp_tutorial.ipynb` and `fsdp_tutorial.py`.

The notebook `fsdp_tutorial.ipynb` walks step by step through how to use the FSDP training strategy in PyTorch. This is likely what you are looking for: activate your conda environment (instructions above) and work through the notebook to learn how to use FSDP in PyTorch.

The script `fsdp_tutorial.py` contains the same code as the tutorial notebook. It is suitable for running the tutorial code as a job in HPC cluster environments (such as SLURM).
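The core pattern the tutorial builds up to can be sketched roughly as follows. This is an illustrative sketch, not code from the repo: it assumes a single multi-GPU machine, a `torchrun` launch (e.g. `torchrun --nproc_per_node=2 sketch.py`), and uses a toy model in place of the real one:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR/PORT for us.
    dist.init_process_group("nccl")
    local_rank = dist.get_rank()  # single-node assumption: rank == local rank
    torch.cuda.set_device(local_rank)

    # Toy stand-in for the real model.
    model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).cuda()
    model = FSDP(model)  # shards parameters and gradients across ranks

    # Build the optimizer AFTER wrapping, so it sees the sharded parameters.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One dummy training step with fake MNIST-shaped data.
    x = torch.randn(32, 784, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```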
TODO:

- `sbatch` script for running `fsdp_tutorial.py` on SLURM
- Clean up files in the `dist/` directory
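For reference, a skeleton of such an `sbatch` script might look like the following. This is a hypothetical config fragment, not a script from the repo; the resource settings are placeholders to adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=fsdp-tutorial
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2     # placeholder: adjust to your cluster
#SBATCH --time=01:00:00

# Make conda usable inside a batch job, then activate the project env.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate dist-mmidas

# Launch one process per GPU on this node.
torchrun --nproc_per_node=2 fsdp_tutorial.py
```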