This is a demo of PyTorch distributed training. This repo contains three simple demos for training a model with several GPUs, either on a single machine or across several machines. The main code is borrowed from pytorch-multigpu and pytorch-tutorial; I only did some finishing work on the code, so thanks to the two authors. In addition, an sbatch sample is provided for running distributed training on an HPC (high-performance computing) cluster.


  • PyTorch >= 1.0 is preferred.
  • Python > 3.0 is preferred.
  • NFS: all compute nodes should preferably load data from the Network File System.
  • Linux: the PyTorch distributed package currently runs on Linux only.

Run the demos

Demo 1

This demo is based on torch.nn.DataParallel(model), the simplest way to use multiple GPUs on a single compute node. Each batch is automatically divided into N mini-batches that are processed by N GPUs. The model replicas on the different GPUs stay synchronized throughout training.
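To make the batch division concrete, here is a minimal pure-Python sketch of how a batch of 256 samples could be cut into per-GPU mini-batches. The helper `split_batch` is hypothetical and not part of this repo or of PyTorch; it assumes contiguous, ceil-sized chunks along the batch dimension, which is how `torch.chunk` splits a tensor.

```python
import math


def split_batch(batch, num_gpus):
    """Split a batch (any sequence) into at most num_gpus contiguous chunks.

    Hypothetical helper for illustration only. It assumes ceil-sized
    chunks, so the last chunk may be smaller than the others.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if len(batch) == 0:
        return []
    chunk_size = math.ceil(len(batch) / num_gpus)
    return [batch[i:i + chunk_size] for i in range(0, len(batch), chunk_size)]


# A batch of 256 sample indices shared between 2 GPUs.
batch = list(range(256))
chunks = split_batch(batch, num_gpus=2)
print([len(chunk) for chunk in chunks])  # [128, 128]
```

With the batch size of 256 used in the tests below and 2 GPUs, each GPU would process 128 samples per iteration.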


Demo 2

This demo is based on the PyTorch distributed package. There are N separate training processes, and each process has exclusive use of one GPU. As in demo 1, the models on the different GPUs stay synchronized throughout training. We use torch.distributed.launch to create the N processes.

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1
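By default, torch.distributed.launch passes a `--local_rank` argument to each process it starts, so the training script has to accept it. The sketch below shows one way to do that with argparse; the function name `parse_launch_args` is hypothetical, and the torch calls in the comments are an assumption about how a script would typically use the value, not a copy of this repo's code.

```python
import argparse


def parse_launch_args(argv=None):
    """Parse the --local_rank argument supplied by torch.distributed.launch."""
    parser = argparse.ArgumentParser(description="Sketch of a demo 2 entry point")
    parser.add_argument(
        "--local_rank",
        type=int,
        default=0,
        help="index of this process (and of its GPU) on the current node",
    )
    return parser.parse_args(argv)


# Simulate what the second of two launched processes would receive.
args = parse_launch_args(["--local_rank=1"])
print(args.local_rank)  # 1

# In a real training script one would typically continue with (assumption):
#   torch.cuda.set_device(args.local_rank)
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

With `--nproc_per_node=2`, the two processes would see local ranks 0 and 1, giving each process its own GPU.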

Demo 3

In this demo, I run three processes on three different compute nodes. Each process can use either one GPU or several GPUs on its compute node (in the same way as demo 1), as specified by --gpu_devices 0 1. Of course, every compute node must have the same PyTorch runtime environment. I use an NFS file as the init_method. Note that the NFS file (which should not exist beforehand, maybe because it is used like a UDS) must be accessible to all the processes on the different compute nodes, because they all need this file to communicate with each other while the cluster is being set up. The NFS file is removed automatically after training. There are also other ways to initialize a process group; please refer to the PyTorch distributed documentation.

Manually launch a process on each compute node.

# node 1
python \
    --init_method file://<absolute path to nfs file> \
    --rank 0 \
    --world_size 3 \
    --gpu_devices 0 1
# node 2
python \
    --init_method file://<absolute path to nfs file> \
    --rank 1 \
    --world_size 3 \
    --gpu_devices 0 1
# node 3
python \
    --init_method file://<absolute path to nfs file> \
    --rank 2 \
    --world_size 3 \
    --gpu_devices 0 1
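The flags in the commands above could be consumed roughly as follows. This is a sketch under stated assumptions: the helper names `build_parser` and `process_group_kwargs`, the `nccl` backend and the example file path are all hypothetical. Only the flag names come from the commands above, and the keyword names match `torch.distributed.init_process_group`.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags used in the launch commands."""
    parser = argparse.ArgumentParser(description="Sketch of a demo 3 entry point")
    parser.add_argument("--init_method", type=str, required=True,
                        help="e.g. file://<absolute path to nfs file>")
    parser.add_argument("--rank", type=int, required=True,
                        help="rank of this process, 0 .. world_size - 1")
    parser.add_argument("--world_size", type=int, required=True,
                        help="total number of processes (one per node here)")
    parser.add_argument("--gpu_devices", type=int, nargs="+", default=[0],
                        help="GPU ids this process may use on its node")
    return parser


def process_group_kwargs(args, backend="nccl"):
    """Collect the keyword arguments for init_process_group (backend assumed)."""
    if not 0 <= args.rank < args.world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return {
        "backend": backend,
        "init_method": args.init_method,
        "rank": args.rank,
        "world_size": args.world_size,
    }


# Simulate the command line of node 1; the path is a made-up placeholder.
args = build_parser().parse_args([
    "--init_method", "file:///shared/nfs/example_init_file",
    "--rank", "0",
    "--world_size", "3",
    "--gpu_devices", "0", "1",
])
kwargs = process_group_kwargs(args)
print(kwargs["rank"], kwargs["world_size"], args.gpu_devices)

# A real script would then call (assumption):
#   torch.distributed.init_process_group(**kwargs)
```

Checking the rank against the world size up front gives a clear error if a node is started with the wrong flags, instead of a hang while the process group waits for missing peers.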

GPU cluster on HPC

  1. Create

#!/bin/bash
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=1
#SBATCH --mem=12000
#SBATCH --time=20:00
#SBATCH --output=log
#SBATCH --ntasks=1
#SBATCH --array=0,1

srun python \
    --init_method file://<absolute path to nfs file> \
    --rank $SLURM_ARRAY_TASK_ID \
    --world_size ${SLURM_ARRAY_TASK_COUNT} \
    --batch_size 256 \
    --gpu_devices 0 1

  2. Simply run sbatch on an interactive node to submit the job.
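The sbatch script above maps the Slurm job-array variables onto --rank and --world_size. The same mapping can be sketched in Python, for example if you prefer to read the environment inside the training script. The helper `rank_from_slurm_array` is hypothetical, and it assumes the array ids are 0 .. N-1, as with `--array=0,1`.

```python
import os


def rank_from_slurm_array(env=None):
    """Derive (rank, world_size) from Slurm job-array environment variables.

    Assumes the array ids are 0 .. N-1 (as with --array=0,1), so that the
    task id can be used directly as the rank.
    """
    env = os.environ if env is None else env
    rank = int(env["SLURM_ARRAY_TASK_ID"])
    world_size = int(env["SLURM_ARRAY_TASK_COUNT"])
    if not 0 <= rank < world_size:
        raise ValueError("array ids must be 0 .. world_size - 1")
    return rank, world_size


# Simulated environment of the second array task of --array=0,1.
example_env = {"SLURM_ARRAY_TASK_ID": "1", "SLURM_ARRAY_TASK_COUNT": "2"}
print(rank_from_slurm_array(example_env))  # (1, 2)
```

If the array ids do not start at 0, the task id can no longer serve as the rank directly and would need to be remapped first.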

Performance comparisons

The code for the first three tests comes from pytorch-multigpu.


  • gpu: p100;
  • gpu cluster: 2 gpus/node;
  • dataset: CIFAR10; batch size: 256; epoch: 1; iters: 196.


method     | epoch time | batch time
-----------|------------|-----------
single gpu | 2:34       | 1.04s
v1         | 2:17       | 0.70s
v2         | 2:09       | 0.60s
v3         | 2:01       | 0.58s

I didn't expect the version with two processes on a single machine (v2) to be slightly slower than the distributed version on two nodes (v3). Perhaps this is due to the HPC. In short, no matter how you use multiple GPUs, the speed does not scale in proportion to the number of GPUs, because of the communication cost of synchronizing the model. The only benefit you get from multiple GPUs is a bigger batch size.
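To put numbers on this, the speedups relative to the single-GPU run can be worked out from the table with a few lines of Python. The figures are copied from the table above; the helper `to_seconds` is just an illustration.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '2:34' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (epoch time, batch time in seconds), copied from the table above.
results = {
    "single gpu": ("2:34", 1.04),
    "v1": ("2:17", 0.70),
    "v2": ("2:09", 0.60),
    "v3": ("2:01", 0.58),
}

base_epoch = to_seconds(results["single gpu"][0])
base_batch = results["single gpu"][1]

speedups = {}
for method, (epoch_time, batch_time) in results.items():
    epoch_speedup = round(base_epoch / to_seconds(epoch_time), 2)
    batch_speedup = round(base_batch / batch_time, 2)
    speedups[method] = (epoch_speedup, batch_speedup)
    print(f"{method:10s} epoch x{epoch_speedup:.2f}  batch x{batch_speedup:.2f}")
```

Measured by epoch time, the multi-GPU versions are about 1.12x to 1.27x faster than a single GPU; measured by batch time, about 1.49x to 1.79x. Neither measure reaches 2x.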

Verify the models

This script will verify whether the models from different processes are synchronized.

python final_model_rank_0.pth final_model_rank_1.pth

# output
# layer3.15.bn3.running_var
# layer3.16.bn1.running_mean
# layer3.16.bn1.running_var
# layer3.16.bn2.running_mean
# layer3.16.bn2.running_var
# layer3.16.bn3.running_mean
# layer3.16.bn3.running_var
# bn_out.running_mean
# bn_out.running_var

From the output above, the only differences between the models are in the BN layers (the running_mean and running_var buffers), because the statistics of the mini-batches seen by different processes are not synchronized. So none of these three ways of using multiple GPUs improves the performance of BN. If you want to improve your BN performance, sync BN may meet your needs.
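The comparison itself is simple. Below is a minimal sketch that uses plain dicts of lists in place of real checkpoints; the function `differing_keys` and the toy values are made up for illustration and are not this repo's verification script. With real .pth files one would presumably load both state dicts with torch.load and compare the tensors with torch.equal instead.

```python
def differing_keys(state_a, state_b):
    """Return the sorted entry names whose values differ between two states.

    Names that exist in only one of the two states also count as differing.
    """
    names = sorted(set(state_a) | set(state_b))
    return [
        name for name in names
        if name not in state_a
        or name not in state_b
        or list(state_a[name]) != list(state_b[name])
    ]


# Toy stand-ins for the state dicts of rank 0 and rank 1 (made-up values):
# the learned weights agree, the BN running statistics do not.
rank0 = {
    "conv1.weight": [0.1, 0.2],
    "bn_out.running_mean": [0.50],
    "bn_out.running_var": [1.00],
}
rank1 = {
    "conv1.weight": [0.1, 0.2],
    "bn_out.running_mean": [0.52],
    "bn_out.running_var": [0.97],
}

for name in differing_keys(rank0, rank1):
    print(name)
```

In this toy case only the two BN buffers are printed, which mirrors the pattern in the real output above.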


The Chinese blogs:


A simple demo of distributed training in Pytorch





