Training with pipeline parallelism example #1517
Unanswered
hpc-unex asked this question in Community | Q&A
Replies: 1 comment 1 reply
Hello!
I'm trying to train the pipeline parallelism example provided in https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/pipeline_parallel (ResNet50 on CIFAR10).
I'm running on a single node with 2 GPUs, but something seems to be wrong with the execution: the accuracy stays at around ~10% from epoch 0 onwards and never improves. I'm using the resnet.py file from the repository, with the only change being that the processes are launched with MPI:
colossalai.launch(config=CONFIG,
                  host=None,
                  port=None,
                  backend='mpi',
                  rank=int(os.environ['OMPI_COMM_WORLD_RANK']),
                  world_size=int(os.environ['OMPI_COMM_WORLD_SIZE']),
                  local_rank=int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK']),
                  seed=opt.manualSeed)
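For reference, a minimal sketch of the kind of config and launch command this setup assumes (the values below are illustrative, not the exact ones from the repository's config file):

# Illustrative pipeline-parallel config for 2 GPUs; the NUM_MICRO_BATCHES value is an assumption
CONFIG = dict(NUM_MICRO_BATCHES=4,
              parallel=dict(pipeline=2))

# Launched with two Open MPI ranks so that the OMPI_COMM_WORLD_* variables read above are set, e.g.:
#   mpirun -np 2 python resnet.py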
This launch configuration has been tested with data and model parallelism and works correctly there. Any idea what might be going wrong? Has anyone tested this example?
Thanks!