- 📄 Sequence Parallel
In this codebase, we implement BERT with sequence parallelism. Sequence parallelism splits the input tensor and intermediate activations along the sequence dimension. This method achieves better memory efficiency and allows us to train with larger batch sizes and longer sequence lengths.
Paper: Sequence Parallelism: Long Sequence Training from System Perspective
To run this codebase, the following environment is required:
- CUDA: 11.3
- Python: > 3.7
- PyTorch: 1.11.0
You can follow the script below to set up the environment for this codebase.
```shell
# create conda environment
conda create -n seq python=3.9
conda activate seq

# install PyTorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# install NVIDIA apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@22.03

# install other dependencies
pip install -v -r requirements.txt
```
For this codebase, we will use the Wikipedia dataset and process the dataset with Megatron-LM's preprocessing script.
You can use the following script to download and extract the Wikipedia dataset. The extracted corpus will be stored in `./dataset/raw/corpus.json`. Execute the scripts in the root directory of this codebase.
```shell
# go to the root directory
cd Sequence-Parallelism

# create dataset workspace
mkdir dataset && cd ./dataset

# download
mkdir raw && cd ./raw
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# install wiki extractor
pip install git+https://github.com/FrankLeeeee/wikiextractor.git@v3.0.6

# extract text
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
cat text/*/* > ./corpus.json
```
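In `--json` mode, wikiextractor writes one JSON object per line, and the `"text"` field is the key Megatron's preprocessing script reads by default. A quick sanity-check sketch (the helper name is ours, not part of the codebase):

```python
import json

def peek_corpus(path, max_lines=5):
    """Return (title, text_length) for the first few corpus entries.

    Assumes one JSON object per line, each with "title" and "text"
    fields, which is the layout wikiextractor's --json mode produces.
    """
    entries = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            doc = json.loads(line)
            entries.append((doc.get("title", ""), len(doc.get("text", ""))))
    return entries

# peek_corpus("./dataset/raw/corpus.json")
```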
The following commands can be used to process the Wikipedia dataset. Many thanks to Megatron-LM for providing the preprocessing script that converts the corpus file into binary training data.
```shell
# go to the root directory
cd Sequence-Parallelism

# download vocab file
mkdir vocab && cd ./vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
cd ..

# clone the Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd ./Megatron-LM
git checkout b037a69eb2622bd824d88cc230ada94077b3716f

# run preprocessing
python tools/preprocess_data.py \
    --input ../dataset/raw/corpus.json \
    --output-prefix my-bert \
    --vocab ../vocab/bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 48

# move the processed data to the dataset directory
cd ..
mkdir -p dataset/processed
mv Megatron-LM/my-bert_text_sentence.* ./dataset/processed
```
After running these commands, you should see the following files in `dataset/processed`:
- my-bert_text_sentence.bin
- my-bert_text_sentence.idx
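A small check that both files of the pair are present (filenames follow from the `--output-prefix` used above; the helper itself is just a sketch):

```python
from pathlib import Path

def check_megatron_output(prefix):
    """Verify the .bin/.idx pair that Megatron's preprocessing emits.

    prefix is the data path without extension, e.g.
    './dataset/processed/my-bert_text_sentence'.
    Returns a list of missing extensions; empty means both are in place.
    """
    return [ext for ext in (".bin", ".idx")
            if not Path(prefix + ext).is_file()]
```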
We provide `train.py` for you to execute training. Before invoking the script, there are several steps to perform.

At the top of `config.py`, you can see two global variables, `DATA_PATH` and `VOCAB_FILE_PATH`:
```python
DATA_PATH = './dataset/processed/my-bert_text_sentence'
VOCAB_FILE_PATH = './vocab/bert-large-uncased-vocab.txt'
```
`DATA_PATH` refers to the path to the data files generated by Megatron's script. For example, in the section above, you should get two data files (my-bert_text_sentence.bin and my-bert_text_sentence.idx). You just need to set `DATA_PATH` to the path to the bin file without the file extension.

`VOCAB_FILE_PATH` refers to the path to the vocabulary file downloaded when you prepared the dataset (e.g. bert-large-uncased-vocab.txt).
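Since `DATA_PATH` is just the `.bin` path with its extension stripped, you can derive it programmatically; a small helper sketch (not part of the codebase):

```python
import os

def data_path_from_bin(bin_path):
    """Strip the extension so the dataset loader can append
    .bin/.idx itself, e.g.
    './dataset/processed/my-bert_text_sentence.bin'
      -> './dataset/processed/my-bert_text_sentence'."""
    root, ext = os.path.splitext(bin_path)
    assert ext == ".bin", "expected the .bin data file"
    return root

DATA_PATH = data_path_from_bin("./dataset/processed/my-bert_text_sentence.bin")
```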
Next, build the BERT dataset helper. Requirements are CUDA, `g++`, `pybind11`, and `make`.
```shell
cd ./data/datasets
make
```
The provided `config.py` defines a set of parameters, including the training scheme, model configuration, etc. You can also modify the ColossalAI settings: for example, if you wish to parallelize over the sequence dimension on 8 GPUs, you can change `size=4` to `size=8`. If you wish to use pipeline parallelism, you can set `pipeline=<num_of_pipeline_stages>`.
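As an illustration, the parallel setting in `config.py` might look like the fragment below. Treat this as a sketch: the exact key names follow the ColossalAI parallel-config convention and may differ in your version of the codebase.

```python
# Hypothetical excerpt from config.py; key names assume the
# ColossalAI parallel-config convention and may differ in practice.
parallel = dict(
    pipeline=1,                            # number of pipeline stages
    tensor=dict(size=8, mode='sequence'),  # split over 8 GPUs along the sequence dim
)
```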
Lastly, you can start training with sequence parallelism using the following command. You need to replace `<num-of-gpus>` with the number of GPUs you want to use.
```shell
python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_addr localhost --master_port 29500 train.py
```
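`torch.distributed.launch` exports each worker's rank information through environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`). In this codebase ColossalAI presumably consumes them for you, but a sketch of how a worker would read them (helper name is ours):

```python
import os

def launcher_env(env=os.environ):
    """Read the rank info torch.distributed.launch exports to each worker.

    Defaults mimic single-process execution with the master address
    and port used in the launch command above.
    """
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", 29500)),
    }
```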