This repository is no longer actively maintained. It was created when the BLOOM+1 paper was written, when we had to engineer the adapter modules ourselves because of the then-new BLOOM architecture.
Adapters for BLOOM models are now readily available (see the `peft` library), and language adaptation of these models (i.e., training of LLMs on monolingual corpora of a particular language) can be done by following official documentation such as peft-blog, using the same pretraining objective: next-token prediction.
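For reference, here is a minimal sketch of what such a peft-based language adaptation run might look like. This is not part of this repository; the model name, dataset split, and hyperparameters below are purely illustrative.

```python
# Minimal sketch (not part of this repo): LoRA language adaptation with the
# Hugging Face peft library on a monolingual corpus, using the standard
# causal-LM (next-token-prediction) objective.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigscience/bloom-560m"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the base model with LoRA adapters; only the adapter weights are trained.
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16,
                         lora_dropout=0.05, target_modules=["query_key_value"])
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Monolingual corpus of the target language (German OSCAR as an example).
dataset = load_dataset("oscar", "unshuffled_deduplicated_de", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bloom-560m-lora-de",
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=4,
                           learning_rate=1e-3, max_steps=1000),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```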
This repository contains code for performing language adaptation of the multilingual pretrained large language models BLOOM-{560m,1b1,1b7,3b,7b1} to new, unseen languages. Please refer to our ACL 2023 paper BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.
Our implementations support the following features:
- finetuning new tokenizers and embedding layers to support the new scripts of unseen languages.
- different embedding strategies: replace the embedding layer entirely and train it from scratch, reinitialize the embedding layer but keep pretrained embeddings for the seen vocabulary, or extend the embedding layer to support new tokens.
- more than 15 language adaptation strategies for pretrained BLOOM models, including continued pretraining and parameter-efficient finetuning such as BitFit (Zaken et al., 2021), (IA)^3 (Liu et al., 2022), LoRA (Hu et al., 2021), MAD-X (Pfeiffer et al., 2020), composable sparse finetuning (Ansell et al., 2022), etc. (A minimal BitFit sketch is shown after this list.)
- different evaluation settings:
  - supervised finetuning or cross-lingual transfer: task finetuning with (English) task adapters on WikiANN (NER tagging), XLSum (abstractive summarization), and XNLI (natural language inference). This is an artefact used for the preliminary experiments of our BLOOM+1 work.
  - zero-shot prompting on adapted language models, which is the setting used in our BLOOM+1 paper. This is done with a forked and modified version of EleutherAI's lm-eval-harness library; see the `bigscience-lm-adapt` branch.
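To give a flavour of how lightweight some of these strategies are, below is a minimal, repo-independent sketch of BitFit: freeze everything except the bias terms and train with the usual causal-LM objective. The model name is illustrative, and this is not the implementation used in this repository.

```python
# Minimal BitFit sketch (not this repo's implementation): train only bias terms.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # illustrative

# Freeze all parameters except the bias terms.
for name, param in model.named_parameters():
    param.requires_grad = name.endswith(".bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# Train as usual with a causal-LM objective; only the (tiny) set of bias
# parameters receives gradient updates.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```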
Installation:

1. Install the packages from composable-sft. This is used for composable-SFT finetuning.
2. Install the packages from rational_activations. You will need to follow the [Other CUDA/PyTorch] section for installation. This is used for adaptable adapters.
3. Install the packages from this repo using `pip install -r requirements.txt`.

If you encounter an error with `import transformers`, uninstall transformers using `pip uninstall transformers` and rerun step 3 to reinstall the transformers version supported by the `adapter-transformers` library.
Run `tokenized4clm_sampled.py` to train the tokenizer on a sampled subset of the OSCAR dataset. Arguments:

- `lang`: language name (e.g., "de", "th")
- `model`: original tokenizer (e.g., "bigscience/bloom-1b3")
- `tokenizer_dir`: directory to save the tokenizer. The tokenizer will be saved as `tok_${model}_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_{replace/extend}`.
- `cache_dir` (default: "~/.cache/huggingface/transformers"): cache directory for downloading the OSCAR dataset and the GPT2 tokenizer.
- `vocab_size`: vocabulary size of the tokenizer
- `sample_size`: number of (randomly selected) samples used to train the tokenizer
- `tok_strategy`: `extend`, `replace`, or `overlap-replace`
```bash
cache_dir=...
output_dir=...
lang=... # language
sample_size=... # training sample size
vocab_size=... # vocab size of tokenizer
tok_strategy=... # extend, replace, overlap-replace
bigs_model="bigscience/bloom-1b3"
tokenizer_dir="${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}"
python ./scripts/lang_adapt/tokenized4clm_sampled.py \
--lang $lang \
--model $bigs_model \
--tokenizer_dir $tokenizer_dir \
--hf_cache_dir $cache_dir \
--vocab_size $vocab_size \
--sample_size $sample_size \
--tok_strategy $tok_strategy
```
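Conceptually, training the new tokenizer boils down to Hugging Face's `train_new_from_iterator` on sampled OSCAR text. The following simplified sketch illustrates the idea only; it is not `tokenized4clm_sampled.py` itself, and the language, sample size, and vocab size are illustrative.

```python
# Simplified sketch of new-tokenizer training (not tokenized4clm_sampled.py itself).
from datasets import load_dataset
from transformers import AutoTokenizer

lang = "de"            # illustrative
sample_size = 100_000  # illustrative
vocab_size = 24_000    # illustrative

# Randomly sample documents from the target-language OSCAR split.
dataset = load_dataset("oscar", f"unshuffled_deduplicated_{lang}", split="train")
dataset = dataset.shuffle(seed=42).select(range(sample_size))

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Start from the original BLOOM tokenizer and learn a new vocabulary
# of the requested size on the sampled monolingual data.
old_tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
new_tokenizer = old_tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=vocab_size)
new_tokenizer.save_pretrained(
    f"tok_bloom-560m_{lang}_oscar_{sample_size}samples_{vocab_size}vocab_replace"
)
```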
Run `madx_run_clm.py` to finetune the language model on a new language. Arguments:

- `LANG`: language name (e.g., "de", "th") on OSCAR
- `DATA_SAMPLES`: training sample size
- `VOCAB_SIZE`: vocab size of the tokenizer
- `BIGS_MODEL`: BigScience model (e.g., "bigscience/bloom-1b3")
- `ADPT_STRATEGY`: language adaptation strategy
  - `"emb"`: train only the embedding layer
  - `"continual-pretrain"`: continued pretraining of the entire BLOOM model
  - `"emb-then-adpt"`: train the embedding first, then the Pfeiffer adapter (sequential training)
  - `"pfeiffer"`, `"pfeiffer+inv"`: Pfeiffer adapters in the transformer blocks (Houlsby et al., 2019), without or with invertible adapters in the embedding layer; also known as MAD-X (Pfeiffer et al., 2020)
  - `"lora"`: LoRA adapters in the transformer blocks (Hu et al., 2021)
  - `"aa"`: adaptable adapters (Moosavi et al., 2022)
  - `"ia3"`, `"ia3+inv"`: (IA)^3 adapters in the transformer blocks (Liu et al., 2022), without or with invertible adapters in the embedding layer
  - `"prefix_tuning"`, `"prefix_tuning_flat"`: prefix tuning in the input space, either initializing the prefix tokens with MLP layers (without `flat`) or directly (with `flat`) (Li & Liang, 2021)
  - `"prompt-tuning"`: prompt tuning in the transformer blocks (Lester et al., 2021)
  - `"sft"`: composable sparse finetuning (Ansell et al., 2022)
  - `"bitfit"`, `"bitfit+inv"`: finetuning of the bias terms (Zaken et al., 2021), without or with invertible adapters in the embedding layer
  - `"fish"`: finetuning with FISH masks (Sung et al., 2021)
  - `"compacter"`, `"compacterpp"`: Compacter or Compacter++ adapters in the transformer blocks (Mahabadi et al., 2021)
- `EMBD_SRATEGY`: embedding strategy. Either `"replace"` (replace the embedding layer entirely), `"overlap-replace"` (replace the embedding layer but initialize the seen vocabulary with pretrained embeddings), or `"extend"` (freeze the seen-vocabulary embeddings and add trainable embeddings for the unseen vocabulary; see the sketch after this list).
- `TOK_STRATEGY`: tokenization strategy. Either `"replace"` (for the "replace" and "overlap-replace" embedding strategies) or `"extend"`.
- `tokenizer_dir`: directory of the saved tokenizer (from the tokenization script above)
- `cache_dir`: (as above)
- `output_dir`: directory to save the adapted model
- `logging_dir`: directory for logging loss curves to TensorBoard
- `MAX_STEPS`: number of training steps
- `EVAL_STEPS`: number of training steps between two evaluations
- `SAVE_STEPS`: number of training steps between checkpoint saves
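As a rough, repo-independent illustration of the `"extend"` embedding strategy above: the embedding matrix is resized to cover the extended vocabulary, and gradients for the original (seen) rows are zeroed out so that only the newly added embeddings are trained. The model name and tokenizer path below are illustrative, and this is not the repo's exact code.

```python
# Rough sketch of the "extend" embedding strategy (not the repo's exact code):
# resize the embedding to cover the extended vocabulary, then zero out gradients
# for the original rows so only the newly added embeddings are trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")          # illustrative
extended_tokenizer = AutoTokenizer.from_pretrained("path/to/extended_tokenizer")  # illustrative path

original_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(extended_tokenizer))

def freeze_seen_rows(grad):
    # Zero the gradient of embedding rows that existed before extension.
    grad = grad.clone()
    grad[:original_vocab_size] = 0
    return grad

model.get_input_embeddings().weight.register_hook(freeze_seen_rows)
```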
```bash
LANG=... # language
DATA_SAMPLES=... # training sample size
VOCAB_SIZE=... # vocab size of newly trained tokenizer
BIGS_MODEL="bigscience/bloom-1b3"
ADPT_STRATEGY="emb" # language adaptation strategy (train only embedding for now)
EMBD_SRATEGY=... # either "replace", "overlap-replace", or "extend"
TOK_STRATEGY=... # either "replace" (for embedding strategy of "replace" and "overlap-replace") or "extend"
tokenizer_dir=... # as above
tokenizer_dir="${tokenizer_dir}/tok_${BIGS_MODEL##*/}_${LANG}_oscar_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${TOK_STRATEGY}"
cache_dir=... # as above
output_dir=... # directory to save adapted model
output_dir="${output_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}"
logging_dir=... # directory to log loss curves to tensorboard
logging_dir="${logging_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}"
mkdir -p $output_dir
mkdir -p $logging_dir
MAX_STEPS=50000
EVAL_STEPS=5000
SAVE_STEPS=5000
python ./scripts/lang_adapt/madx_run_clm.py \
--seed 0 \
--fp16 \
--model_name_or_path $BIGS_MODEL \
--tokenizer_name $tokenizer_dir \
--dataset_name oscar \
--cache_dir $cache_dir \
--dataset_config_name "unshuffled_deduplicated_${LANG}" \
--logging_dir $logging_dir \
--report_to "tensorboard" \
--learning_rate 0.001 \
--do_train \
--do_eval \
--output_dir $output_dir \
--preprocessing_num_workers 8 \
--overwrite_output_dir \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--per_device_eval_batch_size 2 \
--eval_accumulation_steps 4 \
--eval_steps $EVAL_STEPS \
--evaluation_strategy "steps" \
--max_eval_samples 5000 \
--save_steps $SAVE_STEPS \
--save_strategy "steps" \
--max_train_samples $DATA_SAMPLES \
--max_steps $MAX_STEPS \
--logging_steps 1000 \
--lang_adapt_strategies $ADPT_STRATEGY \
--embedding_strategies $EMBD_SRATEGY \
--load_best_model_at_end \
--gradient_checkpointing
```
BLOOM+1 Reproduction: see `./scripts/lang_adapt/example_scripts/run_clm_ru_madx_560m.sh` to reproduce the language adaptation of BLOOM-560m models to Russian in our BLOOM+1 paper.
For multi-GPU training with DeepSpeed:

- Replace `python` in `python ./scripts/lang_adapt/madx_run_clm.py` with `deepspeed --num_gpus=8 --master_port 60000`.
- Pass the DeepSpeed config file argument `--deepspeed "/home/zhengxinyong/multilingual-modeling/scripts/lang_adapt/ds_config_zero2.json"`.

See the example file at `./scripts/lang_adapt/example_scripts/run_clm_ru_madx_7b1_deepspeed.sh`, which adapts the BLOOM-7b1 model on 8 A100 GPUs on Google Cloud.
Prompt the adapted language model in a zero-shot fashion without any finetuning. You will need to `git clone` https://github.com/yongzx/lm-evaluation-harness/tree/bigscience-lm-adapt to run the experiments.

The evaluation commands for XNLI zero-shot prompting are shown below; you can also find them in `lm-evaluation-harness/examples/`.
For BLOOM+1, the tasks used are:

- `xnli` (XNLI: Evaluating Cross-lingual Sentence Representations)
- `amnli` (AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages)
- `pawsx` (PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification)
- `xcopa` (XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning)
- `xstory` (Multilingual Story Cloze Test and ROCStories Corpora)
- `xwino` (Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution)
Baseline or Model-Based (BitFit, FISH Mask, etc.)
```bash
python3 lm-evaluation-harness/main.py \
--model bigscience \
--model_args tokenizer="bigscience/bloom-560m",pretrained="ZYONG2/saved_models/bloom-560m_de_bitfit_100000samples_-1vocab_original-frozen" \
--tasks xnli_de
```
Using Adapters (MAD-X, Pfeiffer, IA3, LoRA, etc.)
```bash
python3 lm-evaluation-harness/main.py \
--model bigscience \
--model_args tokenizer="bigscience/bloom-560m",pretrained="bigscience/bloom-560m",adapter_ckpt_folder="ZYONG2/saved_models/bloom-560m_de_ia3_100000samples_-1vocab_original-frozen/oscar_ia3_de" \
--tasks xnli_de
```
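Conceptually, zero-shot prompting scores each candidate label's verbalization as a continuation of the prompt under the (adapted) language model and picks the highest-likelihood one. The sketch below illustrates this idea with plain transformers; it is not lm-evaluation-harness itself, and the prompt template and verbalizers are illustrative rather than the exact ones used in BLOOM+1.

```python
# Minimal sketch of zero-shot prompt scoring (not lm-evaluation-harness itself).
# Prompt template and verbalizers are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # illustrative
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
model.eval()

premise = "Die Katze schläft auf dem Sofa."
hypothesis = "Ein Tier ruht sich aus."
context = f"{premise} Frage: {hypothesis} Ja, Nein oder Vielleicht? Antwort:"
candidates = [" Ja", " Vielleicht", " Nein"]  # entailment / neutral / contradiction

def continuation_logprob(context, continuation):
    # Assumes the context tokenization is a prefix of the full tokenization
    # (typically true when the continuation starts with a space).
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    cont_ids = full_ids[0, ctx_ids.shape[1]:]
    cont_logprobs = logprobs[0, ctx_ids.shape[1] - 1 : -1]
    return cont_logprobs[torch.arange(len(cont_ids)), cont_ids].sum().item()

scores = {c: continuation_logprob(context, c) for c in candidates}
print(max(scores, key=scores.get))
```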
Supervised Finetuning or Cross-Lingual Transfer (only used for preliminary experiments before BLOOM was released)
```bash
OUTPUT_DIR=... # directory to save checkpoints
LANG="de"
CACHE_DIR=... # cache dir for saving/loading HF models and XNLI datasets.
LR=1e-5
MODEL_NAME="ZYONG2/bigscience/tr5b-1B3-multilingual-alpha-checkpoints" # previous version of BLOOM pre-release
TOKENIZER_NAME="ZYONG2/processed/011/oscar-de-tokenizer"
# language adapters checkpoint folder
MADX_LANG_ADAPTER_NAME=".../oscar_de"
# we finetune task adapters for XNLI
FT_STRATEGIES="task_adapters"
mkdir -p $OUTPUT_DIR
python adapters_xnli_de.py \
$OUTPUT_DIR \
--lang $LANG \
--cache_dir $CACHE_DIR \
--num_train_epochs 2 \
--learning_rate $LR \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 4 \
--pretrained_model $MODEL_NAME \
--tokenizer $TOKENIZER_NAME \
--do_train \
--do_eval_after_train \
--madx_lang_adapter $MADX_LANG_ADAPTER_NAME \
--finetune_strategies $FT_STRATEGIES \
--zero_shot
```
Remove `--zero_shot` for the supervised finetuning setting.
See the example scripts in `./scripts/eval/task_ftscripts_xnli/`: `train_xnli_zero_shot.sh` is the batch script for XNLI finetuning, and `run_eval_xnli_zero_shot.sh` is for evaluating the trained XNLI task adapters.
If you use this repository, please cite our paper:

```bibtex
@inproceedings{yong-etal-2023-bloom,
title = "{BLOOM}+1: Adding Language Support to {BLOOM} for Zero-Shot Prompting",
author = "Yong, Zheng Xin and Schoelkopf, Hailey and Muennighoff, Niklas and Aji, Alham Fikri and Adelani, David Ifeoluwa and Almubarak, Khalid and Bari, M Saiful and Sutawika, Lintang and Kasai, Jungo and Baruwa, Ahmed and Winata, Genta and Biderman, Stella and Raff, Edward and Radev, Dragomir and Nikoulina, Vassilina",
editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.653",
doi = "10.18653/v1/2023.acl-long.653",
pages = "11682--11703",
}
```