bigcode-project/bigcode-encoder

BERT pre-training on The Stack

Exploration of BERT-like models trained on The Stack.

  • Code used to train StarEncoder.

    • StarEncoder was fine-tuned for PII detection to pre-process the data used to train StarCoder.
  • This repo also contains functionality to train encoders with contrastive objectives.

  • More details.

To launch pre-training:

After installing requirements, training can be launched via the example launcher script:

./launcher.sh

Note that:

  • --train_data_name can be used to set the training dataset.

  • Hyperparameters can be changed in exp_configs.py.

    • The tokenizer is treated as a hyperparameter and must also be set in exp_configs.py.
    • alpha weighs the BERT losses (NSP+MLM) against the contrastive objective (see the sketch after this list).
      • Setting alpha to 1 corresponds to the standard BERT objective.
    • Token masking probabilities are set as separate hyperparameters: one for MLM and one for the contrastive loss.
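
The sketch below illustrates how such an alpha-weighted mix of the BERT losses and a contrastive term could be computed. It is not the repo's actual code: the function name, tensor shapes, temperature, and the InfoNCE-style contrastive term are assumptions; only the role of alpha (setting it to 1 recovers the plain BERT objective) comes from the notes above.

import torch
import torch.nn.functional as F

def combined_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels,
                  emb_view_1, emb_view_2, alpha=1.0, temperature=0.05):
    # Masked language modelling loss over the vocabulary; positions labelled
    # -100 (unmasked tokens) are ignored, following the usual convention.
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # Next sentence prediction: binary classification over segment pairs.
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)

    # InfoNCE-style contrastive term between two embeddings of the same
    # sequence (e.g. two independently masked views); this particular choice
    # of contrastive objective is an assumption, not the repo's definition.
    z1 = F.normalize(emb_view_1, dim=-1)
    z2 = F.normalize(emb_view_2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive_loss = F.cross_entropy(logits, targets)

    # alpha = 1.0 recovers the standard BERT objective (MLM + NSP);
    # smaller values mix in the contrastive loss.
    return alpha * (mlm_loss + nsp_loss) + (1.0 - alpha) * contrastive_loss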
