bigcode-project/bigcode-encoder

BERT pre-training on The Stack

Exploration of BERT-like models trained on The Stack.

  • Code used to train StarEncoder.

    • StarEncoder was fine-tuned for PII detection to pre-process the data used to train StarCoder.
  • This repo also contains functionality to train encoders with contrastive objectives.

  • More details.

To launch pre-training:

After installing requirements, training can be launched via the example launcher script:

./launcher.sh

Note that:

  • --train_data_name can be used to set the training dataset.

  • Hyperparameters can be changed in exp_configs.py.

    • The tokenizer is treated as a hyperparameter and must also be set in exp_configs.py.
    • alpha weighs the BERT losses (NSP+MLM) against the contrastive objective (see the sketch after this list).
      • Setting alpha to 1 corresponds to the standard BERT objective.
    • Token masking probabilities are set as separate hyperparameters: one for MLM and one for the contrastive loss.
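
The sketch below illustrates how such an alpha-weighted mix of the BERT losses and a contrastive term could be computed. It is not the repo's actual code: the function name, tensor shapes, temperature, and the InfoNCE-style contrastive term are assumptions; only the role of alpha (setting it to 1 recovers the plain BERT objective) comes from the notes above.

import torch
import torch.nn.functional as F

def combined_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels,
                  emb_view_1, emb_view_2, alpha=1.0, temperature=0.05):
    # Masked language modelling loss over the vocabulary; positions labelled
    # -100 (unmasked tokens) are ignored, following the usual convention.
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # Next sentence prediction: binary classification over segment pairs.
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)

    # InfoNCE-style contrastive term between two embeddings of the same
    # sequence (e.g. two independently masked views); this particular choice
    # of contrastive objective is an assumption, not the repo's definition.
    z1 = F.normalize(emb_view_1, dim=-1)
    z2 = F.normalize(emb_view_2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive_loss = F.cross_entropy(logits, targets)

    # alpha = 1.0 recovers the standard BERT objective (MLM + NSP);
    # smaller values mix in the contrastive loss.
    return alpha * (mlm_loss + nsp_loss) + (1.0 - alpha) * contrastive_loss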
