This repository hosts the code that implements all the methods proposed in the paper "Time Travel in LLMs: Tracing Data Contamination in Large Language Models" by Shahriar Golchin and Mihai Surdeanu.
Explore more resources related to this paper: video, poster, and media.
Our research is the first to systematically uncover and detect data contamination in fully black-box large language models (LLMs). The core idea is that if an LLM has seen a dataset instance during pre-training, it can replicate it. This rests on two observations: (1) LLMs have enough capacity to memorize data; and (2) LLMs are trained to follow instructions effectively. However, explicitly asking an LLM to reproduce such instances is ineffective, because it triggers the safety filters put in place to prevent the generation of copyrighted content. Our method circumvents these filters by guiding the LLM to complete dataset instances from their random-length initial segments. Below is an example of this strategy in action, in which GPT-4 exactly replicates the subsequent segment of an instance from the train split of the IMDB dataset.
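To make the idea concrete in code, here is a minimal, hedged sketch of such a guided completion request. The prompt wording, the placeholder instance, the model name, and the use of the openai Python client are illustrative assumptions; they are not the exact guided instruction or scripts used in the paper or this repository.

# Illustrative sketch only: NOT the exact guided prompt from the paper.
# Assumes the openai Python package (>= 1.0) and an exported OPENAI_API_KEY.
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder text standing in for a real dataset instance.
instance = "PLACEHOLDER: full text of an instance from the train split of the IMDB dataset"
words = instance.split()
cut = random.randint(1, len(words) - 1)        # random-length initial segment
first_piece = " ".join(words[:cut])

prompt = (
    "You are provided with the first piece of an instance from the train split "
    "of the IMDB dataset. Finish the second piece of the instance exactly as it "
    "appears in the dataset.\n\n"
    f"First piece: {first_piece}\n\nSecond piece:"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)  # compare this continuation to the original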
Start the process by cloning the repository using the command below:
git clone https://github.com/shahriargolchin/time-travel-in-llms.git
Make sure you are in the project's root directory. If not, navigate to it by executing the following command:
cd time-travel-in-llms
Next, create a virtual environment:
python3.11 -m venv time-travel-venv
Now, activate your environment:
source time-travel-venv/bin/activate
Lastly, use pip to install all the requisite packages:
pip install -r requirements.txt
Important
Note that the command above installs the packages needed to run evaluations with ROUGE-L and GPT-4 in-context learning (ICL). Evaluation with BLEURT requires an additional installation, since BLEURT is used as a dependency of this project. To install it, execute the following commands or follow the BLEURT repository, but make sure the package is located in the dependencies/bleurt_scorer directory of this project. You may skip these steps if you do not need to perform evaluation using BLEURT.
git clone https://github.com/google-research/bleurt.git dependencies/bleurt_scorer
cd dependencies/bleurt_scorer
pip install .
Then, download a model checkpoint for BLEURT by running the following commands. We used the BLEURT-20 checkpoint in our study, and the commands below download this particular checkpoint; you can use any other checkpoint from the list available here.
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
If you do not have wget installed, you can use curl instead:
curl -O https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
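With the checkpoint in place, the following is a minimal, hedged sketch of how BLEURT can score a generated continuation against the reference one. The checkpoint path, example strings, and variable names are illustrative assumptions; the project's own evaluation code may differ.

# Minimal BLEURT scoring sketch (illustrative; not the project's exact evaluation code).
# Requires the BLEURT package installed above and the unzipped BLEURT-20 checkpoint.
from bleurt import score

checkpoint = "BLEURT-20"  # path to the unzipped checkpoint directory
scorer = score.BleurtScorer(checkpoint)

references = ["the original continuation of the dataset instance"]
candidates = ["the continuation generated by the LLM"]

scores = scorer.score(references=references, candidates=candidates)
print(scores)  # one float per (reference, candidate) pair; higher means a closer match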
For all the settings discussed in the paper, we provide corresponding bash scripts in the scripts directory. Running these scripts detects data contamination for the examined subset of data. In the results directory, an individual text file is generated for each evaluation method (ROUGE-L, BLEURT, or GPT-4 ICL) to display pass/fail results for the detected contamination. The input CSV files, along with all intermediate results, are also stored in the corresponding subdirectories under the results directory.
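For intuition on the ROUGE-L check, here is a minimal, hedged sketch using the rouge-score package. The package choice, the example strings, and the absence of a decision threshold are assumptions made for illustration; the repository's scripts implement the actual evaluation.

# Illustrative ROUGE-L overlap check (a sketch, not the project's exact logic).
# Assumes the rouge-score package; the repository may use a different implementation
# and applies its own decision threshold for flagging contamination.
from rouge_score import rouge_scorer

reference = "the original continuation of the dataset instance"
generation = "the continuation produced by the LLM"

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
result = scorer.score(reference, generation)
print(result["rougeL"].fmeasure)  # high overlap suggests the instance may have been seen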
Before running experiments, export your OpenAI API key so that OpenAI models are accessible. You can do so with the following command:
export OPENAI_API_KEY=your-api-key
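As an optional sanity check (a sketch based on the current openai Python client; not part of this repository's scripts), you can verify that the exported key is visible and valid before launching a script:

# Optional sanity check: confirm the exported key is visible to Python and accepted by the API.
import os
from openai import OpenAI

assert os.environ.get("OPENAI_API_KEY"), "Export OPENAI_API_KEY before running the scripts."
client = OpenAI()                       # picks up OPENAI_API_KEY from the environment
print(client.models.retrieve("gpt-4"))  # raises an error if the key is invalid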
To run an experiment, first navigate to the scripts/dataset-name directory, where the bash scripts for each partition of a dataset (e.g., train, test/validation) are located. You can do this with the command below (assuming you are in the root directory):
cd scripts/dataset-name
Once in the respective directory, make the bash file executable by running the following command:
chmod +x bash-file-name.sh
Finally, run the experiment by executing:
./bash-file-name.sh
If you find our work useful, please cite our paper:

@article{DBLP:journals/corr/abs-2308-08493,
author = {Shahriar Golchin and
Mihai Surdeanu},
title = {Time Travel in LLMs: Tracing Data Contamination in Large Language
Models},
journal = {CoRR},
volume = {abs/2308.08493},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2308.08493},
doi = {10.48550/ARXIV.2308.08493},
eprinttype = {arXiv},
eprint = {2308.08493},
timestamp = {Thu, 24 Aug 2023 12:30:27 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2308-08493.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
If you are interested in data contamination detection in LLMs, you might also find our second paper, Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models (repo available here), useful. In that paper, we present a novel method that not only detects contamination in LLMs but also estimates its amount in fully black-box LLMs. You can cite it using the BibTeX entry below:
@article{DBLP:journals/corr/abs-2311-06233,
author = {Shahriar Golchin and
Mihai Surdeanu},
title = {Data Contamination Quiz: {A} Tool to Detect and Estimate Contamination
in Large Language Models},
journal = {CoRR},
volume = {abs/2311.06233},
year = {2023},
url = {https://doi.org/10.48550/arXiv.2311.06233},
doi = {10.48550/ARXIV.2311.06233},
eprinttype = {arXiv},
eprint = {2311.06233},
timestamp = {Wed, 15 Nov 2023 16:23:10 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2311-06233.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}