Paper link - https://arxiv.org/abs/2303.16445
Datasets are available on 🤗 at https://huggingface.co/text-machine-lab:

    from datasets import load_dataset

    dataset = load_dataset("text-machine-lab/NEG-1500-SIMP-TEMP")
    dataset = load_dataset("text-machine-lab/NEG-1500-SIMP-GEN")
    dataset = load_dataset("text-machine-lab/ROLE-1500")
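Once loaded, the splits and examples can be inspected like any 🤗 dataset. A quick sketch (the split name `"train"` is an assumption; print the dataset object to see the actual splits and columns):

```python
from datasets import load_dataset

# Load one of the extended datasets and peek at its contents.
dataset = load_dataset("text-machine-lab/ROLE-1500")
print(dataset)              # available splits and columns
print(dataset["train"][0])  # first example (split name "train" is an assumption)
```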
In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500), inspired by the psycholinguistic studies of Ettinger (2020). We dramatically extend the existing NEG-136 and ROLE-88 benchmarks using GPT-3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another version of the extended negation dataset (NEG-1500-SIMP-TEMP) using template-based generation; it consists of 770 sentence pairs. We evaluate 22 models on the generated datasets.
The new datasets can be found at `data/`:

- `NEG-1500-SIMP-TEMP` (generated using a template)
- `NEG-1500-SIMP-GEN` (generated by GPT-3)
- `ROLE-1500` (generated by GPT-3)
- `original_datasets` contains the original datasets from Ettinger (2020).

Categories and subcategories are listed in `neg_simp_categories.csv`, which were used to create `NEG-1500-SIMP-TEMP`.
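For a quick look at one of the generated files without the 🤗 loader, the `.txt` files can be read as plain text; this sketch only prints raw lines and makes no assumption about the field layout inside each line:

```python
# Print the first few lines of a generated dataset file.
# The path matches the evaluation examples below.
with open("data/NEG-1500-SIMP-GEN.txt") as f:
    for line in f.readlines()[:5]:
        print(line.rstrip())
```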
Install the dependencies:

    pip install -r requirements.txt
The codebase supports:

- BERT - `bert-base-uncased`, `bert-large-uncased`
- RoBERTa - `roberta-base`, `roberta-large`
- DistilBERT - `distilbert-base-uncased`
- ALBERT - `albert-base-v1`, `albert-large-v1`, `albert-xl-v1`, `albert-xxl-v1`, `albert-base-v2`, `albert-large-v2`, `albert-xl-v2`, `albert-xxl-v2`
- T5 - `t5-small`, `t5-base`, `t5-large`, `t5-xl` (3b)
- GPT-2 - `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`
- GPT-3 - `text-davinci-002`
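These checkpoints are probed in a cloze style: the target word is masked and the model's top-k candidates are compared against the expected completion (hence the top 20 | 10 | 5 | 1 accuracies reported below). A minimal illustration of the masked-LM case using the `transformers` fill-mask pipeline; this sketches the idea, not the repo's evaluation code:

```python
from transformers import pipeline

# Mask the target word of a negation probe and inspect the top predictions.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("A robin is not a [MASK].", top_k=5):
    print(pred["token_str"], round(pred["score"], 4))
```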
- To evaluate a particular model from the list above (except GPT-3) on an extended dataset, run

      python src/evaluation.py [FILEPATH] [MODELNAME] --prediction [True or False]

  e.g.

      python src/evaluation.py 'data/NEG-1500-SIMP-GEN.txt' bert-base-uncased
      python src/evaluation.py 'data/ROLE-1500.txt' bert-base-uncased

  Set `--prediction` to `True` if no predictions are saved from previous runs, and to `False` if the predictions already exist and only accuracy needs to be computed. The default value is `False`.
- For GPT-3, run

      python src/evaluation.py [FILEPATH] gpt3 --key [OPENAI KEY] --prediction [True or False]

  Save the OPENAI KEY as an environment variable.
- To evaluate all models, run

      python src/evaluate_all_models.py [FILENAME] --prediction [True or False]

  e.g. `python src/evaluate_all_models.py 'data/NEG-1500-SIMP.txt'`
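To run just a subset of models, the single-model CLI above can be looped over; a convenience sketch (the chosen models and dataset are arbitrary examples):

```python
import subprocess

# Run the documented single-model evaluation CLI for a few models in turn.
for model in ["bert-base-uncased", "roberta-base", "gpt2"]:
    subprocess.run(
        ["python", "src/evaluation.py", "data/NEG-1500-SIMP-GEN.txt",
         model, "--prediction", "True"],
        check=True,
    )
```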
The top 20 predictions for the extended datasets are saved in `predictions/`. Results are saved in `result/`. Each file has the model name and the top 20 | top 10 | top 5 | top 1 prediction accuracy. `sensitivity-neg-all.txt` has the negation sensitivity results. Sensitivity stands for the number of times the target word flipped when a `not` is added to the affirmative sentence; the sensitivity is `#flipped / #affirmative sentences`.

- #affirmative sentences for NEG-1500-SIMP-TEMP = 770
- #affirmative sentences for NEG-1500-SIMP-GEN = 750
- #affirmative sentences for NEG-136-SIMP = 18
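Given these counts, sensitivity is a single ratio; a sketch with a made-up flip count:

```python
# Sensitivity = #flipped / #affirmative sentences.
flipped = 300        # hypothetical number of flipped target words
affirmative = 750    # NEG-1500-SIMP-GEN affirmative sentence count (from above)
print(f"sensitivity = {flipped / affirmative:.3f}")
```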
To avoid overwriting the results, `result/` also keeps the results from previous runs.
To extend the NEG-136-SIMP or ROLE-88 dataset, run

    python src/generate_data.py --dataset [DATASET NAME] --key [OPENAI KEY]

DATASET NAME can be `negsimp_template`, `negsimp_gen`, or `role`. OPENAI KEY is None for `negsimp_template`; otherwise, save it as an environment variable.
For `role`, the number of in-context samples is 4, and the number of times the script prompts GPT-3 for a response defaults to 10, which generates around 40 sentence pairs. Once the samples are generated, they are manually cleaned.
The generated datasets are saved in `data/`.
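The in-context generation step boils down to prompting GPT-3 with a handful of existing pairs and asking for more. A rough sketch assuming the legacy `openai` (<1.0) SDK and the `text-davinci-002` model listed above; the prompt, the environment-variable name, and the example pair are illustrative, not the exact ones in `src/generate_data.py`:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_KEY"]  # variable name is an assumption

# A role-reversal pair as in-context example, followed by an open slot.
prompt = (
    "Generate sentence pairs where the two roles are swapped:\n"
    "the teacher noticed which student the principal had called\n"
    "the teacher noticed which principal the student had called\n\n"
)
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=100,
    temperature=0.7,
)
print(response["choices"][0]["text"])
```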
Battig, W. F., & Montague, W. E. (1969). Category norms of verbal items in 56 categories: A replication and extension of the Connecticut category norms. Journal of Experimental Psychology, 80, 1-46.

Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48. https://arxiv.org/abs/1907.13528
If you use this work, please cite:
    @article{shivagunde2023larger,
      title={Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning},
      author={Shivagunde, Namrata and Lialin, Vladislav and Rumshisky, Anna},
      journal={arXiv preprint arXiv:2303.16445},
      year={2023}
    }