Paper link - https://arxiv.org/abs/2303.16445
Datasets are available on 🤗 at https://huggingface.co/text-machine-lab:

    from datasets import load_dataset

    dataset = load_dataset("text-machine-lab/NEG-1500-SIMP-TEMP")
    dataset = load_dataset("text-machine-lab/NEG-1500-SIMP-GEN")
    dataset = load_dataset("text-machine-lab/ROLE-1500")
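Once loaded, the splits and examples can be inspected like any 🤗 dataset. A quick sketch (the split name `"train"` is an assumption; print the dataset object to see the actual splits and columns):

```python
from datasets import load_dataset

# Load one of the extended datasets and peek at its contents.
dataset = load_dataset("text-machine-lab/ROLE-1500")
print(dataset)              # available splits and columns
print(dataset["train"][0])  # first example (split name "train" is an assumption)
```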
In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500), inspired by the psycholinguistic studies of Ettinger (2020). We dramatically extend the existing NEG-136 and ROLE-88 benchmarks using GPT-3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another version of the extended negation dataset (NEG-1500-SIMP-TEMP) using template-based generation; it consists of 770 sentence pairs. We evaluate 22 models on the generated datasets.
The new datasets can be found at `data/`:

- `NEG-1500-SIMP-TEMP` (generated using a template)
- `NEG-1500-SIMP-GEN` (generated by GPT-3)
- `ROLE-1500` (generated by GPT-3)
- `original_datasets` contains the original datasets from Ettinger (2020).

Categories and subcategories are listed in `neg_simp_categories.csv`, which were used to create `NEG-1500-SIMP-TEMP`.
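For a quick look at one of the generated files without the 🤗 loader, the `.txt` files can be read as plain text; this sketch only prints raw lines and makes no assumption about the field layout inside each line:

```python
# Print the first few lines of a generated dataset file.
# The path matches the evaluation examples below.
with open("data/NEG-1500-SIMP-GEN.txt") as f:
    for line in f.readlines()[:5]:
        print(line.rstrip())
```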
Install the dependencies:

    pip install -r requirements.txt
The codebase supports:

- BERT - `bert-base-uncased`, `bert-large-uncased`
- RoBERTa - `roberta-base`, `roberta-large`
- DistilBERT - `distilbert-base-uncased`
- ALBERT - `albert-base-v1`, `albert-large-v1`, `albert-xl-v1`, `albert-xxl-v1`, `albert-base-v2`, `albert-large-v2`, `albert-xl-v2`, `albert-xxl-v2`
- T5 - `t5-small`, `t5-base`, `t5-large`, `t5-xl` (3b)
- GPT-2 - `gpt2`, `gpt2-medium`, `gpt2-large`, `gpt2-xl`
- GPT-3 - `text-davinci-002`
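These checkpoints are probed in a cloze style: the target word is masked and the model's top-k candidates are compared against the expected completion (hence the top 20 | 10 | 5 | 1 accuracies reported below). A minimal illustration of the masked-LM case using the `transformers` fill-mask pipeline; this sketches the idea, not the repo's evaluation code:

```python
from transformers import pipeline

# Mask the target word of a negation probe and inspect the top predictions.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("A robin is not a [MASK].", top_k=5):
    print(pred["token_str"], round(pred["score"], 4))
```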
- To evaluate a particular model from the list above (except GPT-3) on an extended dataset, run

      python src/evaluation.py [FILEPATH] [MODELNAME] --prediction [True or False]

  e.g.

      python src/evaluation.py 'data/NEG-1500-SIMP-GEN.txt' bert-base-uncased
      python src/evaluation.py 'data/ROLE-1500.txt' bert-base-uncased

  Set `--prediction` to `True` if no predictions are saved from previous runs, and to `False` if the predictions already exist and only accuracy needs to be computed. The default value is `False`.
- For GPT-3, run

      python src/evaluation.py [FILEPATH] gpt3 --key [OPENAI KEY] --prediction [True or False]

  Save the OPENAI KEY as an environment variable.
- To evaluate all models, run

      python src/evaluate_all_models.py [FILENAME] --prediction [True or False]

  e.g. `python src/evaluate_all_models.py 'data/NEG-1500-SIMP.txt'`
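To run just a subset of models, the single-model CLI above can be looped over; a convenience sketch (the chosen models and dataset are arbitrary examples):

```python
import subprocess

# Run the documented single-model evaluation CLI for a few models in turn.
for model in ["bert-base-uncased", "roberta-base", "gpt2"]:
    subprocess.run(
        ["python", "src/evaluation.py", "data/NEG-1500-SIMP-GEN.txt",
         model, "--prediction", "True"],
        check=True,
    )
```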
The top 20 predictions for the extended datasets are saved in `predictions/`. Results are saved in `result/`. Each file has the model name and the top 20 | top 10 | top 5 | top 1 prediction accuracy. `sensitivity-neg-all.txt` has the negation sensitivity results. Sensitivity stands for the number of times the target word flipped when a `not` is added to the affirmative sentence; the sensitivity is `#flipped / #affirmative sentences`.

- #affirmative sentences for NEG-1500-SIMP-TEMP = 770
- #affirmative sentences for NEG-1500-SIMP-GEN = 750
- #affirmative sentences for NEG-136-SIMP = 18
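Given these counts, sensitivity is a single ratio; a sketch with a made-up flip count:

```python
# Sensitivity = #flipped / #affirmative sentences.
flipped = 300        # hypothetical number of flipped target words
affirmative = 750    # NEG-1500-SIMP-GEN affirmative sentence count (from above)
print(f"sensitivity = {flipped / affirmative:.3f}")
```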
To avoid overwriting the results, `result/` also keeps the results from previous runs.
To extend the NEG-136-SIMP or ROLE-88 dataset, run

    python src/generate_data.py --dataset [DATASET NAME] --key [OPENAI KEY]

DATASET NAME can be `negsimp_template`, `negsimp_gen`, or `role`. OPENAI KEY is None for `negsimp_template`; otherwise, save it as an environment variable.
For `role`, the number of in-context samples is 4, and the number of times the script prompts GPT-3 for a response defaults to 10, which generates around 40 sentence pairs. Once the samples are generated, they are manually cleaned.
The generated datasets are saved in `data/`.
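The in-context generation step boils down to prompting GPT-3 with a handful of existing pairs and asking for more. A rough sketch assuming the legacy `openai` (<1.0) SDK and the `text-davinci-002` model listed above; the prompt, the environment-variable name, and the example pair are illustrative, not the exact ones in `src/generate_data.py`:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_KEY"]  # variable name is an assumption

# A role-reversal pair as in-context example, followed by an open slot.
prompt = (
    "Generate sentence pairs where the two roles are swapped:\n"
    "the teacher noticed which student the principal had called\n"
    "the teacher noticed which principal the student had called\n\n"
)
response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=100,
    temperature=0.7,
)
print(response["choices"][0]["text"])
```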
Battig, W. F., & Montague, W. E. (1969). Category norms of verbal items in 56 categories: A replication and extension of the Connecticut category norms. Journal of Experimental Psychology, 80, 1-46.

Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48. https://arxiv.org/abs/1907.13528
If you use this work, please cite:
    @article{shivagunde2023larger,
      title={Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning},
      author={Shivagunde, Namrata and Lialin, Vladislav and Rumshisky, Anna},
      journal={arXiv preprint arXiv:2303.16445},
      year={2023}
    }