
Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning

Paper link - https://arxiv.org/abs/2303.16445

Datasets are available on 🤗

from datasets import load_dataset

dataset = load_dataset("text-machine-lab/NEG-1500-SIMP-TEMP")
dataset = load_dataset("text-machine-lab/NEG-1500-SIMP-GEN")
dataset = load_dataset("text-machine-lab/ROLE-1500")

Download datasets from 🤗 at https://huggingface.co/text-machine-lab
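Each dataset can be inspected after loading. Below is a minimal sketch, assuming the default "train" split; the exact split and column names are not documented here and may differ:

from datasets import load_dataset

dataset = load_dataset("text-machine-lab/ROLE-1500")
print(dataset)              # shows available splits and columns
print(dataset["train"][0])  # first example, assuming a "train" split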

In this work, we introduce new, larger datasets for negation (NEG-1500-SIMP) and role reversal (ROLE-1500), inspired by the psycholinguistic studies of Ettinger (2020). We dramatically extend the existing NEG-136 and ROLE-88 benchmarks using GPT-3, increasing their size from 18 and 44 sentence pairs to 750 each. We also create another extended negation dataset, NEG-1500-SIMP-TEMP, using template-based generation; it consists of 770 sentence pairs. We evaluate 22 models on the generated datasets.

A. Datasets

The new datasets can be found in data/:

  • NEG-1500-SIMP-TEMP (generated using a template)
  • NEG-1500-SIMP-GEN (generated by GPT-3)
  • ROLE-1500 (generated by GPT-3)

original_datasets contains the original datasets from Ettinger (2020). The categories and subcategories used to create NEG-1500-SIMP-TEMP are listed in neg_simp_categories.csv.
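For illustration, the template pairs follow the NEG-SIMP pattern from Ettinger (2020) ("A robin is a bird." / "A robin is not a bird."). Below is a minimal sketch of template-based pair generation; the column names "category" and "subcategory" are assumptions about neg_simp_categories.csv, not confirmed by the repo:

import csv

pairs = []
with open("neg_simp_categories.csv", newline="") as f:
    for row in csv.DictReader(f):
        item, category = row["subcategory"], row["category"]  # assumed columns
        pairs.append((f"A {item} is a {category}.",
                      f"A {item} is not a {category}."))

print(pairs[0])  # e.g. ('A robin is a bird.', 'A robin is not a bird.')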

B. Evaluation

B.1 Install libraries

pip install -r requirements.txt

B.2 Evaluate models

The codebase supports:

  • BERT - bert-base-uncased, bert-large-uncased
  • RoBERTa - roberta-base, roberta-large
  • DistilBERT - distilbert-base-uncased
  • ALBERT - albert-base-v1, albert-large-v1, albert-xl-v1, albert-xxl-v1, albert-base-v2, albert-large-v2, albert-xl-v2, albert-xxl-v2
  • T5 - t5-small, t5-base, t5-large, t5-xl (3b)
  • GPT-2 - gpt2, gpt2-medium, gpt2-large, gpt2-xl
  • GPT-3 - text-davinci-002
  1. To evaluate a particular model from the list above (except GPT-3) on the extended dataset, run

     python src/evaluation.py [FILEPATH] [MODELNAME] --prediction [True or False]
     
     e.g.
     python src/evaluation.py 'data/NEG-1500-SIMP-GEN.txt' bert-base-uncased 
     python src/evaluation.py 'data/ROLE-1500.txt' bert-base-uncased
    

Set --prediction to True if predictions from previous runs are not saved. Set it to False if the predictions already exist and only accuracy needs to be computed. The default value is False.
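For context, the following is a minimal sketch of how top-k cloze predictions can be obtained from a masked language model; evaluation.py may differ in its exact details:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Cloze-style probe: the model predicts the masked final word.
sentence = f"A robin is a {tokenizer.mask_token}."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token, then take its top-20 vocabulary candidates.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
top20 = torch.topk(logits[0, mask_pos], k=20).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top20))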

  2. For GPT-3, run

     python src/evaluation.py [FILEPATH] gpt3 --key [OPENAI KEY] --prediction [True or False]
    

Save the OPENAI KEY as an environment variable.
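A minimal sketch of the corresponding GPT-3 call, using the legacy (pre-1.0) openai Python API that matches text-davinci-002; the prompt format and decoding parameters here are assumptions, not necessarily what evaluation.py uses:

import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]  # export it before running

response = openai.Completion.create(
    model="text-davinci-002",
    prompt="A robin is a",  # placeholder cloze prompt
    max_tokens=1,
    temperature=0,
)
print(response["choices"][0]["text"].strip())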

  3. To evaluate all models, run

     python src/evaluate_all_models.py [FILENAME] --prediction [True or False]
     
     e.g.
     python src/evaluate_all_models.py 'data/NEG-1500-SIMP.txt'
    

B.3 Predictions and results

The top 20 predictions for the extended datasets are saved in predictions/. Results are saved in result/. Each file contains the model name and the top 20 | top 10 | top 5 | top 1 prediction accuracy. sensitivity-neg-all.txt contains the negation sensitivity results. Sensitivity is the number of times the target word flipped when "not" was added to the affirmative sentence, i.e. sensitivity = #flipped / #affirmative sentences.

#affirmative sentences for NEG-1500-SIMP-TEMP = 770

#affirmative sentences for NEG-1500-SIMP-GEN = 750

#affirmative sentences for NEG-136-SIMP = 18
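Given top-1 predictions for each affirmative/negated pair, sensitivity reduces to a simple ratio. A minimal sketch; the (affirmative_top1, negated_top1) pair format is hypothetical:

def sensitivity(prediction_pairs):
    # sensitivity = #flipped / #affirmative sentences
    flipped = sum(1 for aff, neg in prediction_pairs if aff != neg)
    return flipped / len(prediction_pairs)

# Toy example: 2 of 3 targets flip when "not" is added -> ~0.67
print(sensitivity([("bird", "tree"), ("fruit", "fruit"), ("tool", "weapon")]))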

To avoid overwriting results, result/ keeps the results from previous runs.

C. Data generation

To extend the original datasets (NEG-136-SIMP and ROLE-88), run

    python src/generate_data.py --dataset [DATASET NAME] --key [OPENAI KEY]

DATASET NAME can be negsimp_template, negsimp_gen, or role. OPENAI KEY is None for negsimp_template; otherwise, save it as an environment variable.

For role, the number of in-context samples is 4, and the number of times the script prompts GPT-3 for a response defaults to 10; each run can generate around 40 sentence pairs. Once the samples are generated, they are manually cleaned.
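A minimal sketch of how such an in-context prompt could be assembled; the instruction wording and seed sentences below are placeholders, not the actual ones used by generate_data.py:

# One ROLE-style seed pair; the remaining seeds (4 in-context samples
# in total) would be drawn from the original dataset the same way.
seed_examples = [
    "the restaurant owner forgot which customer the waitress had served",
    "the restaurant owner forgot which waitress the customer had served",
]

prompt = "Generate more sentence pairs like the following:\n"
prompt += "\n".join(f"- {s}" for s in seed_examples)
prompt += "\n-"  # GPT-3 continues the list from here
print(prompt)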

The generated datasets are saved in data/.

References

Battig, William F., and William Edward Montague. "Category norms of verbal items in 56 categories: A replication and extension of the Connecticut category norms." Journal of Experimental Psychology 80 (1969): 1-46.

Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 34-48. https://arxiv.org/abs/1907.13528

Paper link and Citation

Paper link - https://arxiv.org/abs/2303.16445

  @article{shivagunde2023larger,
    title={Larger Probes Tell a Different Story: Extending Psycholinguistic Datasets Via In-Context Learning},
    author={Shivagunde, Namrata and Lialin, Vladislav and Rumshisky, Anna},
    journal={arXiv preprint arXiv:2303.16445},
    year={2023}
  }