mParaRel

This repository contains the code for the paper "Factual Consistency of Multilingual Pretrained Language Models". It extends the original ParaRel 🤘 dataset to a multilingual setting, by using mutliple machine translation systems to translate the templates to 45 languages.

The repository was forked from https://github.com/norakassner/mlama from where we used the translations scripts.

Dataset

Update: You can now find the dataset also in HugginFace Datasets 🤗:

You can find the reviewed templates and the subject-object tuples in the folder data/mpararel_reviewed.

Note that we did not report any numbers in Hindi (even though the data is available) since during the human review it was pointed out that the data looked really noisy.

Reproduce the results

Create an environment and install the requirements

python3 -m venv mpararel-venv
source mpararel-venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
export PYTHONPATH=$(pwd)

To reproduce the experiments

Get the models predictions

python evaluate_consistency/get_model_predictions.py \
    --mpararel_folder=$WORKDIR/data/mpararel \
    --model_name="bert-base-multilingual-cased" --batch_size=32 \
    --output_folder=$WORKDIR/data/predictions_mpararel/mbert_cased \
    --cpus 10

You can also add the flags --only_languages to get the predictions only for a couple of languages and not all the ones in the mpararel folder, and you can add --add_end_of_sentence_punctuation '.' if you want to experiment adding a sentence-final punctuation.

Evaluate consistency

python evaluate_consistency/run_evaluation.py \
    --predictions_folder=$WORKDIR/data/predictions_mpararel/mbert_cased \
    --mpararel_folder=$WORKDIR/data/mpararel_reviewed_with_tag \
    --mlama_folder=$WORKDIR/data/mlama1.1 \
    --remove_repeated_subjects

You can also add the flags --only_languages zh-hans if you want don't want to get the numbers of all the languages in the mpararel_folder.

Recreate the dataset

To recreate the generation of the dataset follow the steps in dataset/mpararel.sh

Reference

@inproceedings{fierro-sogaard-2022-factual,
    title = "Factual Consistency of Multilingual Pretrained Language Models",
    author = "Fierro, Constanza  and
      S{\o}gaard, Anders",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-acl.240",
    pages = "3046--3052",
}

Name		Name	Last commit message	Last commit date
Latest commit History 278 Commits
data		data
dataset		dataset
evaluate_consistency		evaluate_consistency
.gitignore		.gitignore
README.md		README.md
logger_utils.py		logger_utils.py
mpararel_utils.py		mpararel_utils.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mParaRel

Dataset

Reproduce the results

Create an environment and install the requirements

To reproduce the experiments

Recreate the dataset

Reference

Acknowledgements

About

Releases

Packages

Languages

coastalcph/mpararel

Folders and files

Latest commit

History

Repository files navigation

mParaRel

Dataset

Reproduce the results

Create an environment and install the requirements

To reproduce the experiments

Recreate the dataset

Reference

Acknowledgements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages