This dataset was created with two purposes:
- Unveil identity biases in a toxicity detection model for written videogame chat
- Reflect the type of lines found in written game chat
The dataset contains a total of 16,008 lines, created from 22 sentence templates and a set of 46 identity-related terms. A full description of the dataset creation method is available in the EMNLP 2023 article.
The dataset contains 10 columns:
- chat_line: a synthetic chat line built from a sentence template and a term or combination of terms that may convey identity biases
- template: the sentence template used to create the chat line
- word1, word2: the words used to fill the tag(s) in the sentence template
- lem1, lem2: the lemmatized versions of word1 and word2
- cat1, cat2: the categories associated with word1 and word2
- manual_annotations: toxicity annotations obtained from human annotators. Only 1,363 lines have a value in this column.
- annotations: the ground-truth labels, obtained by propagating the manual annotations with a random forest classifier trained on the 1,363 manually annotated lines (see the loading example below).
In both the manual_annotations and annotations columns:
- 0 = non-toxic line
- 1 = toxic line
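
Below is a minimal loading sketch using pandas. It assumes the dataset is distributed as a single CSV file; the file name used here is a placeholder, not the actual name of the distributed file.

```python
# Minimal sketch, assuming the dataset ships as a single CSV file.
import pandas as pd

# Placeholder file name; replace with the actual data file.
df = pd.read_csv("toxicity_identity_bias.csv")

# The 10 documented columns
print(df.columns.tolist())

# Ground-truth labels: 0 = non-toxic, 1 = toxic
toxic = df[df["annotations"] == 1]
print(f"{len(toxic)} toxic lines out of {len(df)}")

# Restrict to the 1,363 manually annotated lines
manual = df[df["manual_annotations"].notna()]
print(f"{len(manual)} manually annotated lines")
```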
If you use this dataset, please cite the following paper:
Van Dorpe, J., Yang, Z., Grenon-Godbout, N., & Winterstein, G. (2023). Unveiling Identity Biases in Toxicity Detection: A Game-Focused Dataset and Reactivity Analysis Approach. In M. Wang & I. Zitouni (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track (pp. 263–274). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-industry.26
© [2023] Ubisoft Entertainment. All Rights Reserved.