Multilingual Fake News Dataset created for the research paper "A Transformer Based Approach to Multilingual Fake News Detection in Low Resource Languages" accepted at the ACM Transactions on Asian and Low-Resource Language Information Processing (ACM TALLIP)
This dataset accompanies the paper "A Transformer-Based Approach to Multilingual Fake News Detection in Low-Resource Languages", published in ACM Transactions on Asian and Low-Resource Language Information Processing (ACM TALLIP) by Arkadipta De, Dibyanayan Bandyopadhyay, Baban Gain, and Asif Ekbal, in joint research work from IIT Hyderabad and IIT Patna.
The dataset is available at: http://www.iitp.ac.in/~ai-nlp-ml/resources/data/TALLIP-FakeNews-Dataset.zip
The paper can be found at: https://dl.acm.org/doi/abs/10.1145/3472619
- English Version of Dataset (Train and Test)
- Hindi Version of Dataset (Train and Test)
- Swahili Version of Dataset (Train and Test)
- Vietnamese Version of Dataset (Train and Test)
- Indonesian Version of Dataset (Train and Test)
- Multilingual Version of Dataset (Train and Test)

Each dataset covers six different domains (Technology, Business, Education, Politics, Celebrity News, Entertainment).
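This README does not describe the file names or formats inside the archive, so the following is only a minimal sketch for fetching the zip and listing its members before deciding how to parse them. The URL is the one given above; the helper names (`download_archive`, `list_members`, `main`) and the local file name are illustrative assumptions, not part of the dataset release.

```python
import urllib.request
import zipfile
from pathlib import Path

# Dataset URL as given in this README.
DATASET_URL = "http://www.iitp.ac.in/~ai-nlp-ml/resources/data/TALLIP-FakeNews-Dataset.zip"


def download_archive(url: str, dest: Path) -> Path:
    """Download the archive to `dest` unless a file already exists there."""
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def list_members(archive: Path) -> list:
    """Return the sorted file paths stored in the zip, skipping directory entries."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(info.filename for info in zf.infolist() if not info.is_dir())


def main() -> None:
    # Local file name is an arbitrary choice for this sketch.
    archive = download_archive(DATASET_URL, Path("TALLIP-FakeNews-Dataset.zip"))
    for name in list_members(archive):
        print(name)
```

Calling `main()` downloads the archive once and prints every file path it contains, which should show how the per-language train and test splits are organised.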
- The English version of the dataset has been collected, cleaned, and processed from the research paper "Automatic Detection of Fake News" by Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea, published in the Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018). Paper URL: https://www.aclweb.org/anthology/C18-1287. If you use only the English version of the dataset, please cite that paper.
- The dataset was extended by the authors of this paper. If you use this dataset in any research work, please cite the following paper:
@article{10.1145/3472619,
author = {De, Arkadipta and Bandyopadhyay, Dibyanayan and Gain, Baban and Ekbal, Asif},
title = {A Transformer-Based Approach to Multilingual Fake News Detection in Low-Resource Languages},
year = {2021},
issue_date = {January 2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {21},
number = {1},
issn = {2375-4699},
url = {https://doi.org/10.1145/3472619},
doi = {10.1145/3472619},
abstract = {Fake news classification is one of the most interesting problems that has attracted huge attention to the researchers of artificial intelligence, natural language processing, and machine learning (ML). Most of the current works on fake news detection are in the English language, and hence this has limited its widespread usability, especially outside the English literate population. Although there has been a growth in multilingual web content, fake news classification in low-resource languages is still a challenge due to the non-availability of an annotated corpus and tools. This article proposes an effective neural model based on the multilingual Bidirectional Encoder Representations from Transformer (BERT) for domain-agnostic multilingual fake news classification. Large varieties of experiments, including language-specific and domain-specific settings, are conducted. The proposed model achieves high accuracy in domain-specific and domain-agnostic experiments, and it also outperforms the current state-of-the-art models. We perform experiments on zero-shot settings to assess the effectiveness of language-agnostic feature transfer across different languages, showing encouraging results. Cross-domain transfer experiments are also performed to assess language-independent feature transfer of the model. We also offer a multilingual multidomain fake news detection dataset of five languages and seven different domains that could be useful for the research and development in resource-scarce scenarios.},
journal = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
month = {nov},
articleno = {9},
numpages = {20},
keywords = {Swahili, Fake news detection, Indonesian, low-resource languages, Hindi, Vietnamese, multilingual}
}
- Arkadipta De (M.Tech, IIT Hyderabad) [Corresponding Author] (GitHub - https://github.com/Arko98)
- Dibyanayan Bandyopadhyay (M.Tech, IIT Patna) (GitHub - https://github.com/newcodevelop)
- Baban Gain (M.Tech, IIT Patna) (GitHub - https://github.com/babangain)