DICE is a collection of 10,395 Italian news articles describing 13 types of crime events that happened in the province of Modena, Italy, between the end of 2011 and 2021.
Number of documents | 10,395 |
---|---|
Theft | 7,627 (73.37%) |
Drug dealing | 934 (8.99%) |
Aggression | 426 (4.10%) |
Illegal sale | 339 (3.26%) |
Mistreatment | 201 (1.93%) |
Robbery | 182 (1.75%) |
Scam | 171 (1.65%) |
Evasion | 135 (1.30%) |
Sexual violence | 124 (1.19%) |
Money laundering | 98 (0.94%) |
Kidnapping | 66 (0.63%) |
Murder | 54 (0.52%) |
Fraud | 38 (0.37%) |
21 news articles have empty text.
The news articles are published online by the newspaper named Gazzetta di Modena. Thanks to an agreement between the University of Modena and Reggio Emilia and the Gazzetta di Modena, signed on May 2022, the corpus is free to redistribute and transform without encountering legal copyright issues under an Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Moreover, the dataset includes data derived from the abovementioned components thanks to the application of Natural Language Processing techniques. Some examples are the place of the crime event occurrence (municipality, area, address and GPS coordinates), the date of the occurrence, and the type of the crime events described in the news article obtained by an automatic categorization of the text.
The extraction of information is focused on thefts since this is the most frequent crime in the dataset.
At the moment, the data are organized in the following files:
-
italian_crime_news.csv is the actual dataset containing some information about the news articles: id, url, title, sub title, text, place of the event occurrence (municipality, area, address), latitude and longitude (i.e., the GPS coordinates of the place), publication_date, event_date (i.e., when the crime occurred), newspaper_tag (i.e., crime category provided by the newspaper), word2vec_tag (i.e., crime category obtained by applying different categorization algorithms to the embeddings of news articles'body), newspaper/publisher.
-
duplicate detection is the folder containing the configuration of the algorithms used to find duplicates ("algorithm.csv") and the news articles identified as duplicates with the corresponding similarity score ("duplicate.csv").
-
automatic_annotation.jsonl contains the named entities found by Tint, the time expressions identified by Heideltime, and the DBpedia URIs obtained by DBpedia Spotlight from the text of the news for all the news articles in the dataset. The Java code used to extract these annotations is published in the folder automatic annotation.
News selected | 10,395 |
---|---|
Geolocalized news | 8,295 |
Duplicate news | 1,866 |
NER objects | 75,256 |
DBpedia link | 42,545 |
Time expression | 20,832 |
- annotation.jsonl contains the automatic annotation mentioned above and the manual annotation of What was stolen in the theft, Where the theft occurred, Who is the thief or criminal, Who was mugged. This annotation was made by 2 expert annotators and one competent annotator following the guidelines in Linee guida per l'annotazione V2.1.pdf. We selected 1000 news articles for manual annotation, however, entities and relations were annotated in 406 news articles since 161 news articles are releted to other types of crimes, 135 to multiple events, and the remaining 298 do not concern crimes, thus, they were not annotated. N.B. this file is constantly updated to add new annotated news articles!
News selected | 1000 |
---|---|
Single event theft news | 406 |
OBJ - relations for OBJ | 664 - 74 |
AUT - relations for AUT | 675 - 259 |
AUTG | 162 |
VIC - relations for VIC | 203 - 63 |
VICG | 23 |
PAR | 175 |
LOC | 686 |
- annotation csv is the folder containing one file for each news article, the name of the files is the identifier (id) used in italian_crime_news.csv, the format of the files is a CSV with three columns: the token of the news article's text, the labels associated to that token by the automatic annotation and the manual annotation and, if present, the relations found by the manual annotation. The data contained in these files are also in the files automatic_annotation.jsonl and annotation.jsonl.
Other researchers can employ the dataset to apply other algorithms of text categorization and duplicate detection and compare their results with the benchmark. The dataset can be useful for several scopes, e.g., geo-localization of the events, text summarization, crime analysis, crime prediction, community detection, topic modeling, news recommendation.
If the dataset is useful, please consider citing papers using the BibTex entry below.
@inproceedings{bonisoli2023,
author = {Giovanni Bonisoli and
Maria Pia di Buono and
Laura Po and
Federica Rollo},
editor = {Hsin-Hsi Chen and
Wei-Jou Edward Duh and
Hen-Hsen Huang and
Makoto P. Kato and
Josiane Mothe and
Barbara Poblete},
title = {DICE: A Dataset of Italian Crime Event news},
booktitle = {{SIGIR} '23: The 46th International {ACM} {SIGIR} Conference on Research
and Development in Information Retrieval, Taipei, Taiwan, July 23 - 27, 2023},
publisher = {{ACM}},
year = {2023},
doi = {10.1145/3539618.3591904}
}
@inproceedings{rollo2020,
author = {Federica Rollo and
Laura Po},
editor = {Jeff Z. Pan and
Valentina A. M. Tamma and
Claudia d'Amato and
Krzysztof Janowicz and
Bo Fu and
Axel Polleres and
Oshani Seneviratne and
Lalana Kagal},
title = {Crime Event Localization and Deduplication},
booktitle = {The Semantic Web - {ISWC} 2020 - 19th International Semantic Web Conference,
Athens, Greece, November 2-6, 2020, Proceedings, Part {II}},
series = {Lecture Notes in Computer Science},
volume = {12507},
pages = {361--377},
publisher = {Springer},
year = {2020},
doi = {10.1007/978-3-030-62466-8\_23}
}
@inproceedings{bonisoli2021,
author = {Giovanni Bonisoli and
Federica Rollo and
Laura Po},
editor = {Maria Ganzha and
Leszek A. Maciaszek and
Marcin Paprzycki and
Dominik Slezak},
title = {Using Word Embeddings for Italian Crime News Categorization},
booktitle = {Proceedings of the 16th Conference on Computer Science and Intelligence Systems, Online, September 2-5, 2021},
pages = {461--470},
year = {2021},
url = {https://doi.org/10.15439/2021F118},
doi = {10.15439/2021F118}
}
@inproceedings{rollo2021,
author = {Federica Rollo and
Giovanni Bonisoli and
Laura Po},
editor = {Ewa Ziemba and
Witold Chmielarz},
title = {Supervised and Unsupervised Categorization of an Imbalanced Italian Crime News Dataset},
booktitle = {Information Technology for Management: Business and Social Issues
- 16th Conference, {ISM} 2021, and FedCSIS-AIST 2021 Track, Held as
Part of FedCSIS 2021, Virtual Event, September 2-5, 2021, Extended
and Revised Selected Papers},
series = {Lecture Notes in Business Information Processing},
volume = {442},
pages = {117--139},
publisher = {Springer},
year = {2021},
url = {https://doi.org/10.1007/978-3-030-98997-2\_6},
doi = {10.1007/978-3-030-98997-2\_6}
}
@inproceedings{rollo2022,
author = {Federica Rollo and Laura Po and Giovanni Bonisoli},
title = {Online News Event Extraction for Crime Analysis},
booktitle = {Proceedings of the 30th Italian Symposium on Advanced Database Systems,
{SEBD} 2022, Tirrenia (PI), Italy, June 19-22, 2022},
series = {{CEUR} Workshop Proceedings}
publisher = {CEUR-WS.org},
year = {2022}
}