diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..d944721
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+_site/
+Gemfile.lock
diff --git a/Gemfile b/Gemfile
new file mode 100644
index 0000000..8c73203
--- /dev/null
+++ b/Gemfile
@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+gem "github-pages", group: :jekyll_plugins
+
+gem "webrick", "~> 1.8"
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..460588b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,194 @@
+# Corpus of German misogynistic hatespeech posts (GMHP7k)
+## A German Corpus on misogynistic hatespeech posts from Twitter
+On this page we provide the data set for the corpus of German misogynistic hatespeech posts (GMHP7k), which was first presented at the [18th International AAAI Conference on Web and Social Media](https://www.icwsm.org/2024/) (ICWSM 2024) along with a dataset paper.
+
+
+## Description
+We provide a German corpus consisting of 7,061 posts authored by users of social media platforms. A group of volunteers annotated each post in a binary fashion for hatespeech and for misogynistic (misogynous) hatespeech. The inter-rater reliability across all annotators, measured with Fleiss’ Kappa, is 0.6409 for hatespeech and 0.8258 for misogynistic hatespeech. Furthermore, baseline measurements with BERT-based text classification are presented. Initial experiments with the corpus achieve macro-averaged F1 scores of up to 0.79 for hatespeech and 0.75 for misogynistic hatespeech.
+
+### Classes to annotate
+During annotation, volunteers rated two aspects of a post: the presence of *hatespeech* and the presence of *misogynistic hatespeech*. Whether hatespeech is present depends on how the annotators perceive the post text, which they rate as *hatespeech* or *not hatespeech*. Likewise, a post is rated as either *misogynistic hatespeech* or *not misogynistic hatespeech*.
+
+### Data Description
+
+| Column | Name | Description |
+| --- | --- | --- |
+| tweet_id | Tweet ID | Source ID from Twitter |
+| review_text | Text of the tweet or comment | User mentions were replaced by @TwitterUser |
+| hs | Hatespeech annotation | Binary (1 or 0) |
+| m_hs | Misogynistic hatespeech annotation | Binary (1 or 0) |
+| annotation_id | ID of annotation | Tweets of phase 2 were annotated by all experts |
+| created_at | Created timestamp of annotation | |
+| updated_at | Updated timestamp of annotation | |
+| lead_time | Elapsed time of annotation | |
+| phase | Phase | 1, 2.1, 2.2, 2.3 or 3 |
+| annotator_name | Annotator name | Pseudonymous identity of the annotators as consecutive numbers |
+| source | Source of text | Source dataset of the text |
+| split_hs | Data split for hatespeech | “train”, “test”, or “val” |
+| split_m_hs | Data split for misogynistic hatespeech | “train”, “test”, or “val” |
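
The columns above can be used to select a task-specific split. The following is a minimal sketch using only the Python standard library; it assumes the corpus is available as CSV with the listed column names. The inline `SAMPLE_CSV` rows and the helper names `load_rows` and `select_split` are hypothetical illustrations, not part of the published data set or its tooling.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows for illustration only; the real corpus file
# (name and exact format are assumptions) must be obtained from the repository.
SAMPLE_CSV = """tweet_id,review_text,hs,m_hs,phase,annotator_name,split_hs,split_m_hs
1,@TwitterUser example text a,0,0,3,1,train,train
2,@TwitterUser example text b,1,0,3,2,train,test
3,@TwitterUser example text c,1,1,3,1,test,train
4,@TwitterUser example text d,0,0,3,3,val,val
"""


def load_rows(handle):
    """Parse CSV rows and convert the binary label columns to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["hs"] = int(row["hs"])
        row["m_hs"] = int(row["m_hs"])
        rows.append(row)
    return rows


def select_split(rows, split_column, split_name):
    """Return the rows whose split column equals 'train', 'test' or 'val'."""
    return [row for row in rows if row[split_column] == split_name]


rows = load_rows(io.StringIO(SAMPLE_CSV))

# Training portion for the hatespeech task and its label distribution.
train_hs = select_split(rows, "split_hs", "train")
label_counts = Counter(row["hs"] for row in train_hs)
print(len(train_hs), dict(label_counts))
```

Note that `split_hs` and `split_m_hs` are independent columns, so a post may fall into different splits for the two tasks.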
+
+## Statistics
+To achieve high annotation quality, two preliminary training phases were carried out, whereby the volunteers evaluated 46, 43 and 46 posts in each phase. After each phase, the inter-rater reliability was computed with [Fleiss' Kappa](https://en.wikipedia.org/wiki/Fleiss%27_kappa) to measure the quality of the annotation. The resulting kappa values are shown in Figure 1, with the values for *hatespeech* on the left and those for *misogynistic hatespeech* on the right. To determine the impact of each volunteer on the kappa value, further kappa values were calculated for all combinations of n-1 volunteers.
+
+
+
+ +
Fig.1 - Inter-rater reliability (left to right: hatespeech and misogynistic hatespeech)
+
+
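
The reliability analysis described in this section can be sketched as follows. This is a plain-Python illustration of the standard Fleiss' kappa formula together with the "all combinations of n-1 volunteers" idea; it is not the authors' code. The function names `fleiss_kappa` and `leave_one_out_kappas` and the toy ratings are hypothetical, and the sketch assumes every post was rated by the same number of annotators.

```python
from itertools import combinations


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of posts, each a list with one label per annotator."""
    n_posts = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for post in ratings for label in post})
    # counts[i][j]: number of annotators who assigned category j to post i
    counts = [[post.count(cat) for cat in categories] for post in ratings]
    # Observed agreement per post, then averaged over all posts
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_posts
    # Expected agreement from the overall category proportions
    p_j = [
        sum(row[j] for row in counts) / (n_posts * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Only one category was ever used: kappa is undefined
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


def leave_one_out_kappas(ratings):
    """Kappa for every combination of n-1 annotators, keyed by the dropped index."""
    n_raters = len(ratings[0])
    result = {}
    for kept in combinations(range(n_raters), n_raters - 1):
        dropped = (set(range(n_raters)) - set(kept)).pop()
        subset = [[post[i] for i in kept] for post in ratings]
        result[dropped] = fleiss_kappa(subset)
    return result


# Toy data: 6 posts, 4 annotators; annotator 3 deviates on three posts.
toy_ratings = [
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
print(fleiss_kappa(toy_ratings))
print(leave_one_out_kappas(toy_ratings))
```

A kappa that rises noticeably when one annotator is dropped indicates that this annotator deviates from the rest of the group.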
+ +
+
+ +
Fig.2 - Word clouds (left to right: neutral, hatespeech and misogynistic hatespeech)
+
+
+
+After completion of the training phases, a further 7,061 posts were annotated, which form the core of the corpus. Their quality can be considered assured due to the solid inter-rater reliability of the training phases. Table 1 shows how many of the 7,061 posts were assigned to each class. The distribution
+of hatespeech reveals that 22.29 % of the posts were annotated as hatespeech. The table also shows the distribution of the second criterion, misogynistic hatespeech, with 6.51 % of all posts rated as misogynistic hatespeech. Consequently, 29.22 % of hatespeech posts are also misogynistic.
+
+
+
Tab.1 - Number of posts per class in 7,061 posts
+| Class | Posts | Percent |
+| --- | --- | --- |
+| Not hatespeech | 5,487 | 77.71 % |
+| Hatespeech | 1,574 | 22.29 % |
+| Not misogynistic hatespeech | 6,601 | 93.49 % |
+| Misogynistic hatespeech | 460 | 6.51 % |
+
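
The percentages in Table 1 and in the preceding paragraph follow directly from the post counts. The short sketch below reproduces the arithmetic; the helper name `share` is hypothetical, and the last figure assumes, as the text implies, that every misogynistic post is also counted as hatespeech.

```python
TOTAL_POSTS = 7061

# Post counts per class, taken from Table 1
counts = {
    "not_hatespeech": 5487,
    "hatespeech": 1574,
    "not_misogynistic_hatespeech": 6601,
    "misogynistic_hatespeech": 460,
}


def share(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


hs_share = share(counts["hatespeech"], TOTAL_POSTS)
m_hs_share = share(counts["misogynistic_hatespeech"], TOTAL_POSTS)
# Assumption: misogynistic hatespeech is a subset of hatespeech
m_hs_within_hs = share(counts["misogynistic_hatespeech"], counts["hatespeech"])

print(hs_share, m_hs_share, m_hs_within_hs)
```

The two binary criteria each partition the full corpus, which is why both pairs of counts sum to 7,061.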
+
+## License
+The corpus is provided under the terms of the [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/). By using the corpus you agree to this license.
+
+license
+
+
+
+## How to use the data set?
+The [repository for this page](https://github.com/ccwi/corpus-gmhp7k) provides the data set of the corpus along with the statistics and instructions for use.
+
+## About
+The presented corpus was developed during a project of the Competence Center Wirtschaftsinformatik (CCWI) at the Munich University of Applied Sciences.
+
+
+
+## Acknowledgement
+Our special thanks go to the experts who contributed to the annotation of the corpus. The presented work was conducted as part of a project funded by *Forschungs- und Entwicklungsprogramm Informations- und Kommunikationstechnik des Freistaates Bayern*. Funding reference number: DIK-2104-0033// DIK0278/01, DIK0278/02,
+DIK0278/03.
+
+The methodology of this work was inspired by the great work of [Schabus et al.](http://dx.doi.org/10.1145/3077136.3080711), who created the [One Million Posts Corpus](https://ofai.github.io/million-post-corpus/) together with the Austrian newspaper *Der Standard* from user comments under online articles on the site of the newspaper.
+
+
diff --git a/_config.yml b/_config.yml
new file mode 100644
index 0000000..e61d5bd
--- /dev/null
+++ b/_config.yml
@@ -0,0 +1,4 @@
+theme: jekyll-theme-primer
+# theme: jekyll-theme-slate
+title: Corpus of German misogynistic hatespeech posts (GMHP7k)
+description: A corpus of 7,061 annotated German social media posts
diff --git a/_layouts/default-slate.html b/_layouts/default-slate.html
new file mode 100644
index 0000000..d56389a
--- /dev/null
+++ b/_layouts/default-slate.html
@@ -0,0 +1,62 @@
+
+
+
+
+
+
+
+
+
+{% seo %}
+
+
+
+
+
+
+
+ {% if site.github.is_project_page %} + View on GitHub + {% endif %} + +

{{ site.title | default: site.github.repository_name }}

+

{{ site.description | default: site.github.project_tagline }}

+ + {% if site.show_downloads %} +
+ Download this project as a .zip file + Download this project as a tar.gz file +
+ {% endif %} +
+
+ + +
+
+ {{ content }} +
+
+ + + + + {% if site.google_analytics %} + + {% endif %} + + diff --git a/_layouts/default.html b/_layouts/default.html new file mode 100644 index 0000000..b5a4363 --- /dev/null +++ b/_layouts/default.html @@ -0,0 +1,38 @@ + + + + + + + +{% seo %} + + + +
+ {% if site.title and site.title != page.title %} +

{{ site.title }}

+ {% endif %} + + {{ content }} + + {% if site.github.private != true and site.github.license %} + + {% endif %} +
+ + + {% if site.google_analytics %} + + {% endif %} + + diff --git a/assets/css/style.scss b/assets/css/style.scss new file mode 100644 index 0000000..ff9937e --- /dev/null +++ b/assets/css/style.scss @@ -0,0 +1 @@ +@import "{{ site.theme }}"; diff --git a/images/interrater-reliability_hs.png b/images/interrater-reliability_hs.png new file mode 100644 index 0000000..14f8098 Binary files /dev/null and b/images/interrater-reliability_hs.png differ diff --git a/images/interrater-reliability_mhs.png b/images/interrater-reliability_mhs.png new file mode 100644 index 0000000..d150cb3 Binary files /dev/null and b/images/interrater-reliability_mhs.png differ diff --git a/images/logo-ccwi.png b/images/logo-ccwi.png new file mode 100644 index 0000000..6b07184 Binary files /dev/null and b/images/logo-ccwi.png differ diff --git a/images/wordcloud_hatespeech_50.png b/images/wordcloud_hatespeech_50.png new file mode 100644 index 0000000..1870b89 Binary files /dev/null and b/images/wordcloud_hatespeech_50.png differ diff --git a/images/wordcloud_misogynistic_hatespeech_50.png b/images/wordcloud_misogynistic_hatespeech_50.png new file mode 100644 index 0000000..cfbd646 Binary files /dev/null and b/images/wordcloud_misogynistic_hatespeech_50.png differ diff --git a/images/wordcloud_neutral_50.png b/images/wordcloud_neutral_50.png new file mode 100644 index 0000000..6d3870f Binary files /dev/null and b/images/wordcloud_neutral_50.png differ