Initial commit

CCWI · Mar 28, 2024 · 6066b18 · 6066b18
commit 6066b18

Showing 13 changed files with 305 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,2 @@
+_site/
+Gemfile.lock
diff --git a/Gemfile b/Gemfile
@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+gem "github-pages", group: :jekyll_plugins
+
+gem "webrick", "~> 1.8"
diff --git a/README.md b/README.md
@@ -0,0 +1,194 @@
+# Corpus of German misogynistic hatespeech posts (GMHP7k)
+## A German Corpus on misogynistic hatespeech posts from Twitter
+On this page we provide the data set for the corpus of German misogynistic hatespeech posts (GMHP7k), which was first presented at the [18th International AAAI Conference on Web and Social Media](https://www.icwsm.org/2024/) (ICWSM 2024) along with a dataset paper.
+<!-- Details can be found in the section [Citation](#citation) below. -->
+
+## Description
+We provide a German corpus consisting of 7,061 posts authored by users of social media platforms. A group of volunteers annotated each post for hatespeech and for misogynistic hatespeech in a binary fashion. The inter-rater reliability over all annotators according to Fleiss’ Kappa is 0.6409 for hatespeech and 0.8258 for misogynistic hatespeech. Furthermore, baseline measurements of BERT-based text classification are presented. Initial experiments with the corpus achieve macro-averaged F1 scores of up to 0.79 for hatespeech and 0.75 for misogynistic hatespeech.
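The macro-averaged F1 score mentioned in the description can be sketched as follows. This is a minimal illustration in plain Python, not the evaluation code used for the reported baselines; the function names `f1_for_class` and `macro_f1` are hypothetical. Macro averaging gives both classes equal weight, which matters here because the positive classes are a minority of the posts.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so this sketch reports an F1 of 0.0 for the class.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels, invented for illustration only.
gold = [1, 0, 0, 0]
predicted = [1, 1, 0, 0]
score = macro_f1(gold, predicted)
print(round(score, 4))
```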
+
+### Classes to annotate
+During annotation, volunteers rated two aspects of each post: the presence of *hatespeech* and of *misogynistic hatespeech*. Whether a post counts as hatespeech depends on how the annotators perceive the text, and it is rated as either *hatespeech* or *not hatespeech*. Likewise, a post is rated as either *misogynistic hatespeech* or *not misogynistic hatespeech*.
+
+### Data Description
+
+<table>
+<colgroup>
+<col style="width: 18%" />
+<col style="width: 36%" />
+<col style="width: 44%" />
+</colgroup>
+<thead>
+<tr class="header">
+<th style="text-align: left;">Column</th>
+<th style="text-align: left;">Name</th>
+<th>Description</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td style="text-align: left;">tweet_id</td>
+<td style="text-align: left;">Tweet ID</td>
+<td>Source ID from Twitter</td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">review_text</td>
+<td style="text-align: left;">Text of the tweet or comment</td>
+<td>User mentions were replaced by <span class="citation"
+data-cites="TwitterUser">@TwitterUser</span></td>
+</tr>
+<tr class="odd">
+<td style="text-align: left;">hs</td>
+<td style="text-align: left;">Hatespeech annotation</td>
+<td>Binary (1 or 0)</td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">m_hs</td>
+<td style="text-align: left;">Misogynistic hatespeech annotation</td>
+<td>Binary (1 or 0)</td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">annotation_id</td>
+<td style="text-align: left;">ID of annotation</td>
+<td>Tweets of phase 2 were annotated by all experts</td>
+</tr>
+<tr class="odd">
+<td style="text-align: left;">created_at</td>
+<td style="text-align: left;">Created timestamp of annotation</td>
+<td></td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">updated_at</td>
+<td style="text-align: left;">Updated timestamp of annotation</td>
+<td></td>
+</tr>
+<tr class="odd">
+<td style="text-align: left;">lead_time</td>
+<td style="text-align: left;">Elapsed time of annotation</td>
+<td></td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">phase</td>
+<td style="text-align: left;">Phase</td>
+<td>1, 2.1, 2.2, 2.3 or 3</td>
+</tr>
+<tr class="odd">
+<td style="text-align: left;">annotator_name</td>
+<td style="text-align: left;">Annotator name</td>
+<td>Pseudonymous identity of the annotator, given as a consecutive number</td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">source</td>
+<td style="text-align: left;">Source of text</td>
+<td>Source dataset of the text</td>
+</tr>
+<tr class="odd">
+<td style="text-align: left;">split_hs</td>
+<td style="text-align: left;">Data split for hatespeech</td>
+<td>“train”, “test”, or “val”</td>
+</tr>
+<tr class="even">
+<td style="text-align: left;">split_m_hs</td>
+<td style="text-align: left;">Data split for misogynistic hatespeech</td>
+<td>“train”, “test”, or “val”</td>
+</tr>
+</tbody>
+</table>
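A minimal sketch of reading records with the columns listed above, using only the Python standard library. The README does not state the file format, so the comma-separated layout, the helper names `load_posts` and `select_split`, and the sample rows are all assumptions made for illustration; the rows are placeholders and not taken from the corpus.

```python
import csv
import io

# Placeholder rows that only mirror the documented column names.
# They are invented for this sketch and do not come from the corpus.
SAMPLE = """tweet_id,review_text,hs,m_hs,phase,annotator_name,source,split_hs,split_m_hs
1,@TwitterUser placeholder text A,0,0,3,1,placeholder_source,train,train
2,@TwitterUser placeholder text B,1,0,3,2,placeholder_source,test,train
3,@TwitterUser placeholder text C,1,1,3,1,placeholder_source,train,val
"""


def load_posts(handle):
    """Read rows as dictionaries and convert the binary labels to int."""
    rows = []
    for row in csv.DictReader(handle):
        row["hs"] = int(row["hs"])
        row["m_hs"] = int(row["m_hs"])
        rows.append(row)
    return rows


def select_split(rows, split_column, split_name):
    """Keep the rows whose split column ("split_hs" or "split_m_hs") matches."""
    return [row for row in rows if row[split_column] == split_name]


posts = load_posts(io.StringIO(SAMPLE))
train_hs = select_split(posts, "split_hs", "train")
print(len(train_hs), sum(row["hs"] for row in train_hs))
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle. Since `split_hs` and `split_m_hs` are separate columns, the sketch selects the split per task rather than assuming both tasks share one partition.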
+
+## Statistics
+To achieve high annotation quality, two preliminary training phases were carried out, whereby the volunteers evaluated 46, 43 and 46 posts in each phase. After each phase, inter-rater reliability was computed with [Fleiss' Kappa](https://en.wikipedia.org/wiki/Fleiss%27_kappa) to measure the quality of the annotation. The resulting kappa values are shown in Fig. 1: the values for *hatespeech* on the left, those for *misogynistic hatespeech* on the right. To determine the impact of each volunteer on the kappa value, further kappa values were calculated for all combinations of n-1 volunteers.
+
+<div>
+<figure>
+<img src="images/interrater-reliability_hs.png" width="49%"><img src="images/interrater-reliability_mhs.png" width="49%">
+<figcaption align="center">Fig.1 - Inter-rater reliability (left to right: hatespeech and misogynistic hatespeech)</figcaption>
+</figure>
+</div>
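The kappa computation described in the Statistics paragraph, including the variant over all combinations of n-1 volunteers, could look roughly like this. It is a sketch of the standard Fleiss' Kappa formula in plain Python, not the authors' script; the function names and the toy ratings are hypothetical, and every item is assumed to be rated by every volunteer.

```python
from itertools import combinations


def fleiss_kappa(ratings):
    """Fleiss' Kappa for `ratings`: one list per item, one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j]: number of raters who assigned item i to category j
    counts = [[item.count(cat) for cat in categories] for item in ratings]
    # Observed agreement per item, then averaged over all items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    # Undefined (division by zero) if all ratings fall into a single category
    return (p_bar - p_e) / (1 - p_e)


def leave_one_out_kappas(ratings):
    """Kappa for every combination of n-1 raters, keyed by the dropped rater."""
    n_raters = len(ratings[0])
    result = {}
    for kept in combinations(range(n_raters), n_raters - 1):
        dropped = (set(range(n_raters)) - set(kept)).pop()
        subset = [[item[r] for r in kept] for item in ratings]
        result[dropped] = fleiss_kappa(subset)
    return result


# Toy ratings (4 posts, 3 raters), invented for illustration only.
toy = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(fleiss_kappa(toy))
print(leave_one_out_kappas(toy))
```

A kappa that rises noticeably when one rater is dropped suggests that this rater disagrees with the rest more often than the others do.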
+
+<div>
+<figure>
+<img src="images/wordcloud_neutral_50.png" width="31%" align="left"><img src="images/wordcloud_hatespeech_50.png" width="31%" align="middle"><img src="images/wordcloud_misogynistic_hatespeech_50.png" width="31%" align="right">
+<figcaption align="center">Fig.2 - Word clouds (left to right: neutral, hatespeech and misogynistic hatespeech)</figcaption>
+</figure>
+</div>
+
+After completion of the training phases, a further 7,061 posts were annotated, which form the core of the corpus. Their quality can be considered assured due to the solid inter-rater reliability of the training phases. Table 1 shows the share of the 7,061 posts assigned to each class. The distribution
+of hatespeech reveals that 22.29 % of the posts were annotated as hatespeech. The table also shows the distribution of the second criterion, misogynistic hatespeech, with 6.51 % of all posts being rated as misogynistic hatespeech. Consequently, 29.22 % of hatespeech posts are also misogynistic.
+
+<figure>
+<figcaption>Tab.1 - Number of posts per class in 7,061 posts</figcaption>
+<table style="margin: 0px auto;">
+  <thead>
+    <tr>
+      <th></th>
+      <th align="right">Posts</th>
+      <th align="right">Percent</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td align="left">Not hatespeech</td>
+      <td align="right">5,487</td>
+      <td align="right">77.71 %</td>
+    </tr>
+    <tr>
+      <td align="left">Hatespeech</td>
+      <td align="right">1,574</td>
+      <td align="right">22.29 %</td>
+    </tr>
+    <tr>
+      <td align="left">Not misogynistic hatespeech</td>
+      <td align="right">6,601</td>
+      <td align="right">93.49 %</td>
+    </tr>
+    <tr>
+      <td align="left">Misogynistic hatespeech</td>
+      <td align="right">460</td>
+      <td align="right">6.51 %</td>
+    </tr>
+  </tbody>
+</table>
+</figure>
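The percentages in Tab. 1 and the 29.22 % figure in the preceding paragraph can be recomputed from the post counts. The short check below uses only the numbers printed in the table; the 29.22 % ratio additionally assumes that every post rated as misogynistic hatespeech is also rated as hatespeech, which the text implies but does not state outright.

```python
TOTAL_POSTS = 7061

# Post counts as given in Tab. 1
COUNTS = {
    "Not hatespeech": 5487,
    "Hatespeech": 1574,
    "Not misogynistic hatespeech": 6601,
    "Misogynistic hatespeech": 460,
}


def percent(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


shares = {label: percent(count, TOTAL_POSTS) for label, count in COUNTS.items()}

# Assumption: all misogynistic hatespeech posts are a subset of the hatespeech posts.
misogynistic_share_of_hs = percent(
    COUNTS["Misogynistic hatespeech"], COUNTS["Hatespeech"]
)

for label, share in shares.items():
    print(f"{label}: {share} %")
print(f"Misogynistic share of hatespeech posts: {misogynistic_share_of_hs} %")
```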
+
+## License
+The corpus is provided under the terms of the [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/). By using the corpus you agree to this license.
+
+<img alt="license" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png" width="118" height="41">
+
+<!--
+## Citation
+The corpus was first presented at [ICWSM 2024](https://www.icwsm.org/2024/).
+> *Jonas Glasebach, Max-Emanuel Keller, Alexander Döschl, Peter Mandl<br>
+> **GMHP7k: A corpus of German misogynistic hatespeech posts**<br>
+> Proceedings of the Eighteenth International AAAI Conference on Web and Social Media<br>
+> Buffalo, NY, USA, June 6–9, 2024<br>*
+
+If you are using the corpus, please cite the following publication. You can find a copy of the [paper here](https://www.icwsm.org/2024/). Reference in BibTeX format:
+ ```
+@inproceedings{Glasebach.2024,
+ author = {Glasebach, Jonas and Keller, Max-Emanuel and Döschl, Alexander and Mandl, Peter},
+ title = {GMHP7k: A corpus of German misogynistic hatespeech posts},
+ booktitle = {Proceedings of the Eighteenth International AAAI Conference on Web and Social Media},
+ series = {ICWSM 2024},
+ year = {2024},
+ location = {Buffalo, NY, USA},
+}
+ ```
+-->
+
+## How to use the data set?
+The [repository for this page](https://github.com/ccwi/corpus-gmhp7k) provides the data set of the corpus along with the statistics and instructions for use.
+
+## About
+The presented corpus was developed during a project of the <a href="https://www.wirtschaftsinformatik-muenchen.de/">Competence Center Wirtschaftsinformatik (CCWI)</a> at the Munich University of Applied Sciences.
+
+<a href="https://www.wirtschaftsinformatik-muenchen.de/"><img src="images/logo-ccwi.png" height="50px"></a>
+
+## Acknowledgement
+Our special thanks go to the experts who contributed to the annotation of the corpus. The presented work was conducted as part of a project funded by *Forschungs- und Entwicklungsprogramm Informations- und Kommunikationstechnik des Freistaates Bayern*. Funding reference number: DIK-2104-0033// DIK0278/01, DIK0278/02,
+DIK0278/03.
+
+The methodology of this work was inspired by the great work of [Schabus et al.](http://dx.doi.org/10.1145/3077136.3080711), who created the [One Million Posts Corpus](https://ofai.github.io/million-post-corpus/) together with the Austrian newspaper *Der Standard* from user comments under online articles on the newspaper's site.
+
+<!--
+## How to run the experiments?
+-->
diff --git a/_config.yml b/_config.yml
@@ -0,0 +1,4 @@
+theme: jekyll-theme-primer
+# theme: jekyll-theme-slate
+title: Corpus of German misogynistic hatespeech posts (GMHP7k)
+description: A German corpus of 7,061 posts annotated for hatespeech and misogynistic hatespeech
diff --git a/_layouts/default-slate.html b/_layouts/default-slate.html
@@ -0,0 +1,62 @@
+<!DOCTYPE html>
+<html lang="{{ site.lang | default: "en-US" }}">
+
+  <head>
+    <meta charset='utf-8'>
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width,maximum-scale=2">
+    <link rel="stylesheet" type="text/css" media="screen" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
+
+{% seo %}
+  </head>
+
+  <body>
+
+    <!-- HEADER -->
+    <div id="header_wrap" class="outer">
+        <header class="inner">
+          {% if site.github.is_project_page %}
+            <a id="forkme_banner" href="{{ site.github.repository_url }}">View on GitHub</a>
+          {% endif %}
+
+          <h1 id="project_title">{{ site.title | default: site.github.repository_name }}</h1>
+          <h2 id="project_tagline">{{ site.description | default: site.github.project_tagline }}</h2>
+
+          {% if site.show_downloads %}
+            <section id="downloads">
+              <a class="zip_download_link" href="{{ site.github.zip_url }}">Download this project as a .zip file</a>
+              <a class="tar_download_link" href="{{ site.github.tar_url }}">Download this project as a tar.gz file</a>
+            </section>
+          {% endif %}
+        </header>
+    </div>
+
+    <!-- MAIN CONTENT -->
+    <div id="main_content_wrap" class="outer">
+      <section id="main_content" class="inner">
+        {{ content }}
+      </section>
+    </div>
+
+    <!-- FOOTER  -->
+    <div id="footer_wrap" class="outer">
+      <footer class="inner">
+        {% if site.github.is_project_page %}
+        <p class="copyright">{{ site.title | default: site.github.repository_name }} maintained by <a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a></p>
+        {% endif %}
+        <p>Published with <a href="https://pages.github.com">GitHub Pages</a></p>
+      </footer>
+    </div>
+
+    {% if site.google_analytics %}
+      <script>
+        (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+        (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+        m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+        })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+        ga('create', '{{ site.google_analytics }}', 'auto');
+        ga('send', 'pageview');
+      </script>
+    {% endif %}
+  </body>
+</html>
diff --git a/_layouts/default.html b/_layouts/default.html
@@ -0,0 +1,38 @@
+<!DOCTYPE html>
+<html lang="{{ site.lang | default: "en-US" }}">
+  <head>
+    <meta charset="UTF-8">
+    <meta http-equiv="X-UA-Compatible" content="IE=edge">
+    <meta name="viewport" content="width=device-width, initial-scale=1">
+
+{% seo %}
+    <link rel="stylesheet" href="{{ "/assets/css/style.css?v=" | append: site.github.build_revision | relative_url }}">
+  </head>
+  <body>
+    <div class="container-lg px-3 my-5 markdown-body">
+      {% if site.title and site.title != page.title %}
+      <h1><a href="{{ "/" | absolute_url }}">{{ site.title }}</a></h1>
+      {% endif %}
+
+      {{ content }}
+
+      {% if site.github.private != true and site.github.license %}
+      <div class="footer border-top border-gray-light mt-5 pt-3 text-right text-gray">
+        This site is open source. {% github_edit_link "Improve this page" %}.
+      </div>
+      {% endif %}
+    </div>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/4.1.0/anchor.min.js" integrity="sha256-lZaRhKri35AyJSypXXs4o6OPFTbTmUoltBbDCbdzegg=" crossorigin="anonymous"></script>
+    <script>anchors.add();</script>
+    {% if site.google_analytics %}
+    <script>
+      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
+      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+      })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+      ga('create', '{{ site.google_analytics }}', 'auto');
+      ga('send', 'pageview');
+    </script>
+    {% endif %}
+  </body>
+</html>
diff --git a/assets/css/style.scss b/assets/css/style.scss
@@ -0,0 +1 @@
+@import "{{ site.theme }}";
diff --git a/images/interrater-reliability_hs.png b/images/interrater-reliability_hs.png
diff --git a/images/interrater-reliability_mhs.png b/images/interrater-reliability_mhs.png
diff --git a/images/logo-ccwi.png b/images/logo-ccwi.png
diff --git a/images/wordcloud_hatespeech_50.png b/images/wordcloud_hatespeech_50.png
diff --git a/images/wordcloud_misogynistic_hatespeech_50.png b/images/wordcloud_misogynistic_hatespeech_50.png
diff --git a/images/wordcloud_neutral_50.png b/images/wordcloud_neutral_50.png