Initial commit
mxekeller committed Mar 28, 2024
0 parents commit 6066b18
Showing 13 changed files with 305 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
_site/
Gemfile.lock
4 changes: 4 additions & 0 deletions Gemfile
@@ -0,0 +1,4 @@
source 'https://rubygems.org'
gem "github-pages", group: :jekyll_plugins

gem "webrick", "~> 1.8"
194 changes: 194 additions & 0 deletions README.md
@@ -0,0 +1,194 @@
# Corpus of German misogynistic hatespeech posts (GMHP7k)
## A German Corpus on misogynistic hatespeech posts from Twitter
On this page we provide the data set for the corpus of German misogynistic hatespeech posts (GMHP7k), which was first presented at the [18th International AAAI Conference on Web and Social Media](https://www.icwsm.org/2024/) (ICWSM 2024) along with a dataset paper.
<!-- Details can be found in the section [Citation](#citation) below. -->

## Description
We provide a German corpus consisting of 7,061 posts authored by users of social media platforms. A group of volunteers annotated each post in a binary fashion for hatespeech and for misogynistic hatespeech. The inter-rater reliability over all annotators according to Fleiss’ Kappa is 0.6409 for hatespeech and 0.8258 for misogynistic hatespeech. Furthermore, we present baseline measurements for machine-learning-based text classification with BERT. Initial experiments with the corpus achieve macro-averaged F1-scores of up to 0.79 for hatespeech and 0.75 for misogynistic hatespeech.
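
The baselines above are reported as macro-averaged F1-scores. The following is a minimal sketch of that metric, not the evaluation code used for the paper; the label lists are made-up toy data. It illustrates why the macro average is a sensible choice for imbalanced classes: a classifier that always predicts the majority class is penalised heavily.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of a single class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: precision and/or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels (hypothetical, not taken from the corpus): 1 = hatespeech.
y_true = [0, 0, 0, 0, 1, 1]
some_model = [0, 0, 0, 1, 1, 0]
always_majority = [0, 0, 0, 0, 0, 0]

print(macro_f1(y_true, some_model))
print(macro_f1(y_true, always_majority))
```

Because every class contributes equally to the mean, the rare positive class counts as much as the frequent negative class.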

### Classes to annotate
During annotation, volunteers rated two aspects of a post: the presence of *hatespeech* and the presence of *misogynistic hatespeech*. Whether a post counts as hatespeech depends on how the annotators perceive its text; a post is rated as either *hatespeech* or *not hatespeech*. Likewise, a post is rated as either *misogynistic hatespeech* or *not misogynistic hatespeech*.

### Data Description

<table>
<colgroup>
<col style="width: 18%" />
<col style="width: 36%" />
<col style="width: 44%" />
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Column</th>
<th style="text-align: left;">Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">tweet_id</td>
<td style="text-align: left;">Tweet ID</td>
<td>Source ID from Twitter</td>
</tr>
<tr class="even">
<td style="text-align: left;">review_text</td>
<td style="text-align: left;">Text of the tweet or comment</td>
<td>User mentions were replaced by <span class="citation"
data-cites="TwitterUser">@TwitterUser</span></td>
</tr>
<tr class="odd">
<td style="text-align: left;">hs</td>
<td style="text-align: left;">Hatespeech annotation</td>
<td>Binary (1 or 0)</td>
</tr>
<tr class="even">
<td style="text-align: left;">m_hs</td>
<td style="text-align: left;">Misogynistic hatespeech annotation</td>
<td>Binary (1 or 0)</td>
</tr>
<tr class="even">
<td style="text-align: left;">annotation_id</td>
<td style="text-align: left;">ID of annotation</td>
<td>Tweets of phase 2 were annotated by all experts</td>
</tr>
<tr class="odd">
<td style="text-align: left;">created_at</td>
<td style="text-align: left;">Created timestamp of annotation</td>
<td></td>
</tr>
<tr class="even">
<td style="text-align: left;">updated_at</td>
<td style="text-align: left;">Updated timestamp of annotation</td>
<td></td>
</tr>
<tr class="odd">
<td style="text-align: left;">lead_time</td>
<td style="text-align: left;">Elapsed time of annotation</td>
<td></td>
</tr>
<tr class="even">
<td style="text-align: left;">phase</td>
<td style="text-align: left;">Phase</td>
<td>1, 2.1, 2.2, 2.3 or 3</td>
</tr>
<tr class="odd">
<td style="text-align: left;">annotator_name</td>
<td style="text-align: left;">Annotator name</td>
<td>Pseudonymous identity of the annotator, given as a consecutive number</td>
</tr>
<tr class="even">
<td style="text-align: left;">source</td>
<td style="text-align: left;">Source of text</td>
<td>Source dataset of the text</td>
</tr>
<tr class="odd">
<td style="text-align: left;">split_hs</td>
<td style="text-align: left;">Data split for hatespeech</td>
<td>“train”, “test”, or “val”</td>
</tr>
<tr class="even">
<td style="text-align: left;">split_m_hs</td>
<td style="text-align: left;">Data split for misogynistic hatespeech</td>
<td>“train”, “test”, or “val”</td>
</tr>
</tbody>
</table>
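
As a rough sketch of how the columns above could be consumed, the following groups posts by the `split_hs` column using only the Python standard library. The file format is an assumption (comma-separated with a header row), and the three sample rows are invented placeholders, not real corpus entries; only the column names are taken from the table.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed CSV layout; the real file may differ.
SAMPLE = """tweet_id,review_text,hs,m_hs,split_hs,split_m_hs
1,@TwitterUser example text A,0,0,train,train
2,@TwitterUser example text B,1,0,test,train
3,@TwitterUser example text C,1,1,train,val
"""


def group_by_split(handle, split_column, label_column):
    """Return {split name: [(text, label), ...]} for one annotation task."""
    splits = defaultdict(list)
    for row in csv.DictReader(handle):
        splits[row[split_column]].append(
            (row["review_text"], int(row[label_column]))
        )
    return dict(splits)


# Hatespeech task: labels from `hs`, splits from `split_hs`.
hs_splits = group_by_split(io.StringIO(SAMPLE), "split_hs", "hs")
# Misogynistic hatespeech task: labels from `m_hs`, splits from `split_m_hs`.
m_hs_splits = group_by_split(io.StringIO(SAMPLE), "split_m_hs", "m_hs")

for name, rows in hs_splits.items():
    print(name, len(rows))
```

Each task has its own split column, so the same post can fall into different splits for the two tasks.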

## Statistics
In order to achieve a high quality of annotation, two preliminary training phases were carried out, whereby the volunteers evaluated 46, 43 and 46 posts in each phase. After each phase, the inter-rater reliability was computed with [Fleiss' Kappa](https://en.wikipedia.org/wiki/Fleiss%27_kappa) to measure the quality of the annotation. The resulting kappa values are shown in Fig. 1. The values for *hatespeech* are shown on the left, those for *misogynistic hatespeech* on the right. To determine the impact of each volunteer on the kappa value, further kappa values were calculated for all combinations of n-1 volunteers.

<div>
<figure>
<img src="images/interrater-reliability_hs.png" width="49%"><img src="images/interrater-reliability_mhs.png" width="49%">
<figcaption align="center">Fig.1 - Interrater-reliability (LTR hatespeech and misogynistic hatespeech)</figcaption>
</figure>
</div>
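
The kappa computation and the n-1 analysis described above can be sketched as follows. This is a plain-Python illustration of the standard Fleiss' kappa formula, not the script used to produce Fig. 1; the function names and the toy ratings are assumptions made for this example.

```python
from itertools import combinations


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings[i][r]` = label given to item i by rater r.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j]: how many raters assigned item i to category j
    counts = [[item.count(c) for c in categories] for item in ratings]
    # Observed agreement, averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        return float("nan")  # only one category used: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


def leave_one_out_kappas(ratings):
    """Kappa for every combination of n-1 raters, keyed by the omitted rater."""
    n_raters = len(ratings[0])
    result = {}
    for kept in combinations(range(n_raters), n_raters - 1):
        left_out = (set(range(n_raters)) - set(kept)).pop()
        subset = [[item[r] for r in kept] for item in ratings]
        result[left_out] = fleiss_kappa(subset)
    return result


# Toy data (hypothetical): 4 posts, 3 raters, binary labels.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(fleiss_kappa(toy))
print(leave_one_out_kappas(toy))
```

In the toy data, omitting the third rater (index 2) removes the only disagreement, so the kappa of the remaining two raters rises to 1. A rater whose omission raises kappa noticeably is the one who deviates most from the group.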

<div>
<figure>
<img src="images/wordcloud_neutral_50.png" width="31%" align="left"><img src="images/wordcloud_hatespeech_50.png" width="31%" align="middle"><img src="images/wordcloud_misogynistic_hatespeech_50.png" width="31%" align="right">
<figcaption align="center">Fig.2 - Word clouds (LTR neutral, hatespeech and misogynistic hatespeech)</figcaption>
</figure>
</div>

After completion of the training phases, a further 7,061 posts were annotated, which form the core of the corpus. Their quality can be considered assured due to the solid inter-rater reliability of the training phases. Table 1 shows the share of the 7,061 posts assigned to each class. The distribution of hatespeech reveals that 22.29 % of the posts were annotated as hatespeech. The table also shows the distribution of the second criterion, misogynistic hatespeech, with 6.51 % of all posts being rated as misogynistic hatespeech. Consequently, 29.22 % of hatespeech posts are also misogynistic.

<figure>
<figcaption>Tab.1 - Number of posts per class in 7,061 posts</figcaption>
<table style="margin: 0px auto;">
<thead>
<tr>
<th></th>
<th align="right">Posts</th>
<th align="right">Percent</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Not hatespeech</td>
<td align="right">5,487</td>
<td align="right">77.71 %</td>
</tr>
<tr>
<td align="left">Hatespeech</td>
<td align="right">1,574</td>
<td align="right">22.29 %</td>
</tr>
<tr>
<td align="left">Not misogynistic hatespeech</td>
<td align="right">6,601</td>
<td align="right">93.49 %</td>
</tr>
<tr>
<td align="left">Misogynistic hatespeech</td>
<td align="right">460</td>
<td align="right">6.51 %</td>
</tr>
</tbody>
</table>
</figure>
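
The percentages in Tab. 1 and the 29.22 % figure can be reproduced from the post counts. The short check below uses only the numbers from the table; the last ratio additionally assumes that every misogynistic hatespeech post is also labelled as hatespeech, which the text implies but does not state explicitly.

```python
TOTAL = 7061
HATESPEECH = 1574
MISOGYNISTIC = 460


def pct(part, whole):
    """Share of `part` in `whole` as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


print(pct(TOTAL - HATESPEECH, TOTAL))    # not hatespeech
print(pct(HATESPEECH, TOTAL))            # hatespeech
print(pct(TOTAL - MISOGYNISTIC, TOTAL))  # not misogynistic hatespeech
print(pct(MISOGYNISTIC, TOTAL))          # misogynistic hatespeech
# Assumes all misogynistic posts are a subset of the hatespeech posts.
print(pct(MISOGYNISTIC, HATESPEECH))
```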

## License
The corpus is provided under the terms of the [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/). By using the corpus you agree to this license.

<img alt="license" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by.png" width="118" height="41">

<!--
## Citation
The corpus was first presented at [ICWSM 2024](https://www.icwsm.org/2024/).
> *Jonas Glasebach, Max-Emanuel Keller, Alexander Döschl, Peter Mandl<br>
> **GMHP7k: A corpus of German misogynistic hatespeech posts**<br>
> Proceedings of the Eighteenth International AAAI Conference on Web and Social Media<br>
> Buffalo, NY, USA, June 6–9, 2024<br>*
If you are using the corpus, please cite the following publication. You can find a copy of the [paper here](https://www.icwsm.org/2024/). Reference in BibTeX format:
```
@inproceedings{Glasebach.2024,
author = {Glasebach, Jonas and Keller, Max-Emanuel and Döschl, Alexander and Mandl, Peter},
title = {GMHP7k: A corpus of German misogynistic hatespeech posts},
booktitle = {Proceedings of the Eighteenth International AAAI Conference on Web and Social Media},
series = {ICWSM 2024},
year = {2024},
location = {Buffalo, NY, USA},
}
```
-->

## How to use the data set?
The [repository for this page](https://github.com/ccwi/corpus-gmhp7k) provides the data set of the corpus along with the statistics and instructions for use.

## About
The presented corpus was developed during a project of the <a href="https://www.wirtschaftsinformatik-muenchen.de/">Competence Center Wirtschaftsinformatik (CCWI)</a> at the Munich University of Applied Sciences.

<a href="https://www.wirtschaftsinformatik-muenchen.de/"><img src="images/logo-ccwi.png" height="50px"></a>

## Acknowledgement
Our special thanks go to the experts who contributed to the annotation of the corpus. The presented work was conducted as part of a project funded by *Forschungs- und Entwicklungsprogramm Informations- und Kommunikationstechnik des Freistaates Bayern*. Funding reference number: DIK-2104-0033// DIK0278/01, DIK0278/02, DIK0278/03.

The methodology of this work was inspired by the great work of [Schabus et al.](http://dx.doi.org/10.1145/3077136.3080711), who created the [One Million Posts Corpus](https://ofai.github.io/million-post-corpus/) together with the Austrian newspaper *Der Standard* from user comments under online articles on the newspaper's site.

<!--
## How to run the experiments?
-->
4 changes: 4 additions & 0 deletions _config.yml
@@ -0,0 +1,4 @@
theme: jekyll-theme-primer
# theme: jekyll-theme-slate
title: Corpus of German misogynistic hatespeech posts (GMHP7k)
description: A corpus of 7,061 annotated German social media posts on hatespeech and misogynistic hatespeech
62 changes: 62 additions & 0 deletions _layouts/default-slate.html
@@ -0,0 +1,62 @@
<!DOCTYPE html>
<html lang="{{ site.lang | default: "en-US" }}">

<head>
<meta charset='utf-8'>
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,maximum-scale=2">
<link rel="stylesheet" type="text/css" media="screen" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">

{% seo %}
</head>

<body>

<!-- HEADER -->
<div id="header_wrap" class="outer">
<header class="inner">
{% if site.github.is_project_page %}
<a id="forkme_banner" href="{{ site.github.repository_url }}">View on GitHub</a>
{% endif %}

<h1 id="project_title">{{ site.title | default: site.github.repository_name }}</h1>
<h2 id="project_tagline">{{ site.description | default: site.github.project_tagline }}</h2>

{% if site.show_downloads %}
<section id="downloads">
<a class="zip_download_link" href="{{ site.github.zip_url }}">Download this project as a .zip file</a>
<a class="tar_download_link" href="{{ site.github.tar_url }}">Download this project as a tar.gz file</a>
</section>
{% endif %}
</header>
</div>

<!-- MAIN CONTENT -->
<div id="main_content_wrap" class="outer">
<section id="main_content" class="inner">
{{ content }}
</section>
</div>

<!-- FOOTER -->
<div id="footer_wrap" class="outer">
<footer class="inner">
{% if site.github.is_project_page %}
<p class="copyright">{{ site.title | default: site.github.repository_name }} maintained by <a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a></p>
{% endif %}
<p>Published with <a href="https://pages.github.com">GitHub Pages</a></p>
</footer>
</div>

{% if site.google_analytics %}
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', '{{ site.google_analytics }}', 'auto');
ga('send', 'pageview');
</script>
{% endif %}
</body>
</html>
38 changes: 38 additions & 0 deletions _layouts/default.html
@@ -0,0 +1,38 @@
<!DOCTYPE html>
<html lang="{{ site.lang | default: "en-US" }}">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">

{% seo %}
<link rel="stylesheet" href="{{ "/assets/css/style.css?v=" | append: site.github.build_revision | relative_url }}">
</head>
<body>
<div class="container-lg px-3 my-5 markdown-body">
{% if site.title and site.title != page.title %}
<h1><a href="{{ "/" | absolute_url }}">{{ site.title }}</a></h1>
{% endif %}

{{ content }}

{% if site.github.private != true and site.github.license %}
<div class="footer border-top border-gray-light mt-5 pt-3 text-right text-gray">
This site is open source. {% github_edit_link "Improve this page" %}.
</div>
{% endif %}
</div>
<script src="https://cdnjs.cloudflare.com/ajax/libs/anchor-js/4.1.0/anchor.min.js" integrity="sha256-lZaRhKri35AyJSypXXs4o6OPFTbTmUoltBbDCbdzegg=" crossorigin="anonymous"></script>
<script>anchors.add();</script>
{% if site.google_analytics %}
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', '{{ site.google_analytics }}', 'auto');
ga('send', 'pageview');
</script>
{% endif %}
</body>
</html>
1 change: 1 addition & 0 deletions assets/css/style.scss
@@ -0,0 +1 @@
@import "{{ site.theme }}";
Binary file added images/interrater-reliability_hs.png
Binary file added images/interrater-reliability_mhs.png
Binary file added images/logo-ccwi.png
Binary file added images/wordcloud_hatespeech_50.png
Binary file added images/wordcloud_misogynistic_hatespeech_50.png
Binary file added images/wordcloud_neutral_50.png
