Awesome NLP Resources for Dutch

A curated list of resources for Dutch natural language processing (NLP).

Data

OSCAR (Open Super-large Crawled ALMAnaCH coRpus) (~50 GB) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
Dutch medical NLP (0.6 GB) is a collection of Dutch medical texts that were used for the domain-adaptive pretraining of Dutch medical language models.
DpgMedia2019: A Dutch News Dataset for Partisanship Detection (0.3 GB) is a dataset consisting of news articles from several Dutch newspapers owned by DPG Media, plus annotations of the perceived partisanship of each article.
DBRD: Dutch Book Reviews Dataset (0.2 GB) contains over 110k book reviews along with associated binary sentiment polarity labels.
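
The OSCAR entry above mentions that the corpus is distributed in both original and deduplicated form. As a rough illustration of what deduplication means for a crawled text corpus, here is a minimal sketch of exact line-level deduplication. It is an assumption made for illustration, not OSCAR's actual pipeline; the function name `deduplicate_lines` and the sample sentences are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def deduplicate_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it is seen, ignoring surrounding whitespace.

    Illustrative only: real corpus pipelines may deduplicate at a different
    granularity or use approximate matching.
    """
    seen = set()
    for line in lines:
        # Hash the normalized line so that memory use stays bounded per entry.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            yield line


# Hypothetical sample data, not taken from any corpus.
sample = ["Dit is een zin.", "Nog een zin.", "Dit is een zin.  ", "Nog een zin."]
unique = list(deduplicate_lines(sample))
print(unique)
```

Storing hashes instead of the lines themselves keeps the memory footprint predictable, which matters once the input no longer fits in memory.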

RobBERT: Dutch RoBERTa-based Language Model. RobBERT is the state-of-the-art Dutch BERT model. It is a large pre-trained general Dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task.
BERTje: A Dutch BERT model. BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
GPT-2 Recycled for Italian and Dutch: a multi-stage adaptation method for transferring GPT-2 to Dutch without unnecessary retraining.
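
Pre-trained models such as RobBERT and BERTje are usually fine-tuned by loading them with a task-specific head. The sketch below shows one way this might look with the Hugging Face `transformers` library. The Hub identifiers in `MODEL_IDS` are assumptions that do not come from this list, so check each project's own page before relying on them; the helper `load_classifier` is hypothetical.

```python
# Assumed Hugging Face Hub identifiers; verify them against each project's page.
MODEL_IDS = {
    "robbert": "pdelobelle/robbert-v2-dutch-base",
    "bertje": "GroNLP/bert-base-dutch-cased",
}


def load_classifier(name: str, num_labels: int = 2):
    """Return (tokenizer, model) with a fresh classification head for fine-tuning.

    Raises KeyError for names that are not in MODEL_IDS.
    """
    model_id = MODEL_IDS[name]
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_classifier("robbert")` downloads the weights on first use. The classification head is randomly initialised, so the model still has to be fine-tuned on a labelled dataset (for example a binary sentiment dataset) before its predictions mean anything.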

belabBERT 🤧: a RobBERT model fine-tuned for the classification of psychiatric illnesses.
Dutch Word2Vec Model: a Word2Vec model trained on a large Dutch corpus consisting of social media messages and posts from Dutch news sites, blogs and forums.
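
Word2Vec models map each word to a dense vector, and related words are typically found by comparing vectors with cosine similarity. The following is a minimal, dependency-free sketch of that lookup. The three-dimensional `toy_vectors` are made up for illustration and are not taken from the model above; `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    word: str, vectors: Dict[str, List[float]], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up toy embeddings; real Word2Vec vectors have hundreds of dimensions.
toy_vectors = {
    "fiets": [0.9, 0.1, 0.0],
    "auto": [0.8, 0.2, 0.1],
    "kaas": [0.0, 0.1, 0.9],
}
nearest = most_similar("fiets", toy_vectors, top_n=1)
print(nearest)
```

With real embeddings, a library such as gensim would normally handle this lookup; the sketch only shows the underlying computation.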

DeepFrog - NLP Suite: DeepFrog aims to be a (partial) successor of the Dutch NLP suite Frog. Whereas the various NLP modules in Frog were built on k-NN classifiers, DeepFrog builds on deep learning techniques and can use a variety of neural transformers.

Deduce: a de-identification method for Dutch medical text
Dutch word list, by Stichting OpenTaal
Leipzig Corpora Collection: Dutch corpus based on web text, by Leipzig University
Dutch spelling checker, by Stichting OpenTaal

To the extent possible under law, Bram Zijlstra has waived all copyright and related or neighboring rights to this work.