A curated list of resources for Dutch natural language processing (NLP).
- OSCAR (Open Super-large Crawled ALMAnaCH coRpus) (~50 GB) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
- Dutch medical NLP (0.6 GB) is a collection of Dutch medical texts that were used for domain-adaptive pretraining of Dutch medical language models.
- DpgMedia2019: A Dutch News Dataset for Partisanship Detection (0.3 GB) is a dataset consisting of news articles from several Dutch newspapers owned by DPG Media, plus annotations of the perceived partisanship of each article.
- DBRD: Dutch Book Reviews Dataset (0.2 GB) contains over 110k book reviews along with associated binary sentiment polarity labels.
- RobBERT: Dutch RoBERTa-based Language Model. RobBERT is the state-of-the-art Dutch BERT model. It is a large pre-trained general Dutch language model that can be fine-tuned on a given dataset to perform any text classification, regression or token-tagging task.
- BERTje: A Dutch BERT model. BERTje is a Dutch pre-trained BERT model developed at the University of Groningen.
- GPT-2 Recycled for Italian and Dutch: a multi-stage adaptation method for transferring GPT-2 to Dutch without unnecessary retraining.
- belabBERT 🤧: a RobBERT model fine-tuned for the classification of psychiatric illnesses.
- Dutch Word2Vec Model: a Word2Vec model trained on a large Dutch corpus comprising social media messages and posts from Dutch news, blogs and fora.
- DeepFrog - NLP Suite: DeepFrog aims to be a (partial) successor of the Dutch-NLP suite Frog. Whereas the various NLP modules in Frog were built on k-NN classifiers, DeepFrog builds on deep learning techniques and can use a variety of neural transformers.
- Training a Dutch GPT-2 base model
- Dutch GPT2: Autoregressive Language Modelling on a budget
- NLP with R part 5: State of the Art in NLP: Transformers & BERT
- Deduce: a de-identification method for Dutch medical text
- Dutch word list, by Stichting OpenTaal
- Leipzig Corpora Collection: Dutch web text corpus, by Leipzig University
- Dutch spelling checker, by Stichting OpenTaal
To the extent possible under law, Bram Zijlstra has waived all copyright and related or neighboring rights to this work.