Language modeling
Dutch Language Models
- Hugging Face Dutch Models (https://huggingface.co/models?search=dutch): a search for Dutch models on the Hugging Face Hub. Most of the encoder models below can be loaded directly from the Hub; see the usage sketch after this list.
- BERTje (https://github.com/wietsedv/bertje): A Dutch pre-trained BERT model developed at the University of Groningen. Compared to multilingual BERT, which includes Dutch but is trained only on Wikipedia text, BERTje is trained on a large and diverse dataset of 2.4 billion tokens. BERTje's vocabulary changed in 2021. Original paper: http://arxiv.org/abs/1912.09582
- RobBERT (https://pieter.ai/robbert/): A state-of-the-art RoBERTa-based Dutch language model, released in 2020. Original paper: https://arxiv.org/abs/2001.06286; RoBERTa paper: https://arxiv.org/abs/1907.11692v1
- RobBERT-2022 (https://pieter.ai/robbert-2022/): An update of the RobBERT Dutch language model that adds new high-frequency tokens present in the latest Dutch OSCAR corpus from 2022, on which the model was then pre-trained. It is a plug-in replacement for RobBERT and yields a significant performance increase on certain language tasks. Original paper: https://arxiv.org/abs/2211.08192v1
- RobBERT-2023 (https://pieter.ai/robbert-2023/): A Dutch language model with a freshly trained tokenizer, pre-trained on the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the energy crisis, while mitigating the inclusion of previously over-represented terms from adult-oriented content. Unlike prior versions of RobBERT, which followed the training methodology of RoBERTa but required a fresh weight initialization, RobBERT-2023 is initialized entirely from the RoBERTa-large model. Original paper: https://clin33.uantwerpen.be/abstract/robbert-2023-keeping-dutch-language-models-up-to-date-at-a-lower-cost-thanks-to-model-conversion/ Related paper: https://arxiv.org/pdf/2310.03477
- RobBERTje (https://github.com/iPieter/robbertje): A collection of distilled versions of the state-of-the-art Dutch RobBERT model, available in multiple sizes and training settings. Original paper: https://arxiv.org/abs/2204.13511v1
- MedRoBERTa.nl (https://huggingface.co/CLTL/MedRoBERTa.nl): One of only two publicly available encoder models worldwide pre-trained on free text from real-world hospital data. The anonymized model gives NLP researchers and medical professionals a solid base model for medical text mining; it can be fine-tuned for any task. Original paper: https://clinjournal.org/clinj/article/view/132
- CoNTACT: CoNTACT (Contextual Neural Transformer Adapted to COVID-19 Tweets) is a Dutch RobBERT model (pdelobelle/robbert-v2-dutch-base) adapted to the domain of COVID-19 tweets.
- ChocoLlama: A set of six Llama-2/3 based open models adapted to Dutch. Original paper: https://arxiv.org/html/2412.07633v1
- Fietje 2: Fietje is a family of small open language models (SLMs) specifically designed for Dutch, based on Phi 2, an English-centric model with 2.7 billion parameters. Of the family, fietje-2b-chat is best suited for use as an assistant. Original paper: https://arxiv.org/abs/2412.15450
- Tweety-7b-dutch: A foundation model focused on the Dutch language, incorporating a Dutch tokenizer for better understanding and generation of Dutch text. It is built on the Mistral architecture. Original paper: https://arxiv.org/abs/2408.04303
- Reynaerde 7B Chat: An open conversational model for Dutch, based on Mistral v0.3 Instruct. This model is a fine-tuned version of ReBatch/Reynaerde-7B-Instruct on ReBatch/ultrafeedback_nl.
- Schaapje-2B-chat-V1.0: Schaapje is a powerful Dutch small language model that performs well in Dutch conversation, instruction following, and RAG applications.
- FinGEITJE: FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance. Original paper: https://arxiv.org/abs/2410.12835
- GPT-NL: A Dutch language model under development by the non-profit parties TNO, NFI, and SURF, funded by the Dutch Ministry of Economic Affairs and Climate Policy. It is currently being trained, and the first version is expected to be available in Q1 2026.
- GEITje: A large, open Dutch language model. The model is no longer available: https://goingdutch.ai/en/posts/geitje-takedown
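Most of the encoder models above are published on the Hugging Face Hub and can be loaded with the transformers library. A minimal sketch of masked-word prediction, using the pdelobelle/robbert-v2-dutch-base checkpoint mentioned above (any other checkpoint ID from this list can be swapped in):

 from transformers import pipeline

 # Masked-language-model pipeline with the RobBERT v2 base checkpoint.
 fill_mask = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

 # RoBERTa-style models use <mask> as the mask token (BERT-style models use [MASK]).
 for prediction in fill_mask("Er staat een <mask> in mijn tuin."):
     print(prediction["token_str"], prediction["score"])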
Multilingual Language Models including Dutch
- MBart: Multilingual Denoising Pre-training for Neural Machine Translation
- mT5: A massively multilingual pre-trained text-to-text transformer
- NLLB: No Language Left Behind (see the translation sketch below)
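All three models are available through the transformers library. As an illustration, a hedged sketch of English-to-Dutch translation with a distilled NLLB checkpoint (the checkpoint ID facebook/nllb-200-distilled-600M is an assumption; any NLLB checkpoint should work):

 from transformers import pipeline

 # NLLB uses FLORES-200 language codes: eng_Latn for English, nld_Latn for Dutch.
 translator = pipeline(
     "translation",
     model="facebook/nllb-200-distilled-600M",
     src_lang="eng_Latn",
     tgt_lang="nld_Latn",
 )

 print(translator("Language models for Dutch are improving quickly.")[0]["translation_text"])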
SpaCy
spaCy is a free open-source library for Natural Language Processing in Python.
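A minimal usage sketch for Dutch, assuming the small Dutch pipeline has been installed with python -m spacy download nl_core_news_sm:

 import spacy

 # Load the small Dutch pipeline (tokenizer, tagger, parser, NER, lemmatizer).
 nlp = spacy.load("nl_core_news_sm")

 doc = nlp("De snelle bruine vos springt over de luie hond.")
 for token in doc:
     print(token.text, token.lemma_, token.pos_)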
Language Modeling Benchmarks
DUMB
DUMB is a benchmark for evaluating the quality of language models for Dutch NLP tasks. The set of tasks is designed to be diverse and challenging, to test the limits of current language models. The specific datasets and formats are particularly suitable for fine-tuning encoder models, and applicability for large generative models is yet to be determined. Please read the paper for more details.
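Since the DUMB tasks are formatted for fine-tuning encoder models, the sketch below shows that workflow with the transformers Trainer. The dataset ID dumb/example-task and its column names are hypothetical placeholders, and BERTje (GroNLP/bert-base-dutch-cased) stands in for any of the encoders listed above:

 from datasets import load_dataset
 from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                           Trainer, TrainingArguments)

 # Hypothetical dataset ID; substitute an actual DUMB task dataset.
 dataset = load_dataset("dumb/example-task")

 tokenizer = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

 def tokenize(batch):
     return tokenizer(batch["text"], truncation=True)  # a "text" column is assumed

 dataset = dataset.map(tokenize, batched=True)

 # A "label" column with two classes is assumed.
 model = AutoModelForSequenceClassification.from_pretrained(
     "GroNLP/bert-base-dutch-cased", num_labels=2)

 trainer = Trainer(
     model=model,
     args=TrainingArguments(output_dir="bertje-dumb"),
     train_dataset=dataset["train"],
     eval_dataset=dataset["validation"],
     tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
 )
 trainer.train()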
LLM Leaderboard
A leaderboard for large language models on Dutch benchmarks.
ScandEval Dutch NLG
The Dutch natural language generation leaderboard of ScandEval, a multilingual benchmarking framework that also covers Dutch.
n-gram modeling
Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool colibri-patternmodeller, which allows you to build, view, manipulate, and query pattern models.
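To make the basic constructions concrete, here is a small pure-Python sketch (an illustration of the concepts, not the Colibri Core API, which is far faster and more memory-efficient) that extracts n-grams and single-gap skipgrams from a token sequence:

 def ngrams(tokens, n):
     """All contiguous n-grams of a token sequence."""
     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

 def skipgrams(tokens, n, gap_size=1):
     """Patterns of n tokens containing one fixed-size gap, marked with '_'."""
     patterns = []
     for gram in ngrams(tokens, n + gap_size):
         for start in range(1, n):  # the gap must fall strictly inside the pattern
             patterns.append(gram[:start] + ("_",) * gap_size + gram[start + gap_size:])
     return patterns

 tokens = "te zijn of niet te zijn".split()
 print(ngrams(tokens, 2))     # bigrams
 print(skipgrams(tokens, 3))  # three tokens around a one-token gap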