Language modeling/en: Difference between revisions

Latest revision as of 14:59, 13 November 2025

Dutch Language Models

BERTje: A Dutch pre-trained BERT model developed at the University of Groningen. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. The vocabulary size of BERTje has changed in 2021. Encoder only. Original paper: http://arxiv.org/abs/1912.09582
RobBERT: A RoBERTa-based state-of-the-art Dutch language model, which was released in 2020. Encoder only. Original paper: https://arxiv.org/abs/2001.06286
RobBERT-2022: An update of the RobBERT Dutch language model to include new high-frequency tokens present in the latest Dutch OSCAR corpus from 2022. The RobBERT model was then pre-trained using this dataset. This model is a plug-in replacement for RobBERT and results in a significant performance increase for certain language tasks. Encoder only. Original paper: https://arxiv.org/abs/2211.08192v1
RobBERT-2023: A freshly pre-trained Dutch tokenizer using the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the ongoing energy crisis. Unlike the prior versions of RobBERT, which relied on the training methodology of RoBERTa but required a fresh weight initialization, RobBERT-2023 is entirely initialized using the RoBERTa-large model. Encoder only. Original paper: https://www.clinjournal.org/clinj/article/view/180 Related paper: https://arxiv.org/pdf/2310.03477
RobBERTje: A collection of distilled versions of the state-of-the-art Dutch RobBERT model. There are multiple models with different sizes and different training settings. Encoder only. Original paper: https://arxiv.org/abs/2204.13511v1
MedRoBERTa: MedRoBERTa.nl is one of the two encoder models worldwide that have been pre-trained on free text from real-world hospital data and are publicly available. The anonymized model allows NLP researchers and medical professionals to build medical text mining technology with a solid base model: the model can be fine-tuned for any task. Encoder only. Original paper: https://clinjournal.org/clinj/article/view/132
belabBERT: A Dutch RoBERTa-based language model applied to psychiatric classification, pretrained on the Dutch unshuffled OSCAR corpus using the masked language modeling (MLM) objective. The model is case sensitive and includes punctuation. Encoder only. Original paper: https://arxiv.org/abs/2106.01091
CoNTACT: CoNTACT (Contextual Neural Transformer Adapted to COVID-19 Tweets) is a Dutch RobBERT model (pdelobelle/robbert-v2-dutch-base) adapted to the domain of COVID-19 tweets. Encoder only. Original paper: https://arxiv.org/abs/2203.07362v1
ChocoLlama: A set of six Llama-2/3 based open models adapted to Dutch. Trained on Dutch language materials from both Belgium and the Netherlands. Decoder only. Original paper: https://arxiv.org/html/2412.07633v1
Fietje 2: Fietje is a family of small open language models (SLMs) specifically designed for the Dutch language. The model is based on Phi 2, an English-centric model of 2.7 billion parameters. The fietje-2b-chat model is the one that is best suited as an assistant. Decoder only. Original paper: https://arxiv.org/abs/2412.15450
Tweety-7b-dutch: This is a foundation model with a focus on the Dutch language, incorporating a Dutch tokenizer for better understanding and generation of Dutch text. It is built on the Mistral architecture. Decoder only. Original paper: https://arxiv.org/abs/2408.04303
Reynaerde 7B Chat: An open conversational model for Dutch, based on Mistral v0.3 Instruct. This model is a fine-tuned version of https://huggingface.co/ReBatch/Reynaerde-7B-Instruct on https://huggingface.co/datasets/ReBatch/ultrafeedback_nl . Decoder only.
Schaapje-2B-chat-V1.0: Schaapje is a small, powerful Dutch language model. It performs well in Dutch conversations, instruction following and RAG applications. It is based on the IBM Granite 3.0 2B Instruct model. Decoder only.
GEITje 7B: A Large Open Dutch language model with 7 billion parameters, based on Mistral 7B. This model is no longer available: https://goingdutch.ai/en/posts/geitje-takedown
Geitje 7B Ultra: A conversational model for Dutch. The GEITje model, derived from Mistral 7B, was enhanced through supervised finetuning with synthetic conversational datasets and preference alignment to improve its capabilities in Dutch. Decoder only. Original paper: https://arxiv.org/abs/2412.04092
FinGEITje: FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance. Decoder only. Original paper: https://arxiv.org/abs/2410.12835
GPT-NL: A Dutch language model currently being developed by non-profit parties TNO, NFI and SURF, funded by the Dutch Ministry of Economic Affairs and Climate Policy. It is currently being trained and the first version is expected to be available in Q1 2026.
Hugging Face Dutch Models
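The "Encoder only" and "Decoder only" labels in the list above can be illustrated with a minimal sketch of the attention pattern each kind of model typically uses. This is a simplified illustration, not code from any of the listed models: the function name `attention_mask` and the list-of-lists representation are assumptions made for this example. Encoder-only models such as BERT and RoBERTa variants let every token attend to every other token, while decoder-only models restrict each token to itself and earlier positions.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len matrix of 0/1 flags.

    mask[i][j] == 1 means the token at position i may attend to position j.
    `attention_mask` is a hypothetical helper written for this illustration.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-only style: bidirectional, every position sees the whole sequence.
encoder_mask = attention_mask(4, causal=False)

# Decoder-only style: causal, each position sees only itself and the past.
decoder_mask = attention_mask(4, causal=True)

for row in decoder_mask:
    print(row)
```

The bidirectional pattern suits masked language modeling and fine-tuning for classification, while the causal pattern is what allows left-to-right text generation.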

Multilingual Language Models including Dutch

Aya Expanse 8B: Aya Expanse 8B is an open-weight research release of a model with highly advanced multilingual capabilities. Model architecture: Aya Expanse 8B is an auto-regressive language model that uses an optimized transformer architecture. Post-training includes supervised finetuning, preference training, and model merging. Decoder only. Original paper: https://arxiv.org/abs/2412.04261
EUBERT: This is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the European Publications Office. EUBERT serves as a starting point for building more specific natural language understanding models. Its versatility makes it suitable for a wide range of tasks, including but not limited to: text classification, question answering and language understanding. Model architecture: BERT (Bidirectional Encoder Representations from Transformers)
EuroLLM-1.7B: The EuroLLM project has the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-1.7B is a 1.7B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: Web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-1.7B-Instruct was further instruction tuned on EuroBlocks, an instruction tuning dataset with focus on general instruction-following and machine translation. Model type: A 1.7B parameter multilingual transformer LLM. Original paper: https://arxiv.org/abs/2409.16235
EuroLLM-9B: The EuroLLM project has the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-9B is a 9B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: Web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-9B-Instruct was further instruction tuned on EuroBlocks, an instruction tuning dataset with focus on general instruction-following and machine translation. Model type: A 9B parameter multilingual transformer LLM. Original paper: https://arxiv.org/abs/2409.16235
Salamandra-7B-instruct: Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants. This model corresponds to the 7B instructed version. Model type: transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code. Original paper: https://arxiv.org/abs/2502.08489
Falcon H1: Falcon-H1 is the latest evolution in the Falcon family of large language models and is built upon an advanced hybrid architecture—where each block integrates both State Space Models (SSMs) and Attention Mechanisms. Falcon-H1 was initially trained with support for 18 core languages, including Dutch, with scalability to 100+ languages, achieving state-of-the-art multilingual and reasoning performances in instruction following, maths, coding, and multilingual tasks. Original paper: https://arxiv.org/abs/2507.22448
Neo Babel: This is a novel multilingual image generation framework. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. Original paper: https://arxiv.org/abs/2507.06137v1
Alfred-40b-1023: Alfred-40B-1023 can be used as a chat model or as an instruct model. It has limited capacities in Dutch. Model type: Causal decoder-only.
MBart: Multilingual Denoising Pre-training for Neural Machine Translation
mT5: A massively multilingual pre-trained text-to-text transformer

NLLB: No Language Left Behind

SpaCy

spaCy is a free open-source library for Natural Language Processing in Python.

Dutch models

Language Modeling Benchmarks

DUMB

DUMB is a benchmark for evaluating the quality of language models for Dutch NLP tasks. The set of tasks is designed to be diverse and challenging, to test the limits of current language models. The specific datasets and formats are particularly suitable for fine-tuning encoder models, and applicability for large generative models is yet to be determined. Original paper: https://arxiv.org/abs/2305.13026

DuMB

European LLM Leaderboard

This is a collection of multilingual evaluation results obtained using the OpenGPT-X fork of the LM-evaluation-harness (https://github.com/OpenGPTX/lm-evaluation-harness), based on V1 of the https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Currently, benchmarks are available in 21 European languages (Irish, Maltese and Croatian are missing).

European LLM Leaderboard
Open Dutch LLM Leaderboard: This leaderboard has been discontinued.

Euroeval monolingual Dutch and multilingual Germanic

Euroeval Monolingual Dutch: This leaderboard includes not only monolingual Dutch models, but also monolingual models in other languages such as English and German, Scandinavian models, and multilingual models where Dutch is not necessarily included. Note: this leaderboard "replaces" scandeval.com/dutch-nlg/.
Euroeval multilingual Germanic: This leaderboard includes Danish, Dutch, English, Faroese, German, Icelandic, Norwegian and Swedish.

n-gram modeling

Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool colibri-patternmodeller which allows you to build, view, manipulate and query pattern models.

Github repository
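
To make the n-gram and skipgram terminology above concrete, here is a minimal pure-Python sketch. It does not use the Colibri Core API and is far less memory-efficient than that library. The helper names `ngrams` and `skipgrams`, the `{*}` gap marker and the example sentence are all assumptions chosen for this illustration. Only fixed-size gaps are shown, and the first and last token of each pattern are always kept.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, gap="{*}"):
    """Return fixed-size skipgrams derived from the n-grams of `tokens`.

    Every non-empty subset of the inner positions of an n-gram is replaced
    by the gap marker. The first and last token are never replaced.
    """
    results = []
    inner = range(1, n - 1)
    for gram in ngrams(tokens, n):
        for k in range(1, len(inner) + 1):
            for positions in combinations(inner, k):
                results.append(
                    tuple(gap if i in positions else tok
                          for i, tok in enumerate(gram))
                )
    return results


tokens = "de kat zit op de mat".split()

# Count bigrams, a very small stand-in for a pattern model.
bigram_counts = Counter(ngrams(tokens, 2))

# Trigram-sized skipgrams have exactly one gap in the middle.
trigram_skips = skipgrams(tokens, 3)

print(bigram_counts.most_common(3))
print(trigram_skips)
```

A dedicated tool is preferable for real corpora, since enumerating patterns this way grows quickly with both n and corpus size.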

@@ Line 1: / Line 1: @@
-==n-gram modeling==
+<languages/>
-Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool colibri-patternmodeller which allows you to build, view, manipulate and query pattern models.
+==Dutch Language Models==
-*[http://proycon.github.io/colibri-core/ Github repository]
+* [https://github.com/wietsedv/bertje BERTje]: A Dutch pre-trained BERT model developed at the University of Groningen. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. The vocabulary size of BERTje has changed in 2021. Encoder only. Original paper: http://arxiv.org/abs/1912.09582
+* [https://pieter.ai/robbert/ RobBERT]: A RoBERTa-based state-of-the-art Dutch language model, which was released in 2020. Encoder only. Original paper: https://arxiv.org/abs/2001.06286
+* [https://pieter.ai/robbert-2022/ RobBERT-2022]: An update of the RobBERT Dutch language model to include new high-frequency tokens present in the latest Dutch OSCAR corpus from 2022. The RobBERT model was then pre-trained using this dataset. This model is a plug-in replacement for RobBERT and results in a significant performance increase for certain language tasks. Encoder only. Original paper: https://arxiv.org/abs/2211.08192v1
+* [https://pieter.ai/robbert-2023/ RobBERT-2023]: A freshly pre-trained Dutch tokenizer using the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the ongoing energy crisis. Unlike the prior versions of RobBERT, which relied on the training methodology of RoBERTa but required a fresh weight initialization, RobBERT-2023 is entirely initialized using the RoBERTa-large model. Encoder only. Original paper: https://www.clinjournal.org/clinj/article/view/180 Related paper: https://arxiv.org/pdf/2310.03477
+* [https://github.com/iPieter/robbertje RobBERTje]: A collection of distilled versions of the state-of-the-art Dutch RobBERT model. There are multiple models with different sizes and different training settings. Encoder only. Original paper: https://arxiv.org/abs/2204.13511v1
+* [https://huggingface.co/CLTL/MedRoBERTa.nl MedRoBERTa]: MedRoBERTa.nl is one of the two encoder models worldwide that have been pre-trained on free text from real-world hospital data and are publicly available. The anonymized model allows NLP researchers and medical professionals to build medical text mining technology with a solid base model: the model can be fine-tuned for any task. Encoder only. Original paper: https://clinjournal.org/clinj/article/view/132
+* [https://github.com/Joppewouts/belabBERT belabBERT]: A Dutch RoBERTa-based language model applied to psychiatric classification, pretrained on the Dutch unshuffled OSCAR corpus using the masked language modeling (MLM) objective. The model is case sensitive and includes punctuation. Encoder only. Original paper: https://arxiv.org/abs/2106.01091
+* [https://huggingface.co/clips/contact CoNTACT]: CoNTACT (Contextual Neural Transformer Adapted to COVID-19 Tweets) is a Dutch RobBERT model (pdelobelle/robbert-v2-dutch-base) adapted to the domain of COVID-19 tweets. Encoder only. Original paper: https://arxiv.org/abs/2203.07362v1
+* [https://github.com/ChocoLlamaModel/ChocoLlama ChocoLlama]: A set of six Llama-2/3 based open models adapted to Dutch. Trained on Dutch language materials from both Belgium and the Netherlands. Decoder only. Original paper: https://arxiv.org/html/2412.07633v1
+*[https://github.com/BramVanroy/fietje-2 Fietje 2]: Fietje is a family of small open language models (SLMs) specifically designed for the Dutch language. The model is based on Phi 2, an English-centric model of 2.7 billion parameters. The fietje-2b-chat model is the one that is best suited as an assistant. Decoder only. Original paper: https://arxiv.org/abs/2412.15450
+*[https://huggingface.co/Tweeties/tweety-7b-dutch-v24a Tweety-7b-dutch]: This is a foundation model with a focus on the Dutch language, incorporating a Dutch tokenizer for better understanding and generation of Dutch text. It is built on the Mistral architecture. Decoder only. Original paper: https://arxiv.org/abs/2408.04303
+*[https://huggingface.co/ReBatch/Reynaerde-7B-Chat Reynaerde 7B Chat]: An open conversational model for Dutch, based on Mistral v0.3 Instruct. This model is a fine-tuned version of https://huggingface.co/ReBatch/Reynaerde-7B-Instruct on https://huggingface.co/datasets/ReBatch/ultrafeedback_nl . Decoder only.
+* [https://huggingface.co/robinsmits/Schaapje-2B-Chat-V1.0 Schaapje-2B-chat-V1.0]: Schaapje is a small, powerful Dutch language model. It performs well in Dutch conversations, instruction following and RAG applications. It is based on the IBM Granite 3.0 2B Instruct model. Decoder only.
+* [https://github.com/Rijgersberg/GEITje GEITje 7B]: A Large Open Dutch language model with 7 billion parameters, based on Mistral 7B. This model is no longer available: https://goingdutch.ai/en/posts/geitje-takedown
+* [https://huggingface.co/papers/2412.04092 Geitje 7B Ultra]: A conversational model for Dutch. The GEITje model, derived from Mistral 7B, was enhanced through supervised finetuning with synthetic conversational datasets and preference alignment to improve its capabilities in Dutch. Decoder only. Original paper: https://arxiv.org/abs/2412.04092
+* [https://huggingface.co/snoels/FinGEITje-7B-sft FinGEITje]: FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance. Decoder only. Original paper: https://arxiv.org/abs/2410.12835
+*[https://GPT-NL.nl GPT-NL]: A Dutch language model currently being developed by non-profit parties TNO, NFI and SURF, funded by the Dutch Ministry of Economic Affairs and Climate Policy. It is currently being trained and the first version is expected to be available in Q1 2026.
+* [https://huggingface.co/models?search=dutch Hugging Face Dutch Models]
-==Large Language Models==
-* [https://huggingface.co/models?search=dutch Hugging Face Dutch Models]
-* [https://people.cs.kuleuven.be/~pieter.delobelle/robbert/ RobBERT]: A Dutch RoBERTa-based Language Model
-* [https://github.com/wietsedv/bertje BERTje]: A Dutch BERT model
-* [https://github.com/Rijgersberg/GEITje GEITje]: A Large Open Language Model
 ==Multilingual Language Models including Dutch==
-* [https://openai.com/ GPT-3]
-* [https://huggingface.co/docs/transformers/model_doc/mbart MBart]
+* [https://huggingface.co/CohereLabs/aya-expanse-8b Aya Expanse 8B]: Aya Expanse 8B is an open-weight research release of a model with highly advanced multilingual capabilities. Model architecture: Aya Expanse 8B is an auto-regressive language model that uses an optimized transformer architecture. Post-training includes supervised finetuning, preference training, and model merging. Decoder only. Original paper: https://arxiv.org/abs/2412.04261
+* [https://huggingface.co/EuropeanParliament/EUBERT EUBERT]: This is a pretrained BERT uncased model that has been trained on a vast corpus of documents registered by the European Publications Office. EUBERT serves as a starting point for building more specific natural language understanding models. Its versatility makes it suitable for a wide range of tasks, including but not limited to: text classification, question answering and language understanding. Model architecture: BERT (Bidirectional Encoder Representations from Transformers)
+* [https://huggingface.co/utter-project/EuroLLM-1.7B EuroLLM-1.7B]: The EuroLLM project has the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-1.7B is a 1.7B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: Web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-1.7B-Instruct was further instruction tuned on EuroBlocks, an instruction tuning dataset with focus on general instruction-following and machine translation. Model type: A 1.7B parameter multilingual transformer LLM. Original paper: https://arxiv.org/abs/2409.16235
+* [https://huggingface.co/utter-project/EuroLLM-9B EuroLLM-9B]: The EuroLLM project has the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-9B is a 9B parameter model trained on 4 trillion tokens divided across the considered languages and several data sources: Web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-9B-Instruct was further instruction tuned on EuroBlocks, an instruction tuning dataset with focus on general instruction-following and machine translation. Model type: A 9B parameter multilingual transformer LLM. Original paper: https://arxiv.org/abs/2409.16235
+* [https://huggingface.co/BSC-LT/salamandra-7b-instruct Salamandra-7B-instruct]: Salamandra is a highly multilingual model pre-trained from scratch that comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants. This model corresponds to the 7B instructed version. Model type: transformer-based decoder-only language model that has been pre-trained from scratch on 12.875 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code. Original paper: https://arxiv.org/abs/2502.08489
+* [https://github.com/tiiuae/falcon-h1 Falcon H1]: Falcon-H1 is the latest evolution in the Falcon family of large language models and is built upon an advanced hybrid architecture—where each block integrates both State Space Models (SSMs) and Attention Mechanisms. Falcon-H1 was initially trained with support for 18 core languages, including Dutch, with scalability to 100+ languages, achieving state-of-the-art multilingual and reasoning performances in instruction following, maths, coding, and multilingual tasks. Original paper: https://arxiv.org/abs/2507.22448
+* [https://neo-babel.github.io/ Neo Babel]: This is a novel multilingual image generation framework. The model is trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. Original paper: https://arxiv.org/abs/2507.06137v1
+* [https://huggingface.co/lightonai/alfred-40b-1023 Alfred-40b-1023]: Alfred-40B-1023 can be used as a chat model or as an instruct model. It has limited capacities in Dutch. Model type: Causal decoder-only.
+* [https://huggingface.co/docs/transformers/model_doc/mbart MBart]:  Multilingual Denoising Pre-training for Neural Machine Translation
+* [https://huggingface.co/docs/transformers/v4.14.1/model_doc/mt5 mT5]: A massively multilingual pre-trained text-to-text transformer
+* [https://huggingface.co/docs/transformers/model_doc/nllb NLLB]: No Language Left Behind
 ==SpaCy==
 spaCy is a free open-source library for Natural Language Processing in Python.
@@ Line 20: / Line 45: @@
 == Language Modeling Benchmarks ==
 ===DUMB===
-DUMB is a benchmark for evaluating the quality of language models for Dutch NLP tasks. The set of tasks is designed to be diverse and challenging, to test the limits of current language models. The specific datasets and formats are particularly suitable for fine-tuning encoder models, and applicability for large generative models is yet to be determined. Please read the paper for more details.
+DUMB is a benchmark for evaluating the quality of language models for Dutch NLP tasks. The set of tasks is designed to be diverse and challenging, to test the limits of current language models. The specific datasets and formats are particularly suitable for fine-tuning encoder models, and applicability for large generative models is yet to be determined. Original paper: https://arxiv.org/abs/2305.13026
 * [https://dumbench.nl/ DuMB]
-===LLM Leaderboard===
-This is a leaderboard for Dutch benchmarks for large language models.
-* [https://huggingface.co/spaces/BramVanroy/open_dutch_llm_leaderboard Open Dutch LLM Leaderboard]
+===European LLM Leaderboard===
+This is a collection of multilingual evaluation results obtained using the OpenGPT-X fork of the LM-evaluation-harness (https://github.com/OpenGPTX/lm-evaluation-harness), based on V1 of the https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Currently, benchmarks are available in 21 European languages (Irish, Maltese and Croatian are missing).
+* [https://huggingface.co/spaces/Eurolingua/european-llm-leaderboard European LLM Leaderboard]
+* Open Dutch LLM Leaderboard: This leaderboard has been discontinued.
+=== Euroeval monolingual Dutch and multilingual Germanic===
+*[https://euroeval.com/leaderboards/Monolingual/dutch/ Euroeval Monolingual Dutch]: This leaderboard includes not only monolingual Dutch models, but also monolingual models in other languages such as English and German, Scandinavian models, and multilingual models where Dutch is not necessarily included. Note: this leaderboard "replaces" scandeval.com/dutch-nlg/.
+*[https://euroeval.com/leaderboards/Multilingual/germanic/ Euroeval multilingual Germanic]: This leaderboard includes Danish, Dutch, English, Faroese, German, Icelandic, Norwegian and Swedish.
+==n-gram modeling==
+Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way. At the core is the tool colibri-patternmodeller which allows you to build, view, manipulate and query pattern models.
+*[http://proycon.github.io/colibri-core/ Github repository]