==Dutch Language Models==

* [https://github.com/wietsedv/bertje BERTje]: A Dutch pre-trained BERT model developed at the University of Groningen. Compared to the multilingual BERT model, which includes Dutch but is based only on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. The vocabulary of BERTje was changed in 2021. Encoder only. Original paper: http://arxiv.org/abs/1912.09582
* [https://pieter.ai/robbert/ RobBERT]: A RoBERTa-based state-of-the-art Dutch language model, released in 2020. Encoder only. Original paper: https://arxiv.org/abs/2001.06286
* [https://pieter.ai/robbert-2022/ RobBERT-2022]: An update of the RobBERT Dutch language model that adds new high-frequency tokens present in the 2022 version of the Dutch OSCAR corpus, on which the model was then pre-trained. It is a plug-in replacement for RobBERT and yields a significant performance increase on certain language tasks. Encoder only. Original paper: https://arxiv.org/abs/2211.08192v1
* [https://pieter.ai/robbert-2023/ RobBERT-2023]: A Dutch model with a freshly trained tokenizer, pre-trained on the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the energy crisis. Unlike prior versions of RobBERT, which followed the training methodology of RoBERTa but required a fresh weight initialization, RobBERT-2023 is entirely initialized from the RoBERTa-large model. Encoder only. Original paper: https://www.clinjournal.org/clinj/article/view/180 Related paper: https://arxiv.org/pdf/2310.03477
* [https://github.com/iPieter/robbertje RobBERTje]: A collection of distilled versions of the state-of-the-art Dutch RobBERT model, available in multiple sizes and with different training settings. Encoder only. Original paper: https://arxiv.org/abs/2204.13511v1
* [https://huggingface.co/CLTL/MedRoBERTa.nl MedRoBERTa.nl]: One of the two publicly available encoder models worldwide that have been pre-trained on free text from real-world hospital data. The anonymized model gives NLP researchers and medical professionals a solid base model for building medical text-mining technology: it can be fine-tuned for any task. Encoder only. Original paper: https://clinjournal.org/clinj/article/view/132
* [https://github.com/Joppewouts/belabBERT belabBERT]: A Dutch RoBERTa-based language model applied to psychiatric classification, pre-trained on the Dutch unshuffled OSCAR corpus using the masked language modeling (MLM) objective. The model is case sensitive and includes punctuation. Encoder only. Original paper: https://arxiv.org/abs/2106.01091
* [https://huggingface.co/clips/contact CoNTACT]: CoNTACT (Contextual Neural Transformer Adapted to COVID-19 Tweets) is a Dutch RobBERT model (pdelobelle/robbert-v2-dutch-base) adapted to the domain of COVID-19 tweets. Encoder only. Original paper: https://arxiv.org/abs/2203.07362v1
* [https://github.com/ChocoLlamaModel/ChocoLlama ChocoLlama]: A set of six Llama-2/3-based open models adapted to Dutch, trained on Dutch language materials from both Belgium and the Netherlands. Decoder only. Original paper: https://arxiv.org/html/2412.07633v1
* [https://github.com/BramVanroy/fietje-2 Fietje 2]: Fietje is a family of small open language models (SLMs) specifically designed for Dutch, based on Phi 2, an English-centric model with 2.7 billion parameters. The fietje-2b-chat variant is the one best suited for use as an assistant. Decoder only. Original paper: https://arxiv.org/abs/2412.15450
* [https://huggingface.co/Tweeties/tweety-7b-dutch-v24a Tweety-7b-dutch]: A foundation model focused on Dutch, built on the Mistral architecture and incorporating a Dutch tokenizer for better understanding and generation of Dutch text. Decoder only. Original paper: https://arxiv.org/abs/2408.04303
* [https://huggingface.co/ReBatch/Reynaerde-7B-Chat Reynaerde 7B Chat]: An open conversational model for Dutch, based on Mistral v0.3 Instruct. It is a fine-tuned version of https://huggingface.co/ReBatch/Reynaerde-7B-Instruct on https://huggingface.co/datasets/ReBatch/ultrafeedback_nl. Decoder only.
* [https://huggingface.co/robinsmits/Schaapje-2B-Chat-V1.0 Schaapje-2B-Chat-V1.0]: Schaapje is a small but powerful Dutch small language model that performs well in Dutch conversation, instruction following and RAG applications. It is based on the IBM Granite 3.0 2B Instruct model. Decoder only.
* [https://github.com/Rijgersberg/GEITje GEITje 7B]: A large open Dutch language model with 7 billion parameters, based on Mistral 7B. This model is no longer available: https://goingdutch.ai/en/posts/geitje-takedown
* [https://huggingface.co/papers/2412.04092 GEITje 7B Ultra]: A conversational model for Dutch. The GEITje model, derived from Mistral 7B, was enhanced through supervised fine-tuning on synthetic conversational datasets and preference alignment to improve its capabilities in Dutch. Decoder only. Original paper: https://arxiv.org/abs/2412.04092
* [https://huggingface.co/snoels/FinGEITje-7B-sft FinGEITje]: FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in Dutch and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance. Decoder only. Original paper: https://arxiv.org/abs/2410.12835
* [https://GPT-NL.nl GPT-NL]: A Dutch language model under development by the non-profit parties TNO, NFI and SURF, funded by the Dutch Ministry of Economic Affairs and Climate Policy. It is currently being trained; the first version is expected to be available in Q1 2026.
* [https://huggingface.co/models?search=dutch Hugging Face Dutch Models]
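Most of the models above are distributed through the Hugging Face Hub, so the encoder models can be tried out directly with the transformers library. The following is a minimal sketch, not taken from this page: it assumes transformers (with a PyTorch backend) is installed and uses the RobBERT checkpoint id cited in the CoNTACT entry; any other Dutch encoder checkpoint from the list can be substituted.

<syntaxhighlight lang="python">
# Minimal sketch (illustrative): masked-word prediction with a Dutch encoder
# model via the Hugging Face transformers pipeline API.
# Requires: pip install transformers torch
from transformers import pipeline

# Checkpoint id taken from the CoNTACT entry above; other Dutch encoder
# checkpoints can be swapped in.
unmasker = pipeline("fill-mask", model="pdelobelle/robbert-v2-dutch-base")

# RoBERTa-style models use "<mask>"; BERT-style models such as BERTje
# use "[MASK]" instead.
for prediction in unmasker("Er staat een <mask> in mijn tuin."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
</syntaxhighlight>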
==Multilingual Language Models including Dutch==

* [https://huggingface.co/CohereLabs/aya-expanse-8b Aya Expanse 8B]: An open-weight research release of a model with highly advanced multilingual capabilities. Aya Expanse 8B is an auto-regressive language model that uses an optimized transformer architecture; post-training includes supervised finetuning, preference training, and model merging. Decoder only. Original paper: https://arxiv.org/abs/2412.04261
* [https://huggingface.co/EuropeanParliament/EUBERT EUBERT]: A pretrained uncased BERT model trained on a vast corpus of documents registered by the European Publications Office. EUBERT serves as a starting point for building more specific natural language understanding models; its versatility makes it suitable for a wide range of tasks, including but not limited to text classification, question answering and language understanding. Model architecture: BERT (Bidirectional Encoder Representations from Transformers).
* [https://huggingface.co/utter-project/EuroLLM-1.7B EuroLLM-1.7B]: The EuroLLM project has the goal of creating a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. EuroLLM-1.7B is a 1.7B-parameter multilingual transformer LLM trained on 4 trillion tokens divided across the considered languages and several data sources: web data, parallel data (en-xx and xx-en), and high-quality datasets. EuroLLM-1.7B-Instruct was further instruction-tuned on EuroBlocks, an instruction-tuning dataset focused on general instruction following and machine translation. Original paper: https://arxiv.org/abs/2409.16235
* [https://huggingface.co/utter-project/EuroLLM-9B EuroLLM-9B]: The 9B-parameter multilingual transformer LLM from the same EuroLLM project, likewise trained on 4 trillion tokens across the considered languages and data sources. EuroLLM-9B-Instruct was further instruction-tuned on EuroBlocks. Original paper: https://arxiv.org/abs/2409.16235
* [https://huggingface.co/BSC-LT/salamandra-7b-instruct Salamandra-7B-instruct]: Salamandra is a highly multilingual model pre-trained from scratch that comes in three sizes (2B, 7B and 40B parameters) with their respective base and instruction-tuned variants; this entry corresponds to the 7B instructed version. Model type: transformer-based decoder-only language model pre-trained from scratch on 12.875 trillion tokens of highly curated data; the pre-training corpus contains text in 35 European languages and code. Original paper: https://arxiv.org/abs/2502.08489
* [https://github.com/tiiuae/falcon-h1 Falcon H1]: The latest evolution in the Falcon family of large language models, built on an advanced hybrid architecture in which each block integrates both State Space Models (SSMs) and attention mechanisms. Falcon-H1 was initially trained with support for 18 core languages, including Dutch, with scalability to 100+ languages, achieving state-of-the-art multilingual and reasoning performance in instruction following, maths, coding, and multilingual tasks. Original paper: https://arxiv.org/abs/2507.22448
* [https://neo-babel.github.io/ NeoBabel]: A novel multilingual image generation framework, trained using a combination of large-scale multilingual pretraining and high-resolution instruction tuning. Original paper: https://arxiv.org/abs/2507.06137v1
* [https://huggingface.co/lightonai/alfred-40b-1023 Alfred-40b-1023]: Can be used as a chat model or as an instruct model; it has limited capabilities in Dutch. Model type: causal decoder only.
* [https://huggingface.co/docs/transformers/model_doc/mbart MBart]: Multilingual denoising pre-training for neural machine translation.
* [https://huggingface.co/docs/transformers/v4.14.1/model_doc/mt5 mT5]: A massively multilingual pre-trained text-to-text transformer.
* [https://huggingface.co/docs/transformers/model_doc/nllb NLLB]: No Language Left Behind.
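To illustrate how one of the multilingual models above can be applied to Dutch, the following minimal sketch runs English-to-Dutch translation with an NLLB checkpoint through the transformers translation pipeline. The checkpoint id facebook/nllb-200-distilled-600M and the example sentence are illustrative choices, not taken from this page.

<syntaxhighlight lang="python">
# Minimal sketch (illustrative): English-to-Dutch translation with an NLLB
# checkpoint via the transformers translation pipeline.
# Requires: pip install transformers sentencepiece torch
from transformers import pipeline

# NLLB identifies languages by FLORES-200 codes:
# eng_Latn = English, nld_Latn = Dutch.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="nld_Latn",
)

result = translator("No language should be left behind.")
print(result[0]["translation_text"])
</syntaxhighlight>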
==spaCy==

spaCy is a free open-source library for Natural Language Processing in Python.

* [https://spacy.io/models/nl Dutch models]
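A minimal usage sketch, assuming the small Dutch pipeline nl_core_news_sm has been installed; the example sentence is an illustrative choice.

<syntaxhighlight lang="python">
# Minimal sketch: part-of-speech tagging and named-entity recognition with
# spaCy's small Dutch pipeline. Install the model first with:
#   python -m spacy download nl_core_news_sm
import spacy

nlp = spacy.load("nl_core_news_sm")
doc = nlp("De Koninklijke Bibliotheek in Den Haag bewaart miljoenen boeken.")

# Token-level annotations: text, part of speech, dependency relation.
for token in doc:
    print(token.text, token.pos_, token.dep_)

# Named entities found in the sentence.
for ent in doc.ents:
    print(ent.text, ent.label_)
</syntaxhighlight>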
== Language Modeling Benchmarks ==

===DUMB===

DUMB is a benchmark for evaluating the quality of language models on Dutch NLP tasks. The set of tasks is designed to be diverse and challenging, to test the limits of current language models. The specific datasets and formats are particularly suitable for fine-tuning encoder models; applicability to large generative models is yet to be determined. Original paper: https://arxiv.org/abs/2305.13026

* [https://dumbench.nl/ DUMB]

===European LLM Leaderboard===

A collection of multilingual evaluation results obtained with a fork of the LM-evaluation-harness (https://github.com/OpenGPTX/lm-evaluation-harness), based on V1 of the https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard. Currently, benchmarks are available in 21 European languages (Irish, Maltese and Croatian are missing).

* [https://huggingface.co/spaces/Eurolingua/european-llm-leaderboard European LLM Leaderboard]
* Open Dutch LLM Leaderboard: discontinued.

===EuroEval monolingual Dutch and multilingual Germanic===

* [https://euroeval.com/leaderboards/Monolingual/dutch/ EuroEval Monolingual Dutch]: This leaderboard includes not only monolingual Dutch models, but also monolingual models in other languages such as English and German, Scandinavian models, and multilingual models in which Dutch is not necessarily included. Note: this leaderboard "replaces" scandeval.com/dutch-nlg/.
* [https://euroeval.com/leaderboards/Multilingual/germanic/ EuroEval multilingual Germanic]: This leaderboard covers Danish, Dutch, English, Faroese, German, Icelandic, Norwegian and Swedish.
==n-gram modeling==

Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, of either fixed or dynamic size) in a quick and memory-efficient way. At its core is the tool colibri-patternmodeller, which allows you to build, view, manipulate and query pattern models. A toy illustration of these pattern types is given after the link below.

* [http://proycon.github.io/colibri-core/ Project website]
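To make the terminology concrete, the sketch below extracts n-grams and fixed-size skipgrams from a tokenized sentence in plain Python. This only illustrates the constructions Colibri Core indexes; it is not the Colibri Core API, and for real corpora the colibri-patternmodeller tool or the library's own bindings should be used, as they are far more memory-efficient.

<syntaxhighlight lang="python">
# Toy illustration (not the Colibri Core API): contiguous n-grams and
# skipgrams, i.e. patterns with a fixed-size gap marked "{*}".
def ngrams(tokens, n):
    """All contiguous n-grams of length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, gap_size=1):
    """Patterns with n content tokens and one gap of gap_size tokens,
    e.g. ('de', '{*}', 'zit') for n=2, gap_size=1."""
    result = []
    for window in ngrams(tokens, n + gap_size):
        for gap_start in range(1, n):  # the gap may not touch the edges
            result.append(window[:gap_start]
                          + ("{*}",) * gap_size
                          + window[gap_start + gap_size:])
    return result

tokens = "de kat zit op de mat".split()
print(ngrams(tokens, 2))     # bigrams: ('de', 'kat'), ('kat', 'zit'), ...
print(skipgrams(tokens, 2))  # skipgrams: ('de', '{*}', 'zit'), ...
</syntaxhighlight>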