Translations:Language modeling/12/en: Difference between revisions

From Clarin K-Centre
Revision as of 16:57, 11 June 2024

* [https://github.com/wietsedv/bertje BERTje]: A Dutch pre-trained BERT model developed at the University of Groningen. Compared to the multilingual BERT model, which includes Dutch but is trained only on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. The vocabulary of BERTje was updated in 2021. Encoder only. Original paper: http://arxiv.org/abs/1912.09582
* [https://pieter.ai/robbert/ RobBERT]: A RoBERTa-based state-of-the-art Dutch language model, which was released in 2020. Encoder only. Original paper: https://arxiv.org/abs/2001.06286
* [https://pieter.ai/robbert-2022/ RobBERT-2022]: An update of the RobBERT Dutch language model that includes new high-frequency tokens present in the latest Dutch OSCAR corpus from 2022. The RobBERT model was then pre-trained using this dataset. This model is a plug-in replacement for RobBERT and yields a significant performance increase on certain language tasks. Encoder only. Original paper: https://arxiv.org/abs/2211.08192v1
* [https://pieter.ai/robbert-2023/ RobBERT-2023]: A version of RobBERT with a freshly trained Dutch tokenizer, pre-trained on the latest version of the Dutch OSCAR corpus. This corpus incorporates new high-frequency terms, such as those related to the COVID-19 pandemic, cryptocurrencies, and the energy crisis. Unlike prior versions of RobBERT, which relied on the training methodology of RoBERTa but required a fresh weight initialization, RobBERT-2023 is entirely initialized from the RoBERTa-large model. Encoder only. Original paper: https://www.clinjournal.org/clinj/article/view/180 Related paper: https://arxiv.org/pdf/2310.03477
* [https://github.com/iPieter/robbertje RobBERTje]: A collection of distilled versions of the state-of-the-art Dutch RobBERT model. There are multiple models with different sizes and different training settings. Encoder only. Original paper: https://arxiv.org/abs/2204.13511v1
* [https://huggingface.co/CLTL/MedRoBERTa.nl MedRoBERTa.nl]: One of only two publicly available encoder models worldwide pre-trained on free text from real-world hospital data. The anonymized model allows NLP researchers and medical professionals to build medical text-mining technology on a solid base model: it can be fine-tuned for any task. Encoder only. Original paper: https://clinjournal.org/clinj/article/view/132
* [https://github.com/Joppewouts/belabBERT belabBERT]: A Dutch RoBERTa-based language model applied to psychiatric classification, pre-trained on the Dutch unshuffled OSCAR corpus using the masked language modeling (MLM) objective. The model is case-sensitive and includes punctuation. Encoder only. Original paper: https://arxiv.org/abs/2106.01091
* [https://huggingface.co/clips/contact CoNTACT]: CoNTACT (Contextual Neural Transformer Adapted to COVID-19 Tweets) is a Dutch RobBERT model (pdelobelle/robbert-v2-dutch-base) adapted to the domain of COVID-19 tweets. Encoder only. Original paper: https://arxiv.org/abs/2203.07362v1
* [https://github.com/ChocoLlamaModel/ChocoLlama ChocoLlama]: A set of six Llama-2/3-based open models adapted to Dutch, trained on Dutch language materials from both Belgium and the Netherlands. Decoder only. Original paper: https://arxiv.org/html/2412.07633v1
*[https://github.com/BramVanroy/fietje-2 Fietje 2]: Fietje is a family of small open language models (SLMs) specifically designed for the Dutch language. The model is based on Phi 2, an English-centric model with 2.7 billion parameters. The fietje-2b-chat variant is best suited for use as an assistant. Decoder only. Original paper: https://arxiv.org/abs/2412.15450
*[https://huggingface.co/Tweeties/tweety-7b-dutch-v24a Tweety-7b-dutch]: A foundation model with a focus on the Dutch language, incorporating a Dutch tokenizer for better understanding and generation of Dutch text. It is built on the Mistral architecture. Decoder only. Original paper: https://arxiv.org/abs/2408.04303
*[https://huggingface.co/ReBatch/Reynaerde-7B-Chat Reynaerde 7B Chat]: An open conversational model for Dutch, based on Mistral v0.3 Instruct. It is a fine-tuned version of https://huggingface.co/ReBatch/Reynaerde-7B-Instruct trained on the https://huggingface.co/datasets/ReBatch/ultrafeedback_nl dataset. Decoder only.
* [https://huggingface.co/robinsmits/Schaapje-2B-Chat-V1.0 Schaapje-2B-Chat-V1.0]: Schaapje is a small but powerful Dutch Small Language Model. It performs well in Dutch conversation, instruction following, and RAG applications. It is based on the IBM Granite 3.0 2B Instruct model. Decoder only.
* [https://github.com/Rijgersberg/GEITje GEITje 7B]: A large open Dutch language model with 7 billion parameters, based on Mistral 7B. This model is no longer available: https://goingdutch.ai/en/posts/geitje-takedown
* [https://huggingface.co/papers/2412.04092 GEITje 7B Ultra]: A conversational model for Dutch. The GEITje model, derived from Mistral 7B, was enhanced through supervised fine-tuning on synthetic conversational datasets and preference alignment to improve its capabilities in Dutch. Decoder only. Original paper: https://arxiv.org/abs/2412.04092
* [https://huggingface.co/snoels/FinGEITje-7B-sft FinGEITje]: FinGEITje 7B is a large open Dutch financial language model with 7 billion parameters, based on Mistral 7B. It has been further trained on Dutch financial texts, enhancing its proficiency in the Dutch language and its knowledge of financial topics. As a result, FinGEITje provides more accurate and relevant responses in the domain of finance. Decoder only. Original paper: https://arxiv.org/abs/2410.12835
*[https://GPT-NL.nl GPT-NL]: A Dutch language model currently being developed by the non-profit parties TNO, NFI, and SURF, funded by the Dutch Ministry of Economic Affairs and Climate Policy. It is currently being trained, and the first version is expected to be available in Q1 2026.
* [https://huggingface.co/models?search=dutch Hugging Face Dutch Models]
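Most of the models listed above can be loaded directly from the Hugging Face Hub with the <code>transformers</code> library. As a minimal sketch, the snippet below queries BERTje (encoder only, so it predicts masked tokens rather than generating free text); it assumes <code>transformers</code> with a PyTorch backend is installed, and that <code>GroNLP/bert-base-dutch-cased</code> is BERTje's Hub id.

```python
# Minimal sketch: masked-token prediction with BERTje via the
# Hugging Face `transformers` fill-mask pipeline. Assumes the
# `transformers` library and a backend such as PyTorch are
# installed; "GroNLP/bert-base-dutch-cased" is assumed to be
# BERTje's model id on the Hugging Face Hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="GroNLP/bert-base-dutch-cased")

# The pipeline returns the top candidate tokens for the [MASK]
# slot, each with a token string and a probability score.
for p in fill_mask("Amsterdam is de [MASK] van Nederland."):
    print(f'{p["token_str"]}: {p["score"]:.3f}')
```

Decoder-only models such as GEITje 7B Ultra or Fietje 2 would instead use the <code>"text-generation"</code> pipeline task with their respective Hub ids.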