Embeddings

<div lang="en" dir="ltr" class="mw-content-ltr">
Voor Large Language Models (LLM), verwijzen wij naar [[Taalmodellering]].
For Large Language Models, we refer to [[Language_Modeling]].
</div>


<div lang="en" dir="ltr" class="mw-content-ltr">
== Word2Vec embeddings==
== Word2Vec embeddings==
</div>


<div lang="en" dir="ltr" class="mw-content-ltr">
Opslagplaats voor de word embeddings die zijn beschreven in het paper 'Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource', dat werd gepresenteerd bij LREC in 2016.
Repository for the word embeddings described in Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource, presented at LREC 2016.
* [https://github.com/clips/dutchembeddings Download pagina]
* [https://github.com/clips/dutchembeddings Download page]
</div>
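These models can be loaded with the gensim library. A minimal sketch, assuming the downloaded archive contains a model in word2vec text format (the file name below is an assumption; check the repository's README for the actual names):

<syntaxhighlight lang="python">
# Minimal sketch: loading Dutch word2vec-format embeddings with gensim.
# "combined-320.txt" is an assumed file name; substitute the model file
# actually shipped in the downloaded archive.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("combined-320.txt", binary=False)
# Nearest neighbours of "koning" (king) in the embedding space.
print(kv.most_similar("koning", topn=5))
</syntaxhighlight>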


<div lang="en" dir="ltr" class="mw-content-ltr">
== FastText embeddings==
== FastText embeddings==
Word vectors in 157 languages trained on CommonCrawl and Wikipedia corpora.
Woord-vectors in 157 talen, getraind op CommonCrawl en Wikipedia-corpora.
* [https://fasttext.cc/docs/en/crawl-vectors.html Download page]
* [https://fasttext.cc/docs/en/crawl-vectors.html Download page]
</div>
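The Dutch model is distributed both as a full binary model and as plain-text vectors. A minimal sketch using the fasttext Python package; since fastText composes word vectors from character n-grams, it can also produce vectors for words that were never seen during training:

<syntaxhighlight lang="python">
# Minimal sketch: querying the pre-trained Dutch fastText model.
# Assumes the binary Dutch model (cc.nl.300.bin on the download page)
# has been downloaded and decompressed.
import fasttext

ft = fasttext.load_model("cc.nl.300.bin")
vec = ft.get_word_vector("fietsenmaker")   # subword information handles rare/OOV words
print(vec.shape)                           # (300,)
print(ft.get_nearest_neighbors("fiets", k=5))
</syntaxhighlight>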


<div lang="en" dir="ltr" class="mw-content-ltr">
<div lang="en" dir="ltr" class="mw-content-ltr">

Revision as of 12:41, 26 March 2024

Voor Large Language Models (LLM), verwijzen wij naar Taalmodellering.

Word2Vec embeddings

Opslagplaats voor de word embeddings die zijn beschreven in het paper 'Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource', dat werd gepresenteerd bij LREC in 2016.

FastText embeddings

Woord-vectors in 157 talen, getraind op CommonCrawl en Wikipedia-corpora.

== Coosto embeddings ==

This repository contains a Word2Vec model trained on a large Dutch corpus comprising social media messages and posts from Dutch news sites, blogs and forums.
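Once loaded into gensim, such a Word2Vec model supports the usual similarity and analogy queries. A minimal sketch, with a hypothetical file name:

<syntaxhighlight lang="python">
# Minimal sketch: an analogy query on a Dutch Word2Vec model.
# "model.bin" is a hypothetical file name; substitute the model file
# released with the repository.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("model.bin", binary=True)
# koning - man + vrouw ~ koningin (king - man + woman ~ queen)
print(kv.most_similar(positive=["koning", "vrouw"], negative=["man"], topn=3))
</syntaxhighlight>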

== GeenStijl.nl embeddings ==

The GeenStijl.nl embeddings were trained on over 8 million messages from the controversial Dutch websites GeenStijl and Dumpert, yielding a word embedding model that captures the toxic language representations in the dataset. The trained word embeddings (±150 MB) are released for free and may be useful for further study of toxic online discourse.

== NLPL Word Embeddings Repository ==

Made by the University of Oslo. The models are trained with clearly stated hyperparameters, on clearly described and linguistically pre-processed corpora.

For Dutch, Word2Vec and ELMo embeddings are available.
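The word2vec models from this repository can again be loaded with gensim. A minimal sketch, assuming the downloaded archive unpacks to a binary file named model.bin (the exact layout can differ per model; the repository's metadata describes each one):

<syntaxhighlight lang="python">
# Minimal sketch: loading a word2vec model from the NLPL repository.
# Assumes the downloaded zip has been unpacked and contains a binary
# word2vec file named model.bin (layout may differ per model).
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("model.bin", binary=True)
print(kv.similarity("hond", "kat"))  # cosine similarity of "hond" (dog) and "kat" (cat)
</syntaxhighlight>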