Embeddings/nl: Difference between revisions

Revision as of 12:47, 26 March 2024

For Large Language Models (LLMs), see the Language modeling page (Taalmodellering).

Word2Vec embeddings

Repository of the word embeddings described in the paper 'Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource', presented at LREC 2016.

Download page

FastText embeddings

Word vectors for 157 languages, trained on Common Crawl and Wikipedia corpora.

Download page
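Pre-trained vectors such as these are commonly distributed in a plain-text layout: a header line with the vocabulary size and the dimension, then one line per word with its vector components. The sketch below assumes that layout and shows one way to read such a file and compare two words by cosine similarity. The function names `load_text_vectors` and `cosine` and the three-word sample are made up for illustration; they are not part of any of the resources listed here.

```python
import io
import math


def load_text_vectors(handle, limit=None):
    """Parse word vectors from a text stream.

    Assumed layout: a header line "<vocab_size> <dim>", then one line per
    word: the token followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # optionally stop early; real files can be very large
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Tiny made-up sample standing in for a real downloaded file.
sample = io.StringIO(
    "3 2\n"
    "koning 1.0 0.0\n"
    "koningin 0.8 0.6\n"
    "fiets 0.0 1.0\n"
)
vectors = load_text_vectors(sample)
similarity = cosine(vectors["koning"], vectors["koningin"])
```

For a real download, the same function could be given a file opened with `open(path, encoding="utf-8")`, or `gzip.open(path, "rt", encoding="utf-8")` for a compressed file, where `path` is wherever the file was saved. Passing a `limit` keeps memory use manageable when only the most frequent words are needed.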

Coosto embeddings

This repository contains a Word2Vec model trained on a large Dutch corpus, comprising social media messages and posts from Dutch news sites, blogs and forums.

GitHub page

GeenStijl.nl embeddings

The GeenStijl.nl embeddings were trained on over 8M messages from the controversial Dutch websites GeenStijl and Dumpert, so the resulting word embedding model captures the toxic language found in that dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study of toxic online discourse.

NLPL Word Embeddings Repository

Maintained by the University of Oslo. The models are trained with clearly stated hyperparameters, on clearly described and linguistically pre-processed corpora.

For Dutch, Word2Vec and ELMo embeddings are available.

Repository page

@@ Line 3: / Line 3: @@
 == Word2Vec embeddings==
-Opslagplaats voor de word embeddings die zijn beschreven in het paper 'Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource', dat werd gepresenteerd bij LREC in 2016.
+Database van de word embeddings die zijn beschreven in het paper 'Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource', dat werd gepresenteerd bij LREC in 2016.
 * [https://github.com/clips/dutchembeddings Download pagina]
@@ Line 10: / Line 10: @@
 * [https://fasttext.cc/docs/en/crawl-vectors.html Download page]
-<div lang="en" dir="ltr" class="mw-content-ltr">
 ==Coosto embeddings==
-This repository contains a Word2Vec model trained on a large Dutch corpus, comprised of social media messages and posts from Dutch news, blog and fora.
+Deze database bevat een Word2Vec-model dat is getraind op een groot Nederlands corpus, bestaande uit social-media berichten en posts van Nederlands nieuws, blogs en fora.
-</div>
-<div lang="en" dir="ltr" class="mw-content-ltr">
+* [https://github.com/coosto/dutch-word-embeddings Github pagina]
-* [https://github.com/coosto/dutch-word-embeddings Github page]
-</div>
 <div lang="en" dir="ltr" class="mw-content-ltr">
