Embeddings
Revision as of 09:05, 31 October 2023
==Word2Vec embeddings==
Repository for the word embeddings described in ''Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource'', presented at LREC 2016.
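Word embeddings such as these are usually compared with cosine similarity. The sketch below uses made-up 4-dimensional vectors for a few Dutch words purely for illustration; real models like the LREC 2016 release use hundreds of dimensions.

```python
import math

# Toy stand-ins for real word vectors (illustrative values only).
embeddings = {
    "koning":   [0.90, 0.10, 0.40, 0.20],  # "king"
    "koningin": [0.85, 0.15, 0.45, 0.25],  # "queen"
    "fiets":    [0.10, 0.90, 0.20, 0.70],  # "bicycle"
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_royal = cosine(embeddings["koning"], embeddings["koningin"])
sim_other = cosine(embeddings["koning"], embeddings["fiets"])
print(sim_royal > sim_other)  # semantically related words score higher
```

With a real model the dictionary lookup is replaced by the loaded vector table, but the comparison works the same way.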
==FastText embeddings==
Word vectors in 157 languages trained on CommonCrawl and Wikipedia corpora.
* [https://fasttext.cc/docs/en/crawl-vectors.html Download page]
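What distinguishes fastText from plain word2vec is that a word's vector is built from its character n-grams (by default lengths 3 to 6, with the word wrapped in "<" and ">"), so out-of-vocabulary words still receive a vector. A minimal sketch of the n-gram extraction step; the full word itself is handled separately in fastText's dictionary:

```python
def char_ngrams(word, nmin=3, nmax=6):
    # fastText wraps the word in "<" and ">" so that prefixes and
    # suffixes are distinguishable from word-internal n-grams.
    w = f"<{word}>"
    grams = []
    for n in range(nmin, nmax + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("fiets"))  # includes "<fi", "iet", "ts>", ...
```

The word vector is then the sum of the vectors of these n-grams, which is why misspellings and rare compounds still land near related words.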
==Coosto embeddings==
* [https://github.com/coosto/dutch-word-embeddings Github page]
==BERT embeddings==
==GeenStijl.nl embeddings==
The GeenStijl.nl embeddings were trained on over 8M messages from the controversial Dutch websites GeenStijl and Dumpert, yielding a word embedding model that captures the toxic language representations contained in the dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study of toxic online discourse.
==NLPL Word Embeddings Repository==
Made by the University of Oslo. Models are trained with clearly stated hyperparameters, on clearly described and linguistically pre-processed corpora.
For Dutch, Word2Vec and ELMo embeddings are available.
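Most of the static models above are distributed in the word2vec text format: a header line with the vocabulary size and dimensionality, then one word and its vector per line. A minimal parser, shown here on an in-memory sample rather than a real downloaded model (the words and values are illustrative):

```python
import io

# Illustrative stand-in for a downloaded model file in word2vec
# text format: "vocab_size dim" header, then "word v1 v2 ..." lines.
sample = io.StringIO(
    "3 4\n"
    "koning 0.9 0.1 0.4 0.2\n"
    "koningin 0.85 0.15 0.45 0.25\n"
    "fiets 0.1 0.9 0.2 0.7\n"
)

def load_word2vec_text(fh):
    vocab_size, dim = map(int, fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        assert len(values) == dim, f"bad vector length for {word!r}"
        vectors[word] = values
    assert len(vectors) == vocab_size, "header/vocabulary mismatch"
    return vectors

vecs = load_word2vec_text(sample)
print(sorted(vecs))  # ['fiets', 'koning', 'koningin']
```

In practice one would pass an open file handle to the extracted model file instead; libraries such as gensim also ship readers for this format.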