Embeddings
== Word2Vec embeddings ==
Repository for the word embeddings described in ''Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource'', presented at LREC 2016.
* [https://github.com/clips/dutchembeddings Download page]
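Assuming the downloaded archive extracts to vectors in the standard word2vec text format, they can be loaded with gensim. A minimal sketch (the file name is illustrative; the repository ships several variants per corpus and dimensionality):

<syntaxhighlight lang="python">
from gensim.models import KeyedVectors

# File name is illustrative: substitute the variant you actually
# downloaded from the clips/dutchembeddings repository.
vectors = KeyedVectors.load_word2vec_format("combined-320.txt", binary=False)

# Nearest neighbours of a Dutch word in the embedding space.
print(vectors.most_similar("fiets", topn=5))
</syntaxhighlight>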
 
== FastText embeddings ==
Word vectors in 157 languages trained on Common Crawl and Wikipedia corpora.
* [https://fasttext.cc/docs/en/crawl-vectors.html Download page]
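These vectors can be loaded with gensim or with the official fasttext Python package; the latter also exposes the subword model, so out-of-vocabulary words still receive vectors. A minimal sketch, assuming the Dutch model file cc.nl.300.bin from the download page:

<syntaxhighlight lang="python">
import fasttext
import fasttext.util

# Fetches cc.nl.300.bin into the working directory (skipped if present).
fasttext.util.download_model('nl', if_exists='ignore')
ft = fasttext.load_model('cc.nl.300.bin')

# Subword information yields vectors even for unseen compounds.
print(ft.get_word_vector('fietsenstalling')[:5])
print(ft.get_nearest_neighbors('fiets', k=5))
</syntaxhighlight>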
 
== BERT embeddings ==
Pre-trained Dutch transformer models that produce contextual word embeddings:
* [https://arxiv.org/abs/1912.09582 BERTje]
* [https://people.cs.kuleuven.be/~pieter.delobelle/robbert/ RobBERT]
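Both are available through the Hugging Face transformers library. A minimal sketch, assuming the hub IDs published by the two projects (GroNLP/bert-base-dutch-cased for BERTje, pdelobelle/robbert-v2-dutch-base for RobBERT):

<syntaxhighlight lang="python">
import torch
from transformers import AutoModel, AutoTokenizer

# Hub ID for BERTje; swap in "pdelobelle/robbert-v2-dutch-base" for RobBERT.
model_id = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Dit is een voorbeeldzin.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: shape (batch, tokens, hidden_size).
print(outputs.last_hidden_state.shape)
</syntaxhighlight>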
 
== GeenStijl.nl embeddings ==
The GeenStijl.nl embeddings were trained on over 8M messages from the controversial Dutch websites GeenStijl and Dumpert, yielding a word embedding model that captures the toxic language representations contained in the dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study of toxic online discourse.
 
* [https://www.textgain.com/portfolio/geenstijl-embeddings/ Project page]
* [https://www.textgain.com/wp-content/uploads/2021/06/TGTR4-geenstijl.pdf Report]
* [https://www.textgain.com/projects/geenstijl/geenstijl_embeddings.zip Download page]
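As with the other static embeddings above, these vectors should be loadable with gensim. A sketch under the assumption that the zip extracts to a file in word2vec format (the file name here is hypothetical; check the extracted archive):

<syntaxhighlight lang="python">
from gensim.models import KeyedVectors

# Assumption: the download extracts to word2vec-format vectors;
# "geenstijl_embeddings.txt" is a hypothetical file name.
toxic = KeyedVectors.load_word2vec_format("geenstijl_embeddings.txt", binary=False)

# Probe how a neutral word is used in this toxic register.
print(toxic.most_similar("media", topn=5))
</syntaxhighlight>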
