Embeddings: Difference between revisions

From Clarin K-Centre
==Word2Vec embeddings==
This repository contains the word embeddings described in ''Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource'', presented at LREC 2016.
* [https://github.com/clips/dutchembeddings Download page]
 
 
==BERT embeddings==
*[https://arxiv.org/abs/1912.09582 BERTje]
*[https://people.cs.kuleuven.be/~pieter.delobelle/robbert/ RobBERT]
 
==GeenStijl.nl embeddings==
The GeenStijl.nl embeddings were trained on over 8M messages from the controversial Dutch websites GeenStijl and Dumpert, producing a word embedding model that captures the toxic language representations contained in that dataset. The trained word embeddings (approximately 150 MB) are freely available and may be useful for further study of toxic online discourse.
 
*[https://www.textgain.com/portfolio/geenstijl-embeddings/ Project page]
*[https://www.textgain.com/wp-content/uploads/2021/06/TGTR4-geenstijl.pdf Report]
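
As a rough illustration of how static word embeddings such as those listed above are typically queried, the sketch below parses vectors in the common word2vec text format (a "count dimension" header followed by one word and its values per line) and ranks neighbours by cosine similarity. Whether the released files use this exact format is an assumption; the function names and the four toy vectors are invented for illustration and are not taken from any of the resources on this page.

```python
import io
import math


def load_word2vec_text(handle):
    """Parse word2vec text format: a 'count dim' header, then 'word v1 ... vd' per line."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to the given word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors invented for illustration only (not real embedding values).
TOY = "4 3\nkat 0.9 0.1 0.0\nhond 0.8 0.2 0.1\nfiets 0.0 0.1 0.9\nauto 0.1 0.0 0.8\n"
toy_vectors = load_word2vec_text(io.StringIO(TOY))
nearest = most_similar(toy_vectors, "kat", topn=1)
```

For a real embedding file, the same loader could be given an opened file handle instead of the in-memory string; for large vocabularies a dedicated library would likely be a better choice than this pure-Python loop.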

Revision as of 09:23, 4 March 2022
