Embeddings

Word2Vec embeddings

Repository for the word embeddings described in the paper "Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource", presented at LREC 2016.

FastText embeddings

Pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia corpora.
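
Pre-trained word vectors of this kind are commonly distributed in a plain-text format: a header line with the vocabulary size and the dimensionality, followed by one line per word with the word and its vector components. The following is a minimal sketch of reading such a file and comparing two words by cosine similarity, using only the Python standard library. The helper names `load_vec` and `cosine` and the three toy vectors in `SAMPLE` are illustrative assumptions, not part of any of the resources listed here.

```python
import io
import math


def load_vec(handle, limit=None):
    """Parse a text-format vector file: a 'count dim' header, then 'word v1 ... vd' per line."""
    header = handle.readline().split()
    dim = int(header[1])  # header[0] is the declared vocabulary size
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early so large files need not be read in full
    return dim, vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data standing in for a real vector file (made-up values, 2 dimensions).
SAMPLE = "3 2\nkat 1.0 0.0\nhond 0.8 0.6\nfiets 0.0 1.0\n"

dim, vecs = load_vec(io.StringIO(SAMPLE))
print(dim, len(vecs))
print(cosine(vecs["kat"], vecs["hond"]))
```

For a real file, the same function can be given a handle from `open(path, encoding="utf-8")`, with `limit` set to keep memory use manageable; libraries such as gensim offer faster loaders for the same format.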

BERT embeddings

GeenStijl.nl embeddings

The GeenStijl.nl embeddings were trained on over 8M messages from the controversial Dutch websites GeenStijl and Dumpert, yielding a word embedding model that captures the toxic language representations contained in the dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study of toxic online discourse.