Jump to navigation Jump to search
Repository for the word embeddings described in Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource, presented at LREC 2016.
Word vectors in 157 languages trained on CommonCrawl and Wikipedia corpora.
GeenStijl.nl embeddings contains over 8M messages from the controversial Dutch websites GeenStijl and Dumpert to train a word embedding model that captures the toxic language representations contained in the dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study on toxic online discourse.