Embeddings: Difference between revisions

From Clarin K-Centre
<languages/>
<translate>
<translate>
<!--T:1-->
For Large Language Models, see [[Language_Modeling]].


== Word2Vec embeddings== <!--T:2-->


<!--T:3-->
Repository for the word embeddings described in ''Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource'', presented at LREC 2016.
* [https://github.com/clips/dutchembeddings Download page]


<!--T:4-->
== FastText embeddings==
Word vectors in 157 languages trained on CommonCrawl and Wikipedia corpora.
* [https://fasttext.cc/docs/en/crawl-vectors.html Download page]
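
The fastText vectors are distributed, among other formats, as plain-text <code>.vec</code> files: a header line with the vocabulary size and the dimension, followed by one word and its vector per line. The sketch below shows one way to read such a file and compare two words with cosine similarity, using only the Python standard library. The helper names <code>load_vec</code> and <code>cosine</code> and the three-word toy sample are illustrative assumptions, not part of any of the listed repositories.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read word vectors in the word2vec/fastText text format.

    The first line holds "<vocabulary size> <dimension>"; each following
    line holds a word and its vector components, separated by spaces.
    `limit` optionally stops after that many words to save memory.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the declared dimension
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return dim, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy sample standing in for a real .vec file (hypothetical values).
SAMPLE = "3 2\nkat 1.0 0.0\nhond 0.8 0.6\nfiets 0.0 1.0\n"

dim, vectors = load_vec(io.StringIO(SAMPLE))
similarity = cosine(vectors["kat"], vectors["hond"])
```

For a real download, the <code>io.StringIO</code> object would be replaced by a decompressed file opened with <code>open(path, encoding="utf-8")</code>, where <code>path</code> is a placeholder for the local file. Because the full vocabularies are large, passing a <code>limit</code> keeps memory use manageable; libraries such as gensim offer faster loaders for the same format.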


<!--T:5-->
==Coosto embeddings==
This repository contains a Word2Vec model trained on a large Dutch corpus consisting of social media messages and posts from Dutch news sites, blogs and forums.


<!--T:6-->
* [https://github.com/coosto/dutch-word-embeddings GitHub page]


<!--T:7-->
==GeenStijl.nl embeddings ==
The GeenStijl.nl embeddings were trained on over 8M messages from the controversial Dutch websites GeenStijl and Dumpert, resulting in a word embedding model that captures the toxic language representations contained in the dataset. The trained word embeddings (±150MB) are released for free and may be useful for further study of toxic online discourse.


<!--T:8-->
*[https://www.textgain.com/portfolio/geenstijl-embeddings/ Project page]
*[https://www.textgain.com/wp-content/uploads/2021/06/TGTR4-geenstijl.pdf Report]
*[https://www.textgain.com/projects/geenstijl/geenstijl_embeddings.zip Download page]


<!--T:9-->
==NLPL Word Embeddings Repository==
Made by the University of Oslo. The models are trained with clearly stated hyperparameters on clearly described and linguistically pre-processed corpora.


<!--T:10-->
For Dutch, Word2Vec and ELMo embeddings are available.


<!--T:11-->
*[http://vectors.nlpl.eu/repository/ Repository page]
</translate>
</translate>

Latest revision as of 08:18, 18 April 2024
