Social media corpora: Difference between revisions

From Clarin K-Centre
Line 3: Line 3:


* [https://github.com/tommasoc80/DALC Github]
* Publication: Caselli, Tommaso, Schelhaas, Arjan, Weultjes, Marieke, Leistra, Folkert, van der Veen, Hylke, Timmerman, Gerben and Nissim, Malvina (2021). DALC: the Dutch Abusive Language Corpus. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics
* Publication: '''Caselli, Tommaso, Schelhaas, Arjan, Weultjes, Marieke, Leistra, Folkert, van der Veen, Hylke, Timmerman, Gerben and Nissim, Malvina''' (2021). [https://aclanthology.org/2021.woah-1.6/ DALC: the Dutch Abusive Language Corpus.] Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics


==SoNaR New Media Corpus==

Revision as of 11:25, 20 January 2022

DALC Dutch Abusive Language Corpus

The Dutch Abusive Language Corpus v1.0 (DALC v1.0)

  • GitHub
  • Publication: Caselli, Tommaso, Schelhaas, Arjan, Weultjes, Marieke, Leistra, Folkert, van der Veen, Hylke, Timmerman, Gerben and Nissim, Malvina (2021). DALC: the Dutch Abusive Language Corpus. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics.

SoNaR New Media Corpus

The SoNaR New Media Corpus 1.0 contains new media texts collected within the STEVIN project SoNaR. The corpus contains text messages, tweets and chat messages. The texts were tokenized, POS-tagged and lemmatized.

WhatsApp corpus Verheijen

WhatsApp data collected for the PhD research of Lieke Verheijen (Radboud University). Informed consent was obtained only from the contributors, not from their conversational partners. Consequently, the subcorpus contains only the messages written by the contributors themselves.