Social media corpora

From Clarin K-Centre


Revision as of 14:47, 15 November 2021

DALC Dutch Abusive Language Corpus

The Dutch Abusive Language Corpus v1.0 (DALC v1.0)

  • Github
  • Publication: Caselli, Tommaso, Schelhaas, Arjan, Weultjes, Marieke, Leistra, Folkert, van der Veen, Hylke, Timmerman, Gerben and Nissim, Malvina (2021). DALC: the Dutch Abusive Language Corpus. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics.

SoNaR New Media Corpus

The SoNaR New Media Corpus 1.0 contains new media texts (text messages, tweets and chat messages) collected within the STEVIN project SoNaR. The texts were tokenized, POS-tagged and lemmatized.

  • Download page: http://hdl.handle.net/10032/tm-a2-k3

WhatsApp corpus Verheijen

WhatsApp data collected for the PhD research of Lieke Verheijen (Radboud University). Informed consent was obtained only from the contributor, not from the conversation partners. Consequently, the subcorpus contains only the contributor's own messages.

  • Project website: https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:112987