Social media corpora
<languages/>
<translate>
<!--T:1-->
==DALC Dutch Abusive Language Corpus==
The Dutch Abusive Language Corpus v1.0 (DALC v1.0) is a dataset of Dutch tweets manually annotated for abusive language.
<!--T:2-->
* [https://github.com/tommasoc80/DALC GitHub]
* [https://dataverse.nl/dataset.xhtml?persistentId=doi%3A10.34894%2FHOINL3 Website]
* Publication: '''Caselli, Tommaso, Schelhaas, Arjan, Weultjes, Marieke, Leistra, Folkert, van der Veen, Hylke, Timmerman, Gerben and Nissim, Malvina''' (2021). [https://aclanthology.org/2021.woah-1.6/ DALC: the Dutch Abusive Language Corpus.] Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH). Association for Computational Linguistics
<!--T:3-->
==SoNaR New Media Corpus==
The SoNaR New Media Corpus 1.0 contains new media texts collected within the STEVIN project SoNaR. The corpus contains text messages, tweets and chat messages. The texts were tokenized, POS-tagged and lemmatized.
<!--T:4-->
* [http://hdl.handle.net/10032/tm-a2-k3 Download page]
<!--T:5-->
==WhatsApp corpus Verheijen==
WhatsApp data collected for the PhD research of Lieke Verheijen (Radboud University). Informed consent was obtained only from the contributor, not from the conversational partner. Consequently, the subcorpus contains only the contributions of the submitter.
* [https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:112987 Project website]
<!--T:6-->
==TwiSty Author Profiling Corpus== | |||
TwiSty is a corpus developed for research in author profiling. It contains personality (MBTI) and gender annotations for a total of 18,168 authors spanning six languages. The corpus distributes the Twitter IDs of these authors as well as the IDs of their tweets that were available at the time of corpus development. The tweets have undergone language identification and are divided into two categories: Confirmed (identified as belonging to the language in which the author is situated) and Other.
</translate> | |||
* [https://zenodo.org/records/4638948 Webpage]
* [https://aclanthology.org/L16-1258/ Paper]
Latest revision as of 09:26, 17 June 2024