Other corpora

From Clarin K-Centre


The Basilex corpus is an annotated collection of texts written for children aged four to twelve.


The BasiScript Corpus is an annotated collection of texts written by children aged four to twelve.

CLiPS Stylometry Investigation (CSI) Corpus

The CSI corpus is a corpus of student texts in two genres, essays and reviews, that is expanded yearly. It is intended primarily for stylometric research, but other applications are possible. Extensive metadata is available, both on the author (gender, age, sexual orientation, region of origin, personality profile) and on the document (timestamp, genre, veracity, sentiment, grade). The current version of the corpus was assembled in February 2016. Previous versions of the corpus are available from the authors via e-mail request.


The CONDIV-corpus was specifically designed to study the convergence or divergence (hence the name) between Netherlandic Dutch and Belgian Dutch. It contains a synchronic and a diachronic part. To get access to the data, contact Dirk Speelman at KU Leuven.


The COREA Coreference Corpus is a corpus of Dutch texts annotated with coreference relations.


The D-TUNA Corpus consists of 2400 written and (transcribed) spoken referential expressions.


The DBRD (pronounced dee-bird) dataset contains over 110k book reviews, of which 22k have associated binary sentiment polarity labels. It is intended as a benchmark for sentiment classification in Dutch. The dataset can be used to train language models (sequence modeling) and, using the provided positive/negative polarity labels, to train text classifiers for sentiment classification.

Dutch Audio Description Corpus

The Dutch Audio Description corpus comprises the transcribed texts of 39 audio-described Dutch films and TV series, totalling 154,570 words and 3,074 minutes of video. The corpus was used to extract quantitative data on the language of audio description (AD): frequency counts of parts of speech, words, lemmas and collocations, as well as other text statistics such as reading speed, word and sentence length, text readability and type-token ratio (a statistical measure of lexical variety).
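
As an illustration of the type-token ratio mentioned above, the following is a minimal sketch in Python. It is not the procedure used by the corpus authors: the tokenizer (lowercasing and splitting on word characters), the function names and the sample sentence are all assumptions made for this example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens.

    Assumption: a simple regex tokenizer; the corpus authors may have
    tokenized differently.
    """
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text: str) -> float:
    """Return the number of distinct tokens (types) divided by the number of tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical sample sentence, not taken from the corpus.
sample = "De man loopt naar de deur en de vrouw kijkt naar de man."
print(round(type_token_ratio(sample), 3))  # 8 types / 13 tokens
```

A higher ratio indicates more varied vocabulary. Because the ratio tends to fall as texts get longer, values are best compared between texts of similar length.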


The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations annotated at the sentence level with their position on Leary's Rose, expressed in terms of its two defining dimensions: "dominance" and "affinity".

Dutch Idiom Database: Native Speakers (DID-NS)

The DID-NS is a database with appreciations by 390 native speakers of 374 Dutch idiomatic expressions.

NAMES Corpus

The NAMES Corpus is a corpus of Dutch given names and surnames as found in 19th-century birth, marriage and death certificates. The name variants have been assigned to a standard form.

Personae Corpus

The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. The original texts, a syntactically annotated version of the texts, and the metadata are available.


A corpus sample of Dutch human-computer dialogues annotated with negation cues.

Multimodal ABEA

Multimodal ABEA is a multimodal dataset for aspect-based sentiment and emotion detection. It consists of 4,900 comments on 175 images from the Adidas Instagram page and is annotated with both aspect and emotion labels.