Reference corpora

From Clarin K-Centre

Corpus Hedendaags Nederlands

A collection of more than 2.5 million texts taken from newspapers, magazines, news broadcasts, ...

The CHN (‘Corpus of Contemporary Dutch’) contains all modern Dutch text material held by the INT for which the INT has acquired the right to publish the data online. The corpus covers material from the Netherlands, Flanders, Suriname and the Dutch Antilles. Data from the legacy corpora of the former INL (the 5, 27 and 38 million word corpora and the Parole corpus) are part of this corpus.

However, the legal corpus with data from 1814-2000, which was originally part of the 38 million word corpus, was removed. It is available as a separate corpus.

Lassy Large

The Lassy Large Corpus is a collection of written texts consisting of approximately 700 million words with automatically generated annotations. The lemmas and POS tags were generated with Tadpole (now Frog), and the syntactic dependency structures were generated with Alpino.

SoNaR corpus

The SoNaR corpus is a text corpus consisting of two parts, namely SoNaR-500 and SoNaR-1.

SoNaR-500 contains more than 500 million words of text from various domains and genres. All texts were tokenized, POS-tagged and lemmatized, and the named entities were labeled. All SoNaR-500 annotations were generated automatically.

SoNaR-1 is largely a subset of SoNaR-500 and contains 1 million words. SoNaR-1 was enriched with several types of semantic annotation, namely named entity labeling, co-reference annotation and the annotation of spatial and temporal relationships. All SoNaR-1 annotations were manually verified.

The new media texts (tweets, chats and text messages), which were also collected within the framework of the STEVIN project SoNaR, are not part of the SoNaR corpus 1.0 and are available separately as the SoNaR New Media Corpus.