Corpora and lexicons

From Clarin K-Centre
Revision as of 15:33, 26 November 2020

Brieven als Buit

Gysseling Corpus

The Gysseling Corpus is the collection of all 13th-century texts that served as source material for the Dictionary of Early Middle Dutch (VMNW). It consists mainly of official and literary texts that have been handed down in 13th-century manuscripts.

The texts are diplomatic editions, which means that the source texts have been rendered in modern script as accurately as possible. The corpus has been linguistically annotated with word classes and modern Dutch lemmas (entry words) to enhance its searchability. The annotation of the entire corpus has been manually verified.

Corpus of Contemporary Dutch

To monitor contemporary Dutch, the Dutch Language Institute has created the Corpus of Contemporary Dutch (CHN): an ever-growing collection that already contains more than 800,000 texts from newspapers, magazines, news broadcasts and legal materials.

Contents of the CHN

We try to include sources in this corpus that continually provide us with new text materials. In principle, however, all text materials used in the various projects of the Dutch Language Institute end up in the CHN, such as the ANW corpus (1970 to the present), compiled for our Dictionary of Contemporary Dutch.

From 1994 onwards, the Institute for Dutch Lexicology (INL), the predecessor of the Dutch Language Institute, put several corpora of contemporary Dutch online: the 5, 27 and 38 million word corpora, and the Dutch Parole Internet Corpus. The materials from these older corpora have been added to the CHN.

Corpus of Middle Dutch

The Corpus of Middle Dutch is a collection of rhyming texts and prose from the period 1300-1550. It contains classics such as Beatrijs, Van den vos Reynaerde, the abele spelen, the stories about King Arthur and about Charlemagne, and all texts from the famous Gruuthuse manuscript (including the Egidius song). It also contains many lesser-known or less-researched texts, such as prose adaptations of the rhyming knight’s tales (the so-called ‘folk books’), collections of songs such as the Antwerp Songbook, various Bible translations, hagiographies, books of prayer, chronicles, and all kinds of religious, didactic and scientific treatises, medical manuals and recipes.

The corpus was compiled on the basis of mainly critical, scientifically sound text editions. In time, it will be annotated with word classes and lemmas, to improve searchability.

Corpus of Old Dutch

The Corpus of Old Dutch is the collection of all texts in Old Dutch that served as source material for the Dictionary of Old Dutch (ONW). The texts originate from the period between 475 and 1200.

The texts in Old Dutch that Maurits Gysseling had collected and transcribed formed the basis of this collection. They have been supplemented with texts like the Mittelfränkische Reimbibel, glosses like the Malbergse glossen to the Lex Salica, and anthroponymic and toponymic material. The corpus has been annotated with word classes and lemmas. The annotation of the entire corpus has been manually verified.

What is Old Dutch

Old Dutch is the collective term for several related dialects that – just like Old English, Old Frisian, Old Saxon, and Old High German – developed out of West Germanic around the beginning of the fifth century. It was spoken in an area that does not entirely correspond with the current Dutch-speaking region.

Differentiating between Old Dutch, Old Saxon, and Old Frisian is sometimes difficult. The editors of the Dictionary of Old Dutch, who were responsible for the compilation of the corpus, applied a liberal admission policy. Nevertheless, not all texts from Gysseling’s original Old Dutch collection were incorporated into the corpus. One example is the Heliand, a poem that was left out because it was written in Old Saxon.