Translations:Manually annotated corpora/2/en

From Clarin K-Centre
Revision as of 14:34, 14 March 2024 by FuzzyBot (talk | contribs) (Importing a new version from external source)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Corpus Gesproken Nederlands

The Spoken Dutch Corpus comprises a large number of samples of (recorded) spoken text (appr. 9 million words). The entire corpus has been transcribed orthographically, while the transcripts have been linked to the speech files. The orthographic transcription was used as the starting point for the lemmatization and part-of-speech tagging of the corpus. For a selection of one million words, a (verified) broad phonetic transcription has been produced, while for this part of the corpus also the alignment of the transcripts and the speech files has been verified at the word level. In addition, a selection of one million words has been annotated syntactically. Finally, for a more modest part of the corpus, approximately 250,000 words, a prosodic annotation is available.