Translations:Corpus querying/4/en: Difference between revisions
Latest revision as of 14:53, 6 August 2024
OpenSonar
The current application contains two corpora:
- The SoNaR corpus (see Reference corpora for more information).
- The Corpus of Spoken Dutch (Corpus Gesproken Nederlands, CGN), a collection of 900 hours (almost 9 million words) of contemporary Dutch speech from Flemish and Dutch speakers. The speech fragments (spontaneous and prepared) are aligned with several transcriptions (including orthographic and phonetic) and annotations (lemma, POS tags). All annotations have been verified manually, except for the phonetic transcription, of which only 11.3% was verified. The corpus data are available to researchers at http://hdl.handle.net/10032/tm-a2-k6.