Spoken corpora
Revision as of 15:26, 18 March 2021
Spoken corpora are corpora that consist of spoken data or material based on spoken data.
Boarnsterhim Corpus (BHC)
The Boarnsterhim Corpus consists of 250 hours of speech in both West Frisian and Dutch, produced by the same sample of bilingual speakers. The corpus contains original recordings from 1982-1984 and a replication study recorded 35 years later. The data collection covers the speech of four generations and combines panel and trend data.
- 42.6 MB
- version 1.0 (2020)
- Download page: http://hdl.handle.net/10032/tm-a2-r4
Corpus Gesproken Nederlands
The Corpus Gesproken Nederlands (Spoken Dutch Corpus) contains almost 9 million words of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.
The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS tags). Metadata, lexica, frequency lists, and Corex, a tool for exploring the data, are included.
- 900 hours of spoken Dutch
- 1998 - 2004
- tagged, lemmatized, annotated (orthographic/phonetic)
- corpus exploration software (Corex)
- version 2.0.3
- Project website
- Download page
- Online search with OpenSonar: in Extended Mode you can restrict the search to the Spoken Dutch Corpus. (See Corpus querying for more information on OpenSonar.)
IFA Spoken Language Corpus
The IFA Spoken Language Corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker.
- version 1.0 (2001)
- 4.6 MB
- Download page
- Project website
JASMIN-spraakcorpus
A corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people, and non-native speakers with different mother tongues, and as used in human-machine interaction.
- 115 hours of spoken Dutch
- speech of children, elderly people and non-natives, and human-machine interaction
- annotations: verbatim transcription, transcription of human-machine interaction (HMI) phenomena, POS tagging, and automatic phonetic transcription
- version 1.0 (2008)
- Recording Speech of Children, Non-Natives and Elderly People for HLT Applications: the JASMIN-CGN Corpus (LREC Proceedings 2008)
- Download page