<div lang="en" dir="ltr" class="mw-content-ltr">
Spoken corpora are corpora that consist of spoken data or material based on spoken data.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Boarnsterhim Corpus (BHC) (Currently unavailable)==
The Boarnsterhim Corpus consists of 250 hours of speech in both West Frisian and Dutch by the same sample of bilingual speakers. The corpus contains original recordings from 1982-1984 and a replication study recorded 35 years later. The data collection spans speech of four generations, and combines panel and trend data.
''This corpus is temporarily unavailable because it is under revision. For more information, please contact Hans van de Velde (HvandeVelde@fryske-akademy.nl) or Wilbert Heeringa, data manager of the Fryske Akademy (wheeringa@fryske-akademy.nl).''
*42.6 MB
*version 1.0 (2020)
*data set from 1982-1984 + replication 35 years later
*[http://hdl.handle.net/10032/tm-a2-r4 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
== COPAS: Corpus Pathologische en Normale Spraak ==
A collection of recordings of almost 200 speakers with an audible speech impediment and a control group of 122 speakers.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* Belgian Dutch
* [http://hdl.handle.net/10032/tm-a2-n3 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Corpus Gesproken Nederlands== 
The Spoken Dutch Corpus contains almost 9 million words of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS tags). Metadata, lexica, frequency lists and Corex, a tool for exploring the data, are included.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* 900 hours of spoken Dutch
* 1998 - 2004
* tagged, lemmatized, annotated (orthographic/phonetic)
* corpus exploration software (Corex)
* version 2.0.3
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/cgn_website/doc_English/start.htm Project website]
* [http://hdl.handle.net/10032/tm-a2-k6 Download page]
* [https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search Online search with OpenSonar]. In ''Extended Mode'' you can restrict the search to the Spoken Dutch Corpus. (See [[Corpus querying]] for more information on OpenSonar.)
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==IFA Spoken Language Corpus==
The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech at the phoneme level. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
*version 1.0 (2001)
*4.6 MB
*[http://hdl.handle.net/10032/tm-a2-n8 Download page]
*[https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFAcorpus/ Project website]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==JASMIN-spraakcorpus==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
A corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people and non-natives with different mother tongues, including recordings of human-machine interaction.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* 115 hours of spoken Dutch
* speech of children, elderly people and non-natives, and human-machine interaction
* verbatim transcription, a transcription of the human-machine interaction (HMI) phenomena, POS tagging of the words, and an automatic phonetic transcription 
* version 1.0 (2008)
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/jasmin_lrec2008_en.pdf Recording Speech of Children, Non-Natives and Elderly People for HLT Applications: the JASMIN-CGN Corpus (LREC Proceedings 2008)]
* [http://hdl.handle.net/10032/tm-a2-j7 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==SABeD -- Spoken Academic Belgian Dutch==
The Spoken Academic Belgian Dutch Corpus consists of 200 lectures given in higher education institutions in Flanders. The first 25 and the last 5 minutes of each lecture were transcribed with an ASR system tuned to Belgian Dutch. The utterances were then segmented manually and the automatic transcription was corrected by hand. The resulting text was processed with the FROG language analyser.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* Version 1.1 (2025)
* [https://www.arts.kuleuven.be/ling/language-education-society/projects/sabed Project website]
* [https://hdl.handle.net/10032/tm-a3-a9 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==AUTONOMATA-namencorpus==
The AUTONOMATA Spoken Names Corpus is a database of about 5000 read first names, surnames, street names, city names and check words in total.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* version 1.0 (2008)
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/auto-nc_lrec2006_en.pdf Paper]
* [http://hdl.handle.net/10032/tm-a2-m2 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==AUTONOMATA-POI-corpus==
The AUTONOMATA POI Corpus is a corpus of 800 pronounced points of interest from the Netherlands and Belgium containing names of restaurants, camping sites, cafés, etc.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/auto-poi_documentatie_nl.pdf Documentation] 
* [http://lands.let.ru.nl/projects/AutonomataToo/index.php Project website]
* [http://hdl.handle.net/10032/tm-a2-n7 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Children's Oral Reading Corpus (CHOREC)==
The CHOREC Corpus contains recorded, transcribed and annotated read speech (42 GB or 130 hours) of 400 Dutch-speaking elementary school children with or without reading difficulties. Inter- and intra-annotator agreement was analysed to investigate how consistently reading errors are detected, orthographic and phonetic transcriptions are made, and reading errors and reading strategies are labeled.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/chorec_documentatie_en.pdf Paper]
* [http://hdl.handle.net/10032/tm-a2-j5 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==BLISS Dialogue Summaries==
This dataset consists of Dutch recordings of participants talking with the BLISS dialogue system about their everyday occupations and their favorite activities. The corpus contains 55 recordings with an average duration of 2 minutes and 34 seconds.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
*[http://hdl.handle.net/10032/tm-a2-v3 Download page]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==The Ernestus Corpus of Spontaneous Dutch==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The Ernestus Corpus of Spontaneous Dutch contains high-quality recordings of 10 conversations, each 90 minutes long, between friends or direct colleagues. The corpus was recorded between autumn 1995 and spring 1996 at the Institute of Phonetics of the University of Amsterdam.
Professional transcribers have created an orthographic transcription of the corpus by hand, while a phonemic transcription has been created automatically.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
*Publication: M. Ernestus (2000). Voice assimilation and segment reduction in casual Dutch: A corpus-based study of the phonology-phonetics interface. Holland Institute of Generative Linguistics, Utrecht.
*[https://mirjamernestus.nl/Ernestus/ECSD/index.php Website]
*[https://hdl.handle.net/1839/a8025f06-cf20-4183-aae5-7c3309bc8c9d Mirjam Ernestus (2000). Item "The Ernestus Corpus of Spontaneous Dutch" in collection "Nijmegen corpora of casual speech". The Language Archive. https://hdl.handle.net/1839/a8025f06-cf20-4183-aae5-7c3309bc8c9d]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
== The Parsed Corpus of Southern Dutch Dialects (GCND) ==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The Parsed Corpus of Southern Dutch Dialects (GCND) is a linguistically annotated spoken corpus based on existing dialect recordings from the 1960s and 1970s (Voices from the past), supplemented with existing recordings from the Meertens Institute and a number of new recordings. The corpus provides audio-aligned transcriptions in two layers, one closer to the dialect and one closer to Standard Dutch; both are part-of-speech tagged and syntactically annotated. The corpus is meant to facilitate large-scale research into syntactic particularities of the southern Dutch dialects.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* [https://www.gcnd.ugent.be/en/home/ Website]
* [https://hdl.handle.net/10032/tm-a2-z8 Search online]
* [https://gcnd-gretel.ivdnt.org Treebank query]
</div>