Spoken corpora: Difference between revisions

From Clarin K-Centre

Revision as of 15:09, 15 November 2021

Spoken corpora are corpora that consist of spoken data or material based on spoken data.

Boarnsterhim Corpus (BHC)

The Boarnsterhim Corpus consists of 250 hours of speech in both West Frisian and Dutch by the same sample of bilingual speakers. The corpus contains original recordings from 1982-1984 and a replication study recorded 35 years later. The data collection covers the speech of four generations and combines panel and trend data.

  • 42.6 MB
  • version 1.0 (2020)
  • data set from 1982-1984 + replication 35 years later
  • Download page

COPAS: Corpus Pathologische en Normale Spraak

A collection of recordings of almost 200 speakers with an audible speech impediment and a control group of 122 speakers.

Corpus Gesproken Nederlands

The Spoken Dutch Corpus contains almost 9 million words of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.

The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS-tags). Also included are metadata, lexica, frequency lists and Corex, a tool for exploring the data.

IFA Spoken Language Corpus

The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker.

JASMIN-spraakcorpus

A corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people and non-native speakers with different mother tongues, as well as in human-machine interaction.

SABeD -- Spoken Academic Belgian Dutch

(work in progress)

A corpus of spoken academic Belgian Dutch consisting of at least 200 lectures.