Spoken corpora: Difference between revisions

From Clarin K-Centre
 
(7 intermediate revisions by 2 users not shown)
<languages/>
<translate>
<!--T:1-->
Spoken corpora are corpora that consist of spoken data or material based on spoken data.


<!--T:2-->
==Boarnsterhim Corpus (BHC) (Currently unavailable)==
The Boarnsterhim Corpus consists of 250 hours of speech in both West Frisian and Dutch by the same sample of bilingual speakers. The corpus contains original recordings from 1982-1984 and a replication study recorded 35 years later. The data collection spans speech of four generations, and combines panel and trend data.
 
''This corpus is temporarily unavailable because it is under revision. For more information, please contact Hans van de Velde (HvandeVelde@fryske-akademy.nl) or Wilbert Heeringa, data manager of the Fryske Akademy (wheeringa@fryske-akademy.nl).''
*42.6 MB
*version 1.0 (2020)
*[http://hdl.handle.net/10032/tm-a2-r4 Download page]


<!--T:3-->
== COPAS: Corpus Pathologische en Normale Spraak ==
A collection of recordings of almost 200 speakers with an audible speech impediment and a control group of 122 speakers.


<!--T:4-->
* Belgian Dutch
* [http://hdl.handle.net/10032/tm-a2-n3 Download page]
* [https://www.esat.kuleuven.be/psi/spraak/projects/SPACE/ Project page]


<!--T:5-->
==Corpus Gesproken Nederlands==
The Spoken Dutch Corpus contains almost 9 million words of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.


<!--T:6-->
The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS-tags). Metadata, lexica, frequency lists, and Corex, a tool for exploring the data, are included.


<!--T:7-->
* 900 hours of spoken Dutch
* 1998 - 2004
* [https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search Online search with OpenSonar]. In ''Extended Mode'' you can restrict the search to the Spoken Dutch Corpus. (See [[Corpus querying]] for more information on OpenSonar.)


<!--T:8-->
==IFA Spoken Language Corpus==
The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech at the phoneme level. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker.


<!--T:9-->
*version 1.0 (2001)
*4.6 MB
*[https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFAcorpus/ Project website]


==JASMIN-spraakcorpus== <!--T:10-->


<!--T:11-->
A corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people and non-natives with different mother tongues, and in human-machine interaction.


<!--T:12-->
* 115 hours of spoken Dutch
* speech of children, elderly people and non-natives, and human-machine interaction
* [http://hdl.handle.net/10032/tm-a2-j7 Download page]


<!--T:13-->
==SABeD -- Spoken Academic Belgian Dutch==
The SABeD corpus collection project started on 1 March 2021; the corpus is not yet available. The corpus of spoken academic Belgian Dutch will consist of at least 200 lectures.


<!--T:14-->
* [https://www.arts.kuleuven.be/ling/language-education-society/projects/sabed Project website]


<!--T:15-->
==AUTONOMATA-namencorpus==
The AUTONOMATA Spoken Names Corpus is a database with a total of about 5000 read first names, surnames, street names, city names and check words.


<!--T:16-->
* version 1.0 (2008)
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/auto-nc_lrec2006_en.pdf Paper]
* [http://hdl.handle.net/10032/tm-a2-m2 Download page]


<!--T:17-->
==AUTONOMATA-POI-corpus==
The AUTONOMATA POI Corpus is a corpus of 800 pronounced points of interest from the Netherlands and Belgium containing names of restaurants, camping sites, cafés, etc.


<!--T:18-->
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/auto-poi_documentatie_nl.pdf Documentation]
* [http://lands.let.ru.nl/projects/AutonomataToo/index.php Project website]
* [http://hdl.handle.net/10032/tm-a2-n7 Download page]


<!--T:19-->
==Children's Oral Reading Corpus (CHOREC)==
The CHOREC Corpus contains recorded, transcribed and annotated read speech (42 GB or 130 hours) of 400 Dutch-speaking elementary school children with or without reading difficulties. Analyses of inter- and intra-annotator agreement are carried out in order to investigate the consistency with which reading errors are detected, orthographic and phonetic transcriptions are made, and reading errors and reading strategies are labeled.


<!--T:20-->
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/chorec_documentatie_en.pdf Paper]
* [https://www.esat.kuleuven.be/psi/spraak/projects/SPACE/ Project page]
* [http://hdl.handle.net/10032/tm-a2-j5 Download page]
<!--T:21-->
==BLISS Dialogue Summaries==
This dataset consists of Dutch recordings of participants talking with the BLISS dialogue system about their everyday occupations and their favorite activities. The corpus contains 55 recordings with an average duration of 2 minutes and 34 seconds.
<!--T:22-->
*[https://hstrik.ruhosting.nl/bliss/ Project page]
*[http://hdl.handle.net/10032/tm-a2-v3 Download page]
</translate>

Latest revision as of 10:12, 21 March 2024


Spoken corpora are corpora that consist of spoken data or material based on spoken data.

Boarnsterhim Corpus (BHC) (Currently unavailable)

The Boarnsterhim Corpus consists of 250 hours of speech in both West Frisian and Dutch by the same sample of bilingual speakers. The corpus contains original recordings from 1982-1984 and a replication study recorded 35 years later. The data collection spans speech of four generations, and combines panel and trend data. This corpus is temporarily unavailable because it is under revision. For more information, please contact Hans van de Velde (HvandeVelde@fryske-akademy.nl) or Wilbert Heeringa, data manager of the Fryske Akademy (wheeringa@fryske-akademy.nl).

  • 42.6 MB
  • version 1.0 (2020)
  • data set from 1982-1984 + replication 35 years later
  • Download page

COPAS: Corpus Pathologische en Normale Spraak

A collection of recordings of almost 200 speakers with an audible speech impediment and a control group of 122 speakers.

Corpus Gesproken Nederlands

The Spoken Dutch Corpus contains almost 9 million words of contemporary spoken Dutch from native speakers in Flanders and the Netherlands.

The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS-tags). Metadata, lexica, frequency lists, and Corex, a tool for exploring the data, are included.

IFA Spoken Language Corpus

The IFA Spoken Language corpus is a free (GPL) database of hand-segmented Dutch speech at the phoneme level. It was constructed with off-the-shelf software using speech from 8 speakers in a variety of speaking styles. For a total of 50,000 words (41 minutes/speaker), speech acquisition and preparation took around 3 person-weeks per speaker.

JASMIN-spraakcorpus

A corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people and non-natives with different mother tongues, and in human-machine interaction.

SABeD -- Spoken Academic Belgian Dutch

The SABeD corpus collection project started on 1 March 2021; the corpus is not yet available. The corpus of spoken academic Belgian Dutch will consist of at least 200 lectures.

AUTONOMATA-namencorpus

The AUTONOMATA Spoken Names Corpus is a database with a total of about 5000 read first names, surnames, street names, city names and check words.

AUTONOMATA-POI-corpus

The AUTONOMATA POI Corpus is a corpus of 800 pronounced points of interest from the Netherlands and Belgium containing names of restaurants, camping sites, cafés, etc.

Children's Oral Reading Corpus (CHOREC)

The CHOREC Corpus contains recorded, transcribed and annotated read speech (42 GB or 130 hours) of 400 Dutch-speaking elementary school children with or without reading difficulties. Analyses of inter- and intra-annotator agreement are carried out in order to investigate the consistency with which reading errors are detected, orthographic and phonetic transcriptions are made, and reading errors and reading strategies are labeled.

BLISS Dialogue Summaries

This dataset consists of Dutch recordings of participants talking with the BLISS dialogue system about their everyday occupations and their favorite activities. The corpus contains 55 recordings with an average duration of 2 minutes and 34 seconds.