Spoken corpora: Difference between revisions

From Clarin K-Centre
Spoken corpora are corpora that consist of spoken data or material based on spoken data.


We have the following corpora:
==Corpus Gesproken Nederlands==
(Spoken Dutch Corpus)
 
* 900 hours of spoken Dutch
* 1998 - 2004
* tagged, lemmatized, annotated (orthographic/phonetic)
* corpus exploration software (Corex)
* version 2.0.3
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/cgn_website/doc_English/start.htm Project website]
* [http://hdl.handle.net/10032/tm-a2-k6 Download page]
* Online search: [https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search OpenSonar]. In ''Extended Mode'' you can restrict the search to the Spoken Dutch Corpus.
 
=== Description ===
The Corpus Gesproken Nederlands (Spoken Dutch Corpus) is a collection of 900 hours (almost 9 million words) of contemporary Dutch as spoken by native speakers in Flanders and the Netherlands.
 
The speech recordings are aligned with several transcriptions (e.g. orthographic, phonetic) and annotations (syntax, POS tags). Also included are metadata, lexica, frequency lists, and Corex, a tool for exploring the data.


* [[Corpus Gesproken Nederlands (CGN)]]: Spoken Dutch Corpus (Dutch/Flemish)
* [[JASMIN-spraakcorpus]]: a corpus of contemporary Dutch (Dutch/Flemish) as spoken by children of different age groups, elderly people, and non-native speakers with different mother tongues, as well as speech from human-machine interaction

Revision as of 09:19, 2 March 2021
