L2 learner corpora: Difference between revisions

From Clarin K-Centre
Jump to navigation Jump to search
Line 21: Line 21:
MulTINCo includes spoken and (longitudinal) written data collected from French-speaking learners of Dutch and English as a second language (L2) in different educational settings (CLIL and traditional L2 classes). The database contains numerous background variables, as well as written productions in the learners’ first language (L1) (viz. French) and productions from native speakers of the learners’ L2 (viz. L1 Dutch and L1 English data).  
MulTINCo includes spoken and (longitudinal) written data collected from French-speaking learners of Dutch and English as a second language (L2) in different educational settings (CLIL and traditional L2 classes). The database contains numerous background variables, as well as written productions in the learners’ first language (L1) (viz. French) and productions from native speakers of the learners’ L2 (viz. L1 Dutch and L1 English data).  


*[[https://corpora.uclouvain.be/catalog/corpus/multinco]]
*[https://corpora.uclouvain.be/catalog/corpus/multinco Corpus webpage]

Revision as of 15:43, 16 November 2021

L2 learner corpora play a crucial role in second language research and pedagogy, allowing for a systematic study of how a learner of a second language acquires the new language on a lexical as well as syntactic level, and how it is influenced by his or her native language. A special characteristic of this type of corpora are the markup of errors and prosodic features of the learners.

Corpus Ondertitelde UVN-Colleges (COUC)

This corpus contains 57 (2020-07-16) subtitled lectures from the Universiteit van Nederland (UVN). Subtitles were added to existing video recordings of lectures of the UVN.

Unlike common subtitles, the subtitles generated in this project are a nearly 100% literal representation of the speech as spoken by the people in the recordings. They contain exact orthographic transcriptions of subsequent words and thus show the peculiarities of the spoken language modality, lacking grammatical coherence typical for written texts. On the other hand, the transcriptions do not contain speaker noises (such as lip smacks or coughs) nor hesitation sounds as "ehm". For the sake of readability punctuation markers were included.

Meertalige Ondertiteldata 2BDutch

This product consists of the subtitle data belonging to the Dutch videos on the website www.2BDutch.nl. The 2BDutch website contains videos with subtitle options in various languages. With these videos, students of all levels of Dutch can practice their listening skills and learn new Dutch words. The subtitle data belonging to these videos can also be used for various language and speech technology applications including automatic translation and automatic speech recognition.

Multilingual Traditional Immersion and Native Corpus

MulTINCo includes spoken and (longitudinal) written data collected from French-speaking learners of Dutch and English as a second language (L2) in different educational settings (CLIL and traditional L2 classes). The database contains numerous background variables, as well as written productions in the learners’ first language (L1) (viz. French) and productions from native speakers of the learners’ L2 (viz. L1 Dutch and L1 English data).