L2 learner corpora: Difference between revisions

From Clarin K-Centre


*[https://corpora.uclouvain.be/catalog/corpus/leerdercorpus-nederlands Corpus webpage]
==LeCoNTra==
LeCoNTra is a learner corpus consisting of English-to-Dutch news translations enriched with translation process data. Three students of a Master’s programme in Translation were asked to translate 50 different English journalistic texts of approximately 250 tokens each. Because we also collected translation process data in the form of keystroke logging, our dataset can be used as part of different research strands such as translation process research, learner corpus research, and corpus-based translation studies. Reference translations, without process data, are also included. The data has been manually segmented and tokenized, and manually aligned at both segment and word level, leading to a high-quality corpus with token-level process data.
* [https://github.com/BramVanroy/LeCoNTra Download page]
* [https://aclanthology.org/2022.lrec-1.192/ Vanroy, B. and Macken, L. (2022). LeConTra: A Learner Corpus of English-to-Dutch News Translation.]

Revision as of 14:49, 1 December 2022

L2 learner corpora play a crucial role in second language research and pedagogy. They allow for a systematic study of how learners acquire a second language at the lexical and syntactic levels, and of how that acquisition is influenced by their native language. A distinguishing characteristic of this type of corpus is the markup of learners' errors and prosodic features.

Corpus Ondertitelde UVN-Colleges (COUC)

As of 2020-07-16, this corpus contains 57 subtitled lectures from the Universiteit van Nederland (UVN). The subtitles were added to existing video recordings of UVN lectures.

Unlike common subtitles, the subtitles generated in this project are a nearly 100% literal representation of the speech as spoken by the people in the recordings. They contain exact orthographic transcriptions of the successive words and thus show the peculiarities of the spoken language modality, lacking the grammatical coherence typical of written texts. On the other hand, the transcriptions contain neither speaker noises (such as lip smacks or coughs) nor hesitation sounds such as "ehm". For the sake of readability, punctuation marks were included.

Meertalige Ondertiteldata 2BDutch

This product consists of the subtitle data belonging to the Dutch videos on the website www.2BDutch.nl. The 2BDutch website contains videos with subtitle options in various languages. With these videos, learners of Dutch at all levels can practice their listening skills and learn new Dutch words. The subtitle data can also be used for various language and speech technology applications, including automatic translation and automatic speech recognition.

Multilingual Traditional Immersion and Native Corpus (MulTINCo)

MulTINCo includes spoken and (longitudinal) written data collected from French-speaking learners of Dutch and English as a second language (L2) in different educational settings (CLIL and traditional L2 classes). The database contains numerous background variables, as well as written productions in the learners’ first language (L1) (viz. French) and productions from native speakers of the learners’ L2 (viz. L1 Dutch and L1 English data).

Modern Times

Narrations based on an extract from Modern Times (Charlie Chaplin, 1936) by native speakers and learners of Dutch and French.

Leerdercorpus Nederlands

A varied collection of writing tasks for learners of Dutch at different levels.
