L2 learner corpora
<languages/>
<translate>
<!--T:1-->
L2 learner corpora play a crucial role in second language research and pedagogy. They allow systematic study of how learners acquire a second language at both the lexical and the syntactic level, and of how that acquisition is influenced by their native language. A distinguishing characteristic of this type of corpus is the markup of learners' errors and prosodic features.
For more information and resources, visit the [https://uclouvain.be/en/research-institutes/ilc/clarin-knowledge-centre-for-learner-corpora.html '''''CLARIN Knowledge Center for Learner Corpora'''''].
<!--T:2-->
==Corpus Ondertitelde UVN-Colleges (COUC)==
This corpus contains 57 subtitled lectures (as of 2020-07-16) from the Universiteit van Nederland (UVN). The subtitles were added to existing video recordings of UVN lectures.
<!--T:3-->
Unlike common subtitles, the subtitles generated in this project are a nearly 100% literal representation of the speech as spoken by the people in the recordings. They are exact orthographic transcriptions of the words in the order they were spoken, and so show the peculiarities of spoken language, including the lack of the grammatical coherence typical of written text.
On the other hand, the transcriptions contain neither speaker noises (such as lip smacks or coughs) nor hesitation sounds such as "ehm". Punctuation was added for the sake of readability.
<!--T:4-->
*22 MB
*version 1.0 (2020)
*[http://hdl.handle.net/10032/tm-a2-s3 Download page]
<!--T:5-->
==Meertalige Ondertiteldata 2BDutch==
This product consists of the subtitle data for the Dutch videos on the website www.2BDutch.nl, which offers videos with subtitle options in various languages. With these videos, learners of Dutch at all levels can practice their listening skills and learn new Dutch words. The subtitle data can also be used for language and speech technology applications, including machine translation and automatic speech recognition.
<!--T:6-->
*36 KB
*version 1.0 (2020)
*[http://hdl.handle.net/10032/tm-a2-m5 Download page]
<!--T:7-->
==Multilingual Traditional Immersion and Native Corpus==
The Multilingual Traditional Immersion and Native Corpus (MulTINCo) contains spoken and (longitudinal) written data collected from French-speaking learners of Dutch and English as a second language (L2) in different educational settings (CLIL and traditional L2 classes). The database also contains numerous background variables, written productions in the learners’ first language (L1), French, and productions from native speakers of the learners’ L2s (L1 Dutch and L1 English data).
<!--T:8-->
*[https://corpora.uclouvain.be/catalog/corpus/multinco Corpus webpage]
<!--T:9-->
==Modern Times==
Narrations of an extract from ''Modern Times'' (Charlie Chaplin, 1936), produced by native speakers and learners of Dutch and French.
*[https://corpora.uclouvain.be/catalog/corpus/modern-times Corpus webpage (currently dead)]
<!--T:10-->
==Leerdercorpus Nederlands==
A varied collection of writing tasks by learners of Belgian Dutch at different levels. The corpus consists of approximately 775,000 words; the texts were written between 1998 and 2007.
<!--T:11-->
*[https://corpora.uclouvain.be/catalog/corpus/leerdercorpus-nederlands Corpus webpage]
<!--T:12-->
==LeCoNTra==
LeCoNTra is a learner corpus of English-to-Dutch news translations enriched with translation process data. Three students in a Master’s programme in Translation were asked to translate 50 different English journalistic texts of approximately 250 tokens each. Because translation process data were also collected through keystroke logging, the dataset can be used in several research strands, such as translation process research, learner corpus research, and corpus-based translation studies. Reference translations, without process data, are also included. The data have been manually segmented and tokenized, and manually aligned at both segment and word level, yielding a high-quality corpus with token-level process data.
<!--T:13-->
* [https://github.com/BramVanroy/LeCoNTra Download page]
* [https://aclanthology.org/2022.lrec-1.192/ Vanroy, B. and Macken, L. (2022). LeConTra: A Learner Corpus of English-to-Dutch News Translation.]
</translate>
Latest revision as of 14:20, 14 March 2024