Parallel Multilingual Corpora
Latest revision as of 06:51, 20 June 2024
EDGeS Diachronic Bible Corpus
A diachronically and synchronically parallel corpus of Bible translations in Dutch, English, German and Swedish, with texts from the 14th century until today.
- OpenEdges Download: https://spraakbanken.gu.se/en/resources/openedges
PacoMT Parallel Corpora
During the STEVIN project PaCo-MT (Parse and Corpus-based Machine Translation), two existing parallel corpora were enriched with syntactic annotations and node alignments. The annotations were generated automatically.
Language Pairs: English to Dutch, Dutch to English, French to Dutch, Dutch to French.
- version 1.0
- data set from 2014
- 38.8 MB
- Download page
- Project website: http://www.ccl.kuleuven.be/Projects/PACO/paco.php
The Dutch Parallel Corpus
The Dutch Parallel Corpus (DPC) is a 10-million-word, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French, with Dutch as the central language.
The corpus contains five different text types and is balanced with respect to text type and translation direction. The entire corpus has been aligned at sentence level and further enriched with linguistic information (lemmas and PoS-tags). A small subset of the Dutch-English part has also been manually aligned at the sub-sentential level.
- Download page: http://hdl.handle.net/10032/tm-a2-h3
- Project website: https://www.kuleuven-kulak.be/dpc/en/
The Open Parallel Corpus (OPUS)
The OPUS corpus (https://opus.nlpl.eu/) is a very large collection of parallel corpora, many of which include Dutch.
COVID-19 Corpora
COVID-19 - HEALTH Wikipedia dataset. Bilingual (EN-NL)
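Bilingual (EN-NL) corpus acquired from Wikipedia in the health and COVID-19 domain (2nd May 2020). The corpus contains 931 translation units.
- Version 1.0 (2020)
- Download page: https://elrc-share.eu/repository/browse/covid-19-health-wikipedia-dataset-bilingual-en-nl/b36eccb88de811ea913100155d0267065632b235f586445aa0c67da0afcdfc0e/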
COVID-19 ANTIBIOTIC dataset. Bilingual (EN-NL)
Bilingual (EN-NL) corpus acquired from the website https://antibiotic.ecdc.europa.eu/. The corpus contains 805 translation units.
- Version 1.0 (2020)
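- Download page: https://elrc-share.eu/repository/browse/covid-19-antibiotic-dataset-bilingual-en-nl/9c5009c0904511ea913100155d026706169da04f5eb448178c8954eb8f874db1/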
COVID-19 EC-EUROPA v1 dataset. Bilingual (EN-NL)
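Bilingual (EN-NL) corpus acquired from the website (https://ec.europa.eu/*coronavirus-response) of the EU portal (20th May 2020). This corpus contains 2,391 translation units.
- Version 1.0 (2020)
- Download page: https://elrc-share.eu/repository/browse/covid-19-ec-europa-v1-dataset-bilingual-en-nl/c839dc1aa17911ea913100155d0267065bd070800d534300b9a82cbc55176caa/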
COVID-19 EU presscorner v2 dataset. Bilingual (EN-NL)
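Bilingual (EN-NL) corpus acquired from the website (https://ec.europa.eu/commission/presscorner/) of the EU portal (8th July 2020). This corpus contains 6,810 translation units.
- Version 2.0 (2020)
- Download page: https://elrc-share.eu/repository/browse/covid-19-eu-presscorner-v2-dataset-bilingual-en-nl/c924966ac5c811ea913100155d0267060010380f855d42b188d6225ace812c61/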
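COVID-19 EUR-LEX dataset. Bilingual (EN-NL)
Bilingual (EN-NL) corpus acquired from the website (https://eur-lex.europa.eu/legal-content) of the EU portal (9th July 2020). This corpus contains 22,470 translation units.
- Version 1.0 (2020)
- Download page: https://elrc-share.eu/repository/browse/covid-19-eur-lex-dataset-ilingual-en-nl/af906a80c5af11ea913100155d026706dc95cf79c8104ea2b5c9e7143216e8b6/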
COVID-19 EUROPARL v2 dataset. Bilingual (EN-NL)
Bilingual (EN-NL) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020). This corpus contains 887 translation units.
- Version 2.0 (2020)
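- Download page: https://elrc-share.eu/repository/browse/covid-19-europarl-v2-dataset-bilingual-en-nl/aca366f4941f11ea913100155d0267066f2c95e65e20479ba769a4ec18bb3373/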
COVID-19 Parallel Global Voices dataset. Bilingual (EN-NL)
EN-NL Bilingual COVID-19-related corpus acquired from the website (https://globalvoices.org/) of GlobalVoices (28th April 2020). This corpus contains 675 translation units.
- Version 1.0 (2020)
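- Download page: https://elrc-share.eu/repository/browse/covid-19-parallel-global-voices-dataset-bilingual-en-nl/df312cf0895211ea913100155d02670693358ccdbdf24ae79e142e3999159478/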
Bilingual corpus from the European Vaccination Portal (NL-EN)
NL-EN Bilingual corpus acquired from https://vaccination-info.eu. This corpus contains 494 translation units.
- Version 1.0 (2020)
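- Download page: https://elrc-share.eu/repository/browse/bilingual-corpus-from-the-european-vaccination-portal-nl-en/416f3388864e11ea913100155d026706f6cf8712d2304ecfa917aac7e5eb6731/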
Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-NL)
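EN-NL Bilingual corpus extracted from the Publications Office of the EU on the medical domain. The texts are sourced from laws, studies, EC announcements, etc., labelled with concepts such as epidemiology, epidemic, disease surveillance, health control, public hygiene, freedom of movement and distance learning. This corpus contains 13,191 translation units.
- Version 2.0 (2020)
- Download page: https://elrc-share.eu/repository/browse/bilingual-corpus-from-the-publications-office-of-the-eu-on-the-medical-domain-v2-en-nl/0795a5328ac411ea913100155d02670661b540c3ab9b437baf5a6c579c7edb3b/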
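Bilingual corpus made out of PDF documents from the European Medicines Agency (EMEA) (EN-NL)
EN-NL Bilingual corpus made out of PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu (February 2020). This corpus contains 762,433 translation units.
- Version 1.0 (2020)
- Download page: https://elrc-share.eu/repository/browse/bilingual-corpus-made-out-of-pdf-documents-from-the-european-medicines-agency-emea-httpswwwemaeuropaeu-february-2020-en-nl/93284c8e862411ea913100155d026706d3313f47bec143cd98cc4ba1aa62b4b5/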
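Several of the corpora above report their size in translation units. As a minimal sketch, assuming a corpus is distributed as a TMX file (an assumption; this page does not state the download format), the units could be read and counted with the Python standard library roughly as follows. The function name `read_translation_units` and the one-unit sample are invented for illustration and are not taken from any of the listed corpora.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml:lang attribute to this namespaced name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_translation_units(tmx_text):
    """Return one {language: segment} dict per <tu> element in a TMX string."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[lang] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


# Hypothetical sample, written for this sketch only.
SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wash your hands.</seg></tuv>
      <tuv xml:lang="nl"><seg>Was je handen.</seg></tuv>
    </tu>
  </body>
</tmx>"""

units = read_translation_units(SAMPLE_TMX)
print(len(units), "translation unit(s)")
```

For a file on disk, the same function could be called on the file's decoded contents; for a corpus the size of the EMEA set, a streaming parser such as `ET.iterparse` would likely be a better fit than loading everything at once.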
MultiLing EN-NL
The MultiLing data set is based on six English source texts which have been translated into various languages. Four of them (texts 1-4) are news articles and the other two (texts 5-6) are sociological texts from an encyclopedia. The Dutch data consists of two parts: ENDU20, ten Dutch translations of the MultiLing set by ten native Dutch translators with a recent master's degree, and ENDU20-MT, two Dutch machine translations of the MultiLing set by DeepL (P20) and Google Translate (P21).
- Project information and download instructions: https://lt3.ugent.be/resources/multiling-en-nl/
- MultiLing information: https://sites.google.com/site/centretranslationinnovation/tpr-db/public-studies#h.p_iVVuCQOHJx2O
Dutch Government Website Corpus
Parallel (EN-NL) corpus of 6,532 translation units.
- European Language Grid page: https://live.european-language-grid.eu/catalogue/corpus/2877/
Dutch Parallel Corpus 2.0 (DPC2)
The Dutch Parallel Corpus 2.0 is a bidirectional parallel corpus of expert translations for Dutch-English and Dutch-French language pairs. The corpus is sentence-aligned, lemmatized and POS-tagged using the state-of-the-art natural language processing toolkit Stanza. The corpus currently contains 2.7 million words, but is dynamic in nature.
- Access page: https://dpc2.ugent.be/