Parallel corpora

From Clarin K-Centre
Revision as of 08:39, 21 April 2022 by Griet

Parallel corpora are central to translation studies and contrastive linguistics. Many of them are accessible through easy-to-use concordancers, which considerably facilitates the study of interlinguistic phenomena. Such corpora are also a rich source of material for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.

DAESO Corpus

The DAESO Corpus is a parallel monolingual treebank of Dutch texts containing more than 2.1 million words of parallel and comparable text. About 678,000 words were aligned manually and about 1.5 million words were aligned automatically. Each pair of aligned words or phrases is labelled with a semantic relation.

Bible Corpus

A diachronically and synchronically parallel corpus of Bible translations in Dutch, English, German and Swedish, with texts from the 14th century until today.

PacoMT Parallel Corpora

During the STEVIN project PaCo-MT (Parse and Corpus-based Machine Translation), two existing parallel corpora were enriched with syntactic annotations and node alignments. The annotations were generated automatically.

Language Pairs: English to Dutch, Dutch to English, French to Dutch, Dutch to French.

The Dutch Parallel Corpus

The Dutch Parallel Corpus (DPC) is a 10-million-word, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French, with Dutch as the central language.

The corpus contains five different text types and is balanced with respect to text type and translation direction. The entire corpus has been aligned at sentence level and further enriched with linguistic information (lemmas and PoS-tags). A small subset of the Dutch-English part has also been manually aligned at the sub-sentential level.
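As a rough illustration of what sentence-level alignment with lemma and PoS annotation makes possible, the following minimal sketch models aligned sentence pairs and runs a lemma-based bilingual concordance query. The class names, the `concordance` function, the example sentences and the tag labels are all hypothetical and chosen for illustration only; they do not reflect the DPC's actual file format or tagset.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary form
    pos: str    # part-of-speech tag (illustrative labels, not the DPC tagset)


@dataclass
class AlignedPair:
    source: list[Token]  # e.g. a Dutch sentence
    target: list[Token]  # its aligned translation, e.g. English


def concordance(pairs, lemma):
    """Return (source text, target text) for every pair whose source
    sentence contains a token with the given lemma."""
    hits = []
    for pair in pairs:
        if any(tok.lemma == lemma for tok in pair.source):
            hits.append((
                " ".join(tok.form for tok in pair.source),
                " ".join(tok.form for tok in pair.target),
            ))
    return hits


# Invented toy data: three sentence-aligned Dutch-English pairs.
pairs = [
    AlignedPair(
        source=[Token("De", "de", "DET"), Token("huizen", "huis", "NOUN"),
                Token("zijn", "zijn", "VERB"), Token("oud", "oud", "ADJ")],
        target=[Token("The", "the", "DET"), Token("houses", "house", "NOUN"),
                Token("are", "be", "VERB"), Token("old", "old", "ADJ")],
    ),
    AlignedPair(
        source=[Token("Het", "het", "DET"), Token("huis", "huis", "NOUN"),
                Token("is", "zijn", "VERB"), Token("nieuw", "nieuw", "ADJ")],
        target=[Token("The", "the", "DET"), Token("house", "house", "NOUN"),
                Token("is", "be", "VERB"), Token("new", "new", "ADJ")],
    ),
    AlignedPair(
        source=[Token("Zij", "zij", "PRON"), Token("leest", "lezen", "VERB"),
                Token("een", "een", "DET"), Token("boek", "boek", "NOUN")],
        target=[Token("She", "she", "PRON"), Token("reads", "read", "VERB"),
                Token("a", "a", "DET"), Token("book", "book", "NOUN")],
    ),
]

# Searching by lemma finds both "huizen" and "huis".
for dutch, english in concordance(pairs, "huis"):
    print(f"{dutch}  |  {english}")
```

Because the query matches lemmas rather than surface forms, inflected variants are retrieved together with their translations, which is the kind of lookup a parallel concordancer typically offers.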

The Open Parallel Corpus (OPUS)

OPUS is a very large collection of parallel corpora, many of which include Dutch.

COVID-19 - HEALTH Wikipedia dataset. Bilingual (EN-NL)

Bilingual (EN-NL) corpus acquired from Wikipedia in the health and COVID-19 domain (2nd May 2020). The corpus contains 931 translation units.

COVID-19 ANTIBIOTIC dataset. Bilingual (EN-NL)

Bilingual (EN-NL) corpus acquired from the website https://antibiotic.ecdc.europa.eu/. The corpus contains 805 translation units.

COVID-19 EC-EUROPA v1 dataset. Bilingual (EN-NL)

Bilingual (EN-NL) corpus acquired from the website (https://ec.europa.eu/*coronavirus-response) of the EU portal (20th May 2020). This corpus contains 2,391 translation units.

COVID-19 EU presscorner v2 dataset. Bilingual (EN-NL)

Bilingual (EN-NL) corpus acquired from the website (https://ec.europa.eu/commission/presscorner/) of the EU portal (8th July 2020). This corpus contains 6,810 translation units.

COVID-19 EUR-LEX dataset. Bilingual (EN-NL)

Bilingual (EN-NL) corpus acquired from the website (https://eur-lex.europa.eu/legal-content) of the EU portal (9th July 2020). This corpus contains 22,470 translation units.

COVID-19 EUROPARL v2 dataset. Bilingual (EN-NL)

Bilingual (EN-NL) corpus acquired from the website (https://www.europarl.europa.eu/) of the European Parliament (9th May 2020). This corpus contains 887 translation units.

COVID-19 Parallel Global Voices dataset. Bilingual (EN-NL)

Bilingual (EN-NL) COVID-19-related corpus acquired from the website (https://globalvoices.org/) of GlobalVoices (28th April 2020). This corpus contains 675 translation units.


Bilingual corpus from the European Vaccination Portal (NL-EN)

Bilingual (NL-EN) corpus acquired from https://vaccination-info.eu. This corpus contains 494 translation units.

Bilingual corpus from the Publications Office of the EU on the medical domain v.2 (EN-NL)

Bilingual (EN-NL) corpus extracted from the Publications Office of the EU, covering the medical domain. The texts are sourced from laws, studies, EC announcements, etc., and are labelled with concepts such as epidemiology, epidemic, disease surveillance, health control, public hygiene, freedom of movement and distance learning. This corpus contains 13,191 translation units.

Bilingual corpus made out of PDF documents from the European Medicines Agency (EMEA) (EN-NL)

Bilingual (EN-NL) corpus made out of PDF documents from the European Medicines Agency (EMEA), https://www.ema.europa.eu (February 2020). This corpus contains 762,433 translation units.