Parallel corpora
Revision as of 07:16, 7 July 2021
Parallel corpora are central to translation studies and contrastive linguistics. Many parallel corpora are accessible through easy-to-use concordancers, which considerably facilitate the study of interlinguistic phenomena. Such corpora are also a rich source of material for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.
DAESO Corpus
The DAESO Corpus is a parallel monolingual treebank of Dutch texts. It contains more than 2.1 million words of parallel and comparable text. About 678,000 words were aligned manually and about 1.5 million words were aligned automatically. A semantic relation was added to each pair of aligned words or phrases.
- 92.5 MB
- version 1.0 (2010)
- Download page
Bible Corpus
A parallel Bible corpus containing Bibles from the 14th century onwards.
PacoMT Parallel Corpora
During the STEVIN project PaCo-MT (Parse and Corpus-based Machine Translation), two existing parallel corpora were enriched with syntactic annotations and node alignments. The annotations were generated automatically.
Language Pairs: English to Dutch, Dutch to English, French to Dutch, Dutch to French.
- version 1.0
- data set from 2014
- 38.8 MB
- Download page
- Project website
The Dutch Parallel Corpus
The Dutch Parallel Corpus (DPC) is a 10-million-word, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French, with Dutch as the central language.
The corpus contains five different text types and is balanced with respect to text type and translation direction. The entire corpus has been aligned at sentence level and further enriched with linguistic information (lemmas and PoS-tags). A small subset of the Dutch-English part has also been manually aligned at the sub-sentential level.
- Download page: http://hdl.handle.net/10032/tm-a2-h3
- Project website: https://www.kuleuven-kulak.be/dpc/en/
The Open Parallel Corpus (OPUS)
OPUS (https://opus.nlpl.eu/) is a very large collection of parallel corpora, many of which include Dutch.