Other corpora/nl: Difference between revisions
No edit summary |
(Updating to match new version of source page) |
||
Line 19: | Line 19: | ||
* [https://zenodo.org/record/4639616#.Ya4sX9DMLZR Download pagina] | * [https://zenodo.org/record/4639616#.Ya4sX9DMLZR Download pagina] | ||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==CONDIV-corpus== | ==CONDIV-corpus== | ||
The CONDIV-corpus was specifically designed to study the convergence or divergence (hence the name) between Netherlandic Dutch and Belgian Dutch. It contains a synchronous and a diachronous part. To get access to the data, you need to contact [https://www.kuleuven.be/wieiswie/nl/person/00013279 Dirk Speelman at KU Leuven] | The CONDIV-corpus was specifically designed to study the convergence or divergence (hence the name) between Netherlandic Dutch and Belgian Dutch. It contains a synchronous and a diachronous part. To get access to the data, you need to contact [https://www.kuleuven.be/wieiswie/nl/person/00013279 Dirk Speelman at KU Leuven] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
* [https://neon.niederlandistik.fu-berlin.de/static/digitaal/digitaal-11.html Corpus website] | * [https://neon.niederlandistik.fu-berlin.de/static/digitaal/digitaal-11.html Corpus website] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==COREA-coreferentiecorpus== | ==COREA-coreferentiecorpus== | ||
The COREA Coreference Corpus is a corpus of Dutch texts annotated with coreference relations. | The COREA Coreference Corpus is a corpus of Dutch texts annotated with coreference relations. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*version 1.0.1 (2014) | *version 1.0.1 (2014) | ||
*[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/corea_lrec08_en.pdf Paper] | *[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/corea_lrec08_en.pdf Paper] | ||
*[https://corea.tst-centrale.org/ Demo] | *[https://corea.tst-centrale.org/ Demo] | ||
*[http://hdl.handle.net/10032/tm-a2-f9 Download page] | *[http://hdl.handle.net/10032/tm-a2-f9 Download page] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==D-Tuna-corpus== | ==D-Tuna-corpus== | ||
The D-TUNA Corpus consists of 2400 written and (transcribed) spoken referential expressions. | The D-TUNA Corpus consists of 2400 written and (transcribed) spoken referential expressions. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*version 1.0 (2009) | *version 1.0 (2009) | ||
*[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/dtuna_documentatie_en.pdf Paper] | *[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/dtuna_documentatie_en.pdf Paper] | ||
*[http://hdl.handle.net/10032/tm-a2-k5 Download page] | *[http://hdl.handle.net/10032/tm-a2-k5 Download page] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==DBRD== | ==DBRD== | ||
The DBRD (pronounced dee-bird) dataset contains over 110k book reviews of which 22k have associated binary sentiment polarity labels. It is intended as a benchmark for sentiment classification in Dutch. The dataset can be used to train a model for sequence modeling, more specifically language modeling and it can be used to train a model for text classification, more specifically sentiment classification, using the provided positive/negative sentiment polarity labels. | The DBRD (pronounced dee-bird) dataset contains over 110k book reviews of which 22k have associated binary sentiment polarity labels. It is intended as a benchmark for sentiment classification in Dutch. The dataset can be used to train a model for sequence modeling, more specifically language modeling and it can be used to train a model for text classification, more specifically sentiment classification, using the provided positive/negative sentiment polarity labels. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://benjaminvdb.github.io/DBRD/ Home page] | *[https://benjaminvdb.github.io/DBRD/ Home page] | ||
*[https://github.com/benjaminvdb/DBRD GitHub] | *[https://github.com/benjaminvdb/DBRD GitHub] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
== Dutch Audio Description Corpus == | == Dutch Audio Description Corpus == | ||
The Dutch Audio Description corpus includes the transcribed texts of 39 audio described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. This Dutch AD corpus was used to extract a series of quantitative data regarding the language of AD, namely frequency counts of parts of speech, words, lemmas, collocations and the calculation of other relevant text statistics such as reading speed, word and sentence length, text readability and type token ratios (a statistical measure reflecting lexical variety). | The Dutch Audio Description corpus includes the transcribed texts of 39 audio described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. This Dutch AD corpus was used to extract a series of quantitative data regarding the language of AD, namely frequency counts of parts of speech, words, lemmas, collocations and the calculation of other relevant text statistics such as reading speed, word and sentence length, text readability and type token ratios (a statistical measure reflecting lexical variety). | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
* [https://zenodo.org/record/1035175#.YfP7IerMLZR Download page] | * [https://zenodo.org/record/1035175#.YfP7IerMLZR Download page] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==deLearyous== | ==deLearyous== | ||
The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated on the sentence level with their position on Leary's Rose, in function of the two defining dimensions: "dominance", and "affinity". | The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated on the sentence level with their position on Leary's Rose, in function of the two defining dimensions: "dominance", and "affinity". | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
* [https://zenodo.org/record/4643731#.YgKUSurMLZR Download page] | * [https://zenodo.org/record/4643731#.YgKUSurMLZR Download page] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==Dutch Idiom Database: Native Speakers (DID-NS)== | ==Dutch Idiom Database: Native Speakers (DID-NS)== | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
The DID-NS is a database with appreciations by 390 native speakers of 374 Dutch idiomatic expressions. | The DID-NS is a database with appreciations by 390 native speakers of 374 Dutch idiomatic expressions. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
* Version 1.0 (2018) | * Version 1.0 (2018) | ||
*[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/Methodology.pdf Methodology] | *[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/Methodology.pdf Methodology] | ||
*[http://hdl.handle.net/10032/tm-a2-r7 Download page] | *[http://hdl.handle.net/10032/tm-a2-r7 Download page] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==NAMES Corpus == | ==NAMES Corpus == | ||
</div> | |||
The NAMES Corpus is a corpus of Dutch given names and surnames as present in 19th century certificates for birth, marriage and decease. The name variants have been assigned to a standard form. | <div lang="en" dir="ltr" class="mw-content-ltr"> | ||
The NAMES Corpus is a corpus of Dutch given names and surnames as present in 19th century certificates for birth, marriage and decease. The name variants have been assigned to a standard form. | |||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
* Version 1.1 (2020) | * Version 1.1 (2020) | ||
*[http://hdl.handle.net/10032/tm-a2-r6 Download page] | *[http://hdl.handle.net/10032/tm-a2-r6 Download page] | ||
*[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/NAMES-corpus-1.1-manual.pdf Documentation] | *[https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/NAMES-corpus-1.1-manual.pdf Documentation] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==Personae Corpus== | ==Personae Corpus== | ||
The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. The original texts, a syntactically annotated version of the texts, and the metadata are available. | The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. The original texts, a syntactically annotated version of the texts, and the metadata are available. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://zenodo.org/record/4643756#.Yl6GBehBzZQ Download page] | *[https://zenodo.org/record/4643756#.Yl6GBehBzZQ Download page] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==JASMIN-BLISS-Negation== | ==JASMIN-BLISS-Negation== | ||
A corpus sample of Dutch human-computer dialogues annotated with negation cues. | A corpus sample of Dutch human-computer dialogues annotated with negation cues. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://github.com/LanguageMachines/JASMIN-BLISS-Negation Webpage] | *[https://github.com/LanguageMachines/JASMIN-BLISS-Negation Webpage] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
== Multimodal ABEA == | == Multimodal ABEA == | ||
Multimodal dataset that can be used in the context of aspect-based sentiment and emotion detection. It consists of 4,900 comments on 175 images from the Adidas Instagram page and is annotated with both aspect and emotion labels. | Multimodal dataset that can be used in the context of aspect-based sentiment and emotion detection. It consists of 4,900 comments on 175 images from the Adidas Instagram page and is annotated with both aspect and emotion labels. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://lt3.ugent.be/resources/multimodal-abea/ Information page] | *[https://lt3.ugent.be/resources/multimodal-abea/ Information page] | ||
*[https://lt3.ugent.be/media/uploads/tools/Dataset.zip Download] | *[https://lt3.ugent.be/media/uploads/tools/Dataset.zip Download] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==MFAQ (Multilingual corpus of Frequently Asked Questions)== | ==MFAQ (Multilingual corpus of Frequently Asked Questions)== | ||
Parsed from the [https://commoncrawl.org/ Common Crawl]. The corpus contains 6 million pairs of questions and answers in 21 different languages. | Parsed from the [https://commoncrawl.org/ Common Crawl]. The corpus contains 6 million pairs of questions and answers in 21 different languages. | ||
*[https://www.uantwerpen.be/en/research-groups/clips/research/datasets/ Webpage] | *[https://www.uantwerpen.be/en/research-groups/clips/research/datasets/ Webpage] | ||
*[https://aclanthology.org/2021.mrqa-1.1 Paper] | *[https://aclanthology.org/2021.mrqa-1.1 Paper] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==VaccinChatNL== | ==VaccinChatNL== | ||
A Belgian Dutch FAQ dataset on the topic of COVID-19 vaccinations in Flanders. It consists of 12,833 user questions divided over 181 answer labels, thus providing large groups of semantically equivalent paraphrases (a many-to-one mapping of user questions to answer labels). VaccinChatNL is the first Dutch many-to-one FAQ dataset of this size. | A Belgian Dutch FAQ dataset on the topic of COVID-19 vaccinations in Flanders. It consists of 12,833 user questions divided over 181 answer labels, thus providing large groups of semantically equivalent paraphrases (a many-to-one mapping of user questions to answer labels). VaccinChatNL is the first Dutch many-to-one FAQ dataset of this size. | ||
*[https://www.uantwerpen.be/en/research-groups/clips/research/datasets/ Webpage] | *[https://www.uantwerpen.be/en/research-groups/clips/research/datasets/ Webpage] | ||
*[https://aclanthology.org/2022.coling-1.312 Paper] | *[https://aclanthology.org/2022.coling-1.312 Paper] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==MQA (Multilingual corpus of Questions and Answers)== | ==MQA (Multilingual corpus of Questions and Answers)== | ||
Parsed from the [https://commoncrawl.org/ Common Crawl]. The corpus contains 234 million pairs of questions and answers in 39 languages. | Parsed from the [https://commoncrawl.org/ Common Crawl]. The corpus contains 234 million pairs of questions and answers in 39 languages. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://huggingface.co/datasets/clips/mqa Webpage] | *[https://huggingface.co/datasets/clips/mqa Webpage] | ||
*[https://aclanthology.org/2021.mrqa-1.1 Paper] | *[https://aclanthology.org/2021.mrqa-1.1 Paper] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==Dutch Audio Description Corpus== | ==Dutch Audio Description Corpus== | ||
The Dutch Audio Description corpus includes the transcribed texts of 39 audio-described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. The data include the corpus files (XML files) of the transcribed audio descriptions, the multimodal concordancer developed for the project and the raw data extracted from the corpus as part of the PhD project during which this corpus was developed. | The Dutch Audio Description corpus includes the transcribed texts of 39 audio-described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. The data include the corpus files (XML files) of the transcribed audio descriptions, the multimodal concordancer developed for the project and the raw data extracted from the corpus as part of the PhD project during which this corpus was developed. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://doi.org/10.5281/zenodo.1035175 Webpage] | *[https://doi.org/10.5281/zenodo.1035175 Webpage] | ||
*[https://doi.org/10.5281/zenodo.1035175 Paper] | *[https://doi.org/10.5281/zenodo.1035175 Paper] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==Named Entity Recognition CoNLL2002== | ==Named Entity Recognition CoNLL2002== | ||
Spanish and Dutch data with named entity labels. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). For the Dutch data, the annotator has followed the MITRE and SAIC guidelines for named entity recognition (Chinchor et al., 1999) as well as possible. | Spanish and Dutch data with named entity labels. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). For the Dutch data, the annotator has followed the MITRE and SAIC guidelines for named entity recognition (Chinchor et al., 1999) as well as possible. | ||
*[https://huggingface.co/datasets/conll2002 CoNLL2002 Dataset] | *[https://huggingface.co/datasets/conll2002 CoNLL2002 Dataset] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
==CC-100 Corpus== | ==CC-100 Corpus== | ||
This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus. Dutch is one of the languages. | This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus. Dutch is one of the languages. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
*[https://data.statmt.org/cc-100/ Corpus website with download links per language] | *[https://data.statmt.org/cc-100/ Corpus website with download links per language] | ||
*[https://www.aclweb.org/anthology/2020.acl-main.747 Paper webpage] | *[https://www.aclweb.org/anthology/2020.acl-main.747 Paper webpage] | ||
*[https://aclanthology.org/2020.lrec-1.494/ Paper webpage] | *[https://aclanthology.org/2020.lrec-1.494/ Paper webpage] | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
== Dutch Gigacorpus == | == Dutch Gigacorpus == | ||
With 234GB of varied plain text, and no fewer than 40 billion tokens, this is in any case one of the largest Dutch corpora. This corpus is also freely available and the quality is relatively high for its size, care has been taken to ensure that the data is as clean as possible. Also, the corpus contains 400 million forum posts in 10 million threads with their timestamp intact for linguistic research. | With 234GB of varied plain text, and no fewer than 40 billion tokens, this is in any case one of the largest Dutch corpora. This corpus is also freely available and the quality is relatively high for its size, care has been taken to ensure that the data is as clean as possible. Also, the corpus contains 400 million forum posts in 10 million threads with their timestamp intact for linguistic research. | ||
</div> | |||
<div lang="en" dir="ltr" class="mw-content-ltr"> | |||
* [http://gigacorpus.nl/ Project website] | * [http://gigacorpus.nl/ Project website] | ||
* | * | ||
</div> |
Revision as of 11:31, 21 March 2024
BasiLex-corpus
The BasiLex corpus is an annotated collection of texts written for children of primary school age. The corpus contains 13.5 million tokens, of which 11.5 million are words. Roughly 40% of the tokens come from educational materials, 40% from children's literature and 20% from media.
- version 1.0 (2015)
- Tellings, A., Hulsbosch, M., Vermeer, A. & van den Bosch, A. (2015). BasiLex: an 11.5-million words corpus of Dutch texts written for children. Computational Linguistics in the Netherlands Journal 4, 191-208
- Download page
BasiScript-corpus
BasiScript is a corpus of 9 million words of written text produced by Dutch primary school pupils.
- version 1.0 (2015)
- Project page
- Download page
CLiPS Stylometry Investigation (CSI) Corpus
The CSI corpus is a yearly expanded corpus of student texts in two genres: essays and reviews. Its main purpose is stylometry research, but other applications are possible. Extensive metadata is available, both on the author (gender, age, sexual orientation, region of origin, personality profile) and on the document (timestamp, genre, veracity, sentiment, grade). The current version of the corpus was compiled in February 2016. Earlier versions are available from the authors on request by email.
CONDIV-corpus
The CONDIV-corpus was specifically designed to study the convergence or divergence (hence the name) between Netherlandic Dutch and Belgian Dutch. It contains a synchronic and a diachronic part. To get access to the data, you need to contact Dirk Speelman at KU Leuven.
COREA-coreferentiecorpus
The COREA Coreference Corpus is a corpus of Dutch texts annotated with coreference relations.
- version 1.0.1 (2014)
- Paper
- Demo
- Download page
D-Tuna-corpus
The D-TUNA Corpus consists of 2400 written and (transcribed) spoken referential expressions.
- version 1.0 (2009)
- Paper
- Download page
DBRD
The DBRD (pronounced dee-bird) dataset contains over 110k book reviews, of which 22k have associated binary sentiment polarity labels. It is intended as a benchmark for sentiment classification in Dutch. The dataset can be used to train models for sequence modeling (more specifically, language modeling) and, using the provided positive/negative sentiment polarity labels, for text classification (more specifically, sentiment classification).
Dutch Audio Description Corpus
The Dutch Audio Description corpus includes the transcribed texts of 39 audio-described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. This Dutch AD corpus was used to extract a series of quantitative data on the language of AD: frequency counts of parts of speech, words, lemmas and collocations, plus other relevant text statistics such as reading speed, word and sentence length, text readability and type-token ratios (a statistical measure reflecting lexical variety).
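The type-token ratio mentioned above can be illustrated with a minimal sketch. It assumes a very simple tokenizer (lowercased runs of letters); the function name and the sample sentence are hypothetical and are not taken from the corpus or its tooling.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return the number of distinct word forms (types) divided by
    the total number of words (tokens)."""
    # Assumption: a token is any run of letters, compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample sentence: 7 tokens, 5 distinct types.
sample = "de man loopt en de vrouw loopt"
ratio = type_token_ratio(sample)
```

A higher ratio points to more varied vocabulary. Because the ratio tends to fall as texts get longer, comparisons are usually only meaningful between samples of similar length.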
deLearyous
The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated at the sentence level with their position on Leary's Rose, according to its two defining dimensions: "dominance" and "affinity".
Dutch Idiom Database: Native Speakers (DID-NS)
The DID-NS is a database with ratings by 390 native speakers of 374 Dutch idiomatic expressions.
- Version 1.0 (2018)
- Methodology
- Download page
NAMES Corpus
The NAMES Corpus is a corpus of Dutch given names and surnames as they appear in 19th-century birth, marriage and death certificates. The name variants have been assigned to a standard form.
- Version 1.1 (2020)
- Download page
- Documentation
Personae Corpus
The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. The original texts, a syntactically annotated version of the texts, and the metadata are available.
JASMIN-BLISS-Negation
A corpus sample of Dutch human-computer dialogues annotated with negation cues.
Multimodal ABEA
A multimodal dataset for aspect-based sentiment and emotion detection. It consists of 4,900 comments on 175 images from the Adidas Instagram page and is annotated with both aspect and emotion labels.
MFAQ (Multilingual corpus of Frequently Asked Questions)
Parsed from the Common Crawl. The corpus contains 6 million pairs of questions and answers in 21 different languages.
VaccinChatNL
A Belgian Dutch FAQ dataset on the topic of COVID-19 vaccinations in Flanders. It consists of 12,833 user questions divided over 181 answer labels, thus providing large groups of semantically equivalent paraphrases (a many-to-one mapping of user questions to answer labels). VaccinChatNL is the first Dutch many-to-one FAQ dataset of this size.
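The many-to-one mapping described above can be sketched as grouping user questions under their shared answer label. The rows and label names below are invented for illustration and are not taken from VaccinChatNL; the function name is hypothetical as well.

```python
from collections import defaultdict

# Invented example rows of (user question, answer label); not real dataset content.
rows = [
    ("Is het vaccin veilig?", "faq_safety"),
    ("Hoe veilig is de prik?", "faq_safety"),
    ("Waar kan ik mij laten vaccineren?", "faq_location"),
]


def group_by_label(pairs):
    """Collect all questions that map to the same answer label."""
    groups = defaultdict(list)
    for question, label in pairs:
        groups[label].append(question)
    return dict(groups)


paraphrases = group_by_label(rows)
```

Each value in `paraphrases` is a group of semantically equivalent questions, which is the structure that makes such a dataset usable for paraphrase-oriented training or evaluation.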
MQA (Multilingual corpus of Questions and Answers)
Parsed from the Common Crawl. The corpus contains 234 million pairs of questions and answers in 39 languages.
Dutch Audio Description Corpus
The Dutch Audio Description corpus includes the transcribed texts of 39 audio-described Dutch films and TV series, in total 154,570 words and 3,074 minutes of video. The data include the corpus files (XML files) of the transcribed audio descriptions, the multimodal concordancer developed for the project and the raw data extracted from the corpus as part of the PhD project during which this corpus was developed.
Named Entity Recognition CoNLL2002
Spanish and Dutch data with named entity labels. The Dutch data consist of four editions of the Belgian newspaper "De Morgen" from 2000 (June 2, July 1, August 1 and September 1). For the Dutch data, the annotator followed the MITRE and SAIC guidelines for named entity recognition (Chinchor et al., 1999) as closely as possible.
CC-100 Corpus
This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Common Crawl snapshots. Each file consists of documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus. Dutch is one of the languages.
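The file layout described above (a blank line between documents, one paragraph per line) can be read with a short streaming parser. This is a minimal sketch under the assumption that the file has already been decompressed to plain UTF-8 text; the function name and the sample text are hypothetical.

```python
import io
from typing import Iterator, List, TextIO


def iter_documents(handle: TextIO) -> Iterator[List[str]]:
    """Yield each document as a list of paragraphs.

    Assumes one paragraph per line and a blank line between documents.
    """
    paragraphs: List[str] = []
    for line in handle:
        line = line.rstrip("\n")
        if line:
            paragraphs.append(line)
        elif paragraphs:
            # A blank line closes the current document.
            yield paragraphs
            paragraphs = []
    if paragraphs:
        # The last document may not be followed by a blank line.
        yield paragraphs


# Hypothetical in-memory sample standing in for a real per-language file.
sample = io.StringIO("Eerste alinea.\nTweede alinea.\n\nNieuw document.\n")
docs = list(iter_documents(sample))
```

Reading line by line keeps memory use flat, which matters because the per-language files can be very large.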
Dutch Gigacorpus
With 234 GB of varied plain text and no fewer than 40 billion tokens, this is one of the largest Dutch corpora. The corpus is freely available, and its quality is relatively high for its size: care has been taken to keep the data as clean as possible. The corpus also contains 400 million forum posts in 10 million threads, with their timestamps intact for linguistic research.