Wordlists

From Clarin K-Centre
<languages/>


<translate>
== Woordenlijst van de Nederlandse Taal == <!--T:1-->
<!--T:2-->
Since 1804, Dutch spelling has been laid down by the government. The official spelling comprises basic principles and specific rules, such as those for spelling vowels and consonants, the use of capital letters and special characters (accents, hyphens, punctuation marks and apostrophes), the spelling of compounds with a linking sound (for example, the Dutch words for pancake and briefcase) and the division of words into syllables. In addition, the government publishes a list of words spelled according to the rules, together with words whose spelling is difficult to derive from the rules, such as loanwords from other languages.


<!--T:3-->
At the end of 2015, the Woordenlijst van de Nederlandse Taal contained over 180,000 headwords. All of these words can be looked up in the online version (woordenlijst.org), along with detailed data on hyphenation, inflection and conjugation.


<!--T:4-->
* [https://woordenlijst.org/#/ Woordenlijst.org]


<!--T:5-->
== Subtlex NL ==
SUBTLEX-NL is a database of Dutch word frequencies based on 44 million words from film and television subtitles.


<!--T:6-->
* [http://crr.ugent.be/programs-data/subtitle-frequencies/subtlex-nl Project page]
* [https://osf.io/3d8cx/ Download page]
* Reference: Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42(3), 643-650.
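
Raw counts from a subtitle corpus are usually easier to compare once they are scaled to occurrences per million words. The snippet below is a minimal sketch of that conversion and is not part of SUBTLEX-NL itself: the function name <code>per_million</code>, the example count, and the rounded corpus size of 44 million words are assumptions made for illustration.

```python
# Approximate corpus size, taken from the description above (an assumption
# for this sketch; use the exact token total of the release you download).
CORPUS_SIZE = 44_000_000


def per_million(count, corpus_size=CORPUS_SIZE):
    """Scale a raw occurrence count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


# Hypothetical raw count for some word form.
example_frequency = per_million(4400)
```

With these assumed numbers, a word seen 4,400 times corresponds to 100 occurrences per million words.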


<!--T:7-->
==CombiLex==
CombiLex is a list of Dutch lemmas and word forms without further annotation. The lexicon contains over 213,000 unique lemmas and over 442,000 unique lemmas and word forms combined.


<!--T:8-->
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/clex_documentatie_en.pdf Documentation]  
* [http://hdl.handle.net/10032/tm-a2-k2 Download page]


<!--T:9-->
== INT Historische Woordenlijst ==
The INT Historical Wordlist consists of two lists, each containing approximately 500,000 historical word forms from the period ca. 1550 to ca. 1970. The lists are intended for OCR and OCR post-correction.


<!--T:10-->
* [https://cordis.europa.eu/project/id/215064 Project information]
* [http://hdl.handle.net/10032/tm-a2-a6 Download page]
* [https://ivdnt.org/images/stories/producten/Does-Depuydt-2012_v6.pdf Evaluation paper]


== CHN N-grams == <!--T:11-->


<!--T:12-->
This resource contains n-grams of lengths one, two, and three, together with their frequencies, from the Corpus of Contemporary Dutch (CHN).


<!--T:13-->
* Version 1.0 (2019)
* [http://hdl.handle.net/10032/tm-a2-p6 Download page]
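
For readers unfamiliar with the term, an n-gram is a contiguous sequence of n tokens, and an n-gram frequency is the number of times that sequence occurs. The snippet below is a minimal sketch of how such counts could be computed. It does not describe how the CHN n-grams were actually produced, and the helper name <code>ngram_counts</code> and the toy sentence are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in a token list."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Toy input (hypothetical); a real corpus would be tokenized text.
tokens = "de kat zit op de mat en de kat slaapt".split()

unigrams = ngram_counts(tokens, 1)  # n-grams of length one
bigrams = ngram_counts(tokens, 2)   # n-grams of length two
trigrams = ngram_counts(tokens, 3)  # n-grams of length three
```

In this toy example, <code>unigrams[("de",)]</code> is 3 and <code>bigrams[("de", "kat")]</code> is 2.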


<!--T:14-->
== Middle Dutch syllabified words==
This wordlist contains 43,710 syllabified Middle Dutch words, which is the total number of unique words in the Corpus Van Reenen-Mulder. This corpus, created by Pieter van Reenen and Maaike Mulder at the Free University Amsterdam, contains about 2,500 Middle Dutch charters and about 750,000 tokens. The charters were written in the Netherlands and Flanders between 1300 and 1400.


<!--T:15-->
*[https://zenodo.org/record/2402048#.YjikzjXvJNZ Download page]


<!--T:16-->
== RND Woordenlijsten ==
The RND Word Lists contain phonetic transcriptions of dialect words collected in the Netherlands and Belgium. They were originally published in the "Reeks Nederlandse Dialectatlassen".  


<!--T:17-->
* Version 1.1 (2021)
* [https://taalmaterialen.ivdnt.org/wp-content/uploads/documentatie/info.pdf Documentation (in Dutch)]
* [http://hdl.handle.net/10032/tm-a2-t6 Download page]


<!--T:18-->
== Cognates NL-EN-FR==
This resource is a gold standard for cognate pairs in English-Dutch and French-Dutch.
Reference: Labat, S. and Lefever, E. (2020). Gold Standard for Cognate Pairs in English-Dutch and French-Dutch. LT3, Ghent University, 1.0, ISLRN 288-099-424-255-6


<!--T:19-->
* [https://lt3.ugent.be/resources/cognates-nl-fr-en/ Website]
* [https://lt3.ugent.be/media/uploads/tools/Cognate_GS_eM67Zdk.zip Download]
<!--T:20-->
== Basiswoordenlijst Amsterdamse Kleuters ==
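The Basiswoordenlijst Amsterdamse Kleuters (Basic Wordlist for Amsterdam Kindergartners) consists of 3,000 words: 2,000 basic words and 1,000 extension words. The wordlist is divided into words for group 1 and words for group 2, the first two years of Dutch primary school.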
<!--T:21-->
* [https://woorden.wiki.kennisnet.nl/Baklijsten Website]
</translate>

Latest revision as of 12:27, 26 March 2024
