<languages/>
<div lang="en" dir="ltr" class="mw-content-ltr">
This page lists the questions we received.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I get access to CLARIN tools and resources, without an academic account?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
It is possible to ask for an account at the [https://idm.clarin.eu/unitygw/pub#!registration-CLARIN%20Identity%20Registration CLARIN Account Registration] page.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Do you have any domain specific corpora?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
On the [https://kdutch.ivdnt.org/wiki/K-Dutch#Corpora main page] you will find a listing of the different types of corpora we have. Domain-specific corpora include the [[Parliamentary corpora]] and the [[Corpora of academic texts]]. The [[Parallel corpora]] also include domain-specific corpora.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there literary texts available?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
From the [https://kdutch.ivdnt.org/wiki/Historical_corpora#Public_Domain_Data_.40_DBNL Public Domain Page] you can find a link to the downloadable public domain files in DBNL.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Is there a speech recognition engine available for Belgian Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Since April 2022, a new ASR engine has been available that is specifically suited to speech recognition for Belgian Dutch.
*[https://www.spraak.org/webservice/dutch_asr/ Online webservice] TEMPORARILY UNAVAILABLE
*[https://clinjournal.org/clinj/article/view/119 Scientific publication about speech recognition engine]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
See also the page dedicated to [[Speech_recognition]] systems.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Which corpora are available for Automatic Simplification for Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
There are currently no parallel corpora available in which regular Dutch has been simplified, so this makes it impossible to straightforwardly treat this as a machine translation problem.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
If you are considering developing a form of unsupervised simplification, however, a number of corpora are available that can be considered to be written in a form of easy Dutch. These corpora are the [http://hdl.handle.net/10032/tm-a2-q6 Wablieft-corpus] (Easy Belgian Dutch), the [http://hdl.handle.net/10032/tm-a2-n4 Basilex-corpus] (Texts for children in Dutch primary schools), and [http://hdl.handle.net/10032/tm-a2-t9 WAI-NOT] (Very easy Belgian Dutch).
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there any corpora that contain dialogues between two or more people?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The Spoken Dutch Corpus (CGN) contains a number of dialog components:
* a. Spontaneous conversations ('face-to-face')
* c. Telephone dialogs recorded via a platform
* d. Telephone dialogs recorded with a minidisc recorder
* e. Business negotiations
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* Download: https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/
* Online search: https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
There is also the [https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/ IFA Dialog Video corpus].
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
"A collection of annotated video recordings of friendly Face-to-Face dialogs. It is modelled on the Face-to-Face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus 20 dialog conversations of 15 minutes were recorded and annotated, in total 5 hours of speech."
* Download: https://taalmaterialen.ivdnt.org/download/tstc-ifa-dialoog-videocorpus/
* Online data: https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Advice for finding financial support for compiling a medical comparable corpus English-Dutch==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We are looking into whether this is fundable by [https://www.clarin.eu/content/clarin-resource-families-project-funding CLARIN Resource Families Project Funding]. The site indicates that it is best to first submit the idea informally to the CLARIN office, so they can advise us ("In view of the flexible nature of this call, applicants are encouraged to send in a project idea beforehand, in order to allow CLARIN Office to give additional guidelines and assess the eligibility of plans.")
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We would need to be clear though as to whether this is a parallel corpus, which is one of the categories in the Resource Families, or whether it is a comparable corpus, which is not one of the categories. We might ask the CLARIN office whether they think it would be useful to add such a category.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We would have to identify a number of potential data sources, and make sure we can make the collected data publicly available for research, without GDPR or IP issues.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We should be aware of the [https://opus.nlpl.eu/EMEA/corpus/version/EMEA EMEA corpus in OPUS]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Is it possible to automate finding of word conversions for specific corpora?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Do you think it is possible to draw up a list of conversion pairs in Dutch, i.e. words that can be used as more than one part of speech, on the basis of corpora (or possibly treebanks)? I am particularly concerned with the parts of speech noun, adjective, and verb. So, for example, the search algorithm should be able to identify the bold words in the following examples as conversion pairs:
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* ik '''douche''' / ik neem een '''douche'''
* wij '''geloven''' in iets / zijn '''geloof''' in iets
* de '''crimineel''' zweert zijn '''criminele''' gedrag af
* wij '''onderhielden''' contacten / het '''onderhoud''' van het huis
* wij '''droogden''' het '''droge''' laken
* we '''trommelden''' op de '''trommel'''
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Answer: The [https://taalmaterialen.ivdnt.org/download/tstc-e-lex/ e-Lex lexicon] allows you to search for word forms with multiple POS tags, as you ask. This lexicon is based on CGN.
But your question goes a little further, I think. The verb '''geloven''' has the lemma '''geloven''', and its conversion to a noun has the lemma '''geloof'''. So we should check whether the noun's lemma also occurs as a verb form, and likewise for adjectives. A Perl script was written that extracts the requested sets from the lexicon file; the results were sent to the requester.
</div>
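
<div lang="en" dir="ltr" class="mw-content-ltr">
The lemma-based check described in the answer can be sketched as follows. This is a minimal Python illustration, not the Perl script that was actually used. The <code>(word form, lemma, POS)</code> tuples, the tag names <code>N</code> and <code>WW</code>, the function name and the toy entries are all assumptions made for the example; the real e-Lex file layout has to be checked before adapting it.

```python
from collections import defaultdict


def find_noun_verb_conversions(entries):
    """Return sorted (noun lemma, verb lemma) pairs where the noun lemma
    also occurs as a word form of the verb.

    `entries` is an iterable of (word_form, lemma, pos) tuples. This layout
    and the POS labels "N" and "WW" are assumptions made for this sketch.
    """
    verb_lemmas_by_form = defaultdict(set)
    noun_lemmas = set()
    for form, lemma, pos in entries:
        if pos == "WW":
            verb_lemmas_by_form[form].add(lemma)
        elif pos == "N":
            noun_lemmas.add(lemma)

    pairs = set()
    for noun in noun_lemmas:
        for verb in verb_lemmas_by_form.get(noun, ()):
            pairs.add((noun, verb))
    return sorted(pairs)


# Toy entries modelled on the examples in the question (hypothetical data).
TOY_LEXICON = [
    ("douche", "douchen", "WW"),
    ("douchen", "douchen", "WW"),
    ("douche", "douche", "N"),
    ("geloof", "geloven", "WW"),
    ("geloven", "geloven", "WW"),
    ("geloof", "geloof", "N"),
    ("huis", "huis", "N"),
]

if __name__ == "__main__":
    for noun, verb in find_noun_verb_conversions(TOY_LEXICON):
        print(noun, verb)
```

The same lookup can be repeated with adjective lemmas in place of noun lemmas to cover pairs such as '''droge''' / '''droogden'''.
</div>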

<div lang="en" dir="ltr" class="mw-content-ltr">
==I am looking for spoken and written corpora for a contrastive study German/Dutch in which I can find actual word forms==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We refer you to [http://opensonar.ivdnt.org/ OpenSonar], which is the only search engine for both the Spoken Dutch Corpus (CGN) and the SoNaR reference corpus and is available with a CLARIN login. An alternative is the [http://chn.ivdnt.org/ Corpus Hedendaags Nederlands (CHN) website], the online search engine for the Corpus of Contemporary Dutch (CHN). If you need more recent data, we have a monitor corpus with weekly newspaper dumps at our disposal at INT, in which we can run searches for you. Unfortunately, we cannot make this monitor corpus available due to IP restrictions.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==I want to find all Dutch lemmas in which there is double derivation, can you help me?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The [http://hdl.handle.net/10032/tm-a2-h2 e-Lex lexicon] contains the morphology of lemmas as its third data field. We extracted all rows in the data in which the sign for derivation (|) occurs twice in a row and provided our user with a detailed list of entries and how often they occur in e-Lex.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Can I get a distribution of the suffixes on Dutch adjectives?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The [http://hdl.handle.net/10032/tm-a2-h2 e-Lex lexicon] contains the morphology of lemmas as its third data field. For each lemma id that is an adjective, we took the last suffix and counted how often each suffix occurs. Lemmas without morphology were assigned the category '0'.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
<pre>
0	11781
(ig)	1781
(achtig)	507
(baar)	473
(isch)	431
(elijk)	392
(end)	367
(en)	292
(s)	278
(lijk)	237
(erig)	229
(ief)	168
(aal)	155
(loos)	138
(d)	116
...
</pre>
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==I want to use the CGN wave files, but found a dead link on the original website==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
There is a permanent link for the CGN wave file download page: [http://hdl.handle.net/10032/tm-a2-k6 http://hdl.handle.net/10032/tm-a2-k6]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Which treebanks are available for Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We have added the [[Treebanks]] page to this wiki to answer this question.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Is there a corpus with imperative sentences?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
No dedicated corpus of imperative sentences is available. If we provide [https://kdutch.ivdnt.org/wiki/Treebank_querying#GrETEL GrETEL] with an imperative example, we can extract similar sentences, which should be usable as an imperative corpus.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==We will run some topic modeling analyses on some Flemish/Belgian Dutch data we have. Because our data set is relatively small for this kind of task, the idea is to train the topic model on a much larger corpus (e.g. social media posts). Do you know of any such corpus that might be available? ==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Take a look at the [https://kdutch.ivdnt.org/wiki/K-Dutch#Corpora corpora listing on the main page].
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I calculate the readability of Dutch text?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
There is a tool called [https://tscan.hum.uu.nl/tscan/ T-scan] that may be helpful there.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I calculate Flesch-Douma for Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The formula for Flesch-Douma requires two things to be counted: the number of words per sentence and the number of syllables per word. While the number of words in a sentence is easily counted with any scripting language, the number of syllables may seem more difficult. The [http://hdl.handle.net/10032/tm-a2-h2 e-Lex] lexicon contains hyphenation patterns and hence the number of syllables per word. An alternative is to count the number of vowel clusters in each word using regular expressions, which should give you an approximation of the number of syllables.
</div>
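
<div lang="en" dir="ltr" class="mw-content-ltr">
The vowel-cluster approach can be sketched in Python as follows. The constants (206.84, 0.77, 0.93) are the ones commonly cited for Douma's Dutch adaptation of the Flesch formula and should be verified against your own reference. The tokenisation and sentence splitting are deliberately naive, and the function names are only illustrative.

```python
import re

# Vowel clusters as a rough proxy for syllables. This is an approximation:
# it miscounts words such as "geïnteresseerd" and some loanwords.
VOWEL_CLUSTER = re.compile(r"[aeiouyáàâäéèêëíìîïóòôöúùûü]+", re.IGNORECASE)
# A word is a run of letters, optionally with internal hyphens or apostrophes.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")
SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word):
    """Approximate the syllable count by counting vowel clusters (at least 1)."""
    return max(1, len(VOWEL_CLUSTER.findall(word)))


def flesch_douma(text):
    """Return the Flesch-Douma reading-ease score for `text`."""
    sentences = [s for s in SENTENCE_END.split(text) if WORD.search(s)]
    words = WORD.findall(text)
    if not sentences or not words:
        raise ValueError("text contains no words")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_100_words = 100 * syllables / len(words)
    # Constants as commonly cited for Flesch-Douma (assumption: verify them).
    return 206.84 - 0.77 * syllables_per_100_words - 0.93 * words_per_sentence


if __name__ == "__main__":
    print(flesch_douma("De kat zit op de mat. Hij slaapt."))
```

For more reliable syllable counts, the vowel-cluster function can be replaced by a lookup in the e-Lex hyphenation patterns.
</div>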

<div lang="en" dir="ltr" class="mw-content-ltr">
==I am looking for a parallel corpus of Dutch-Turkish texts.==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We are comparing the Dutch and Turkish translations of the Linguistic Inquiry and Word Count [LIWC] dictionaries. Do you know of any corpora that would be suitable?
I found several candidates on OPUS (https://opus.nlpl.eu/), and downloaded the TED2020 talks. However these are .xml files with paragraph/line IDs and I need .txt files. Would you have a script or a way to automatically recode them and remove the unnecessary tags?
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We would also refer you to OPUS. You can get parallel plain-text files there by downloading the Moses format: you then get a zip file containing a .nl and a .tr file, which are sentence-aligned, i.e. the same line number in the two files should hold translations of each other.
</div>
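
<div lang="en" dir="ltr" class="mw-content-ltr">
Reading such a line-aligned file pair could look like the following Python sketch. The function name and the UTF-8 default are assumptions for illustration, and the file names in the usage note are hypothetical.

```python
from itertools import zip_longest


def read_aligned_pairs(nl_path, tr_path, encoding="utf-8"):
    """Yield (dutch, turkish) sentence pairs from two line-aligned files.

    Line i of the first file is paired with line i of the second file.
    A ValueError is raised if the files do not have the same number of lines.
    """
    with open(nl_path, encoding=encoding) as nl_file, \
            open(tr_path, encoding=encoding) as tr_file:
        pairs = zip_longest(nl_file, tr_file)
        for line_no, (nl, tr) in enumerate(pairs, start=1):
            if nl is None or tr is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield nl.rstrip("\n"), tr.rstrip("\n")
```

With hypothetical file names, <code>for nl, tr in read_aligned_pairs("corpus.nl", "corpus.tr"): ...</code> iterates over the sentence pairs without loading the whole corpus into memory.
</div>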

<div lang="en" dir="ltr" class="mw-content-ltr">
==Do you know, are there any reasonable sentiment analysis algorithms/approaches for Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We have now added a page on [[sentiment analysis]] to this wiki.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there any spoken corpora available of spontaneous speech with time stamped transcriptions, which are freely available?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
On the [https://kdutch.ivdnt.org/wiki/Spoken_corpora Spoken corpora] page we have collected what is available for Dutch.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The Corpus Gesproken Nederlands (CGN) has a section of spontaneous speech, with time-stamped transcriptions, freely available.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I combine search for "green" and "red" word order in OpenSonar?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
In principle it is possible to ask for both orders at the same time, see the example for more info.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
*[https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search/hits?filter=Corpus_title%3A%28%22CGN%22%29&first=0&number=20&patt=%5Blemma%3D%22hebben%7Czijn%22+%26+pos_head%3D%22ww%22%26pos_wvorm%3D%22pv%22%5D%5Bpos_head%3D%22ww%22%26pos_wvorm%3D%22vd%22%5D+%7C+%5Bpos_head%3D%22ww%22%26pos_wvorm%3D%22vd%22%5D%5Blemma%3D%22hebben%7Czijn%22+%26+pos_head%3D%22ww%22%26pos_wvorm%3D%22pv%22%5D&interface=%7B%22form%22%3A%22search%22%2C%22patternMode%22%3A%22expert%22%7D Example]
</div>
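
<div lang="en" dir="ltr" class="mw-content-ltr">
The query inside the example link is URL-encoded and therefore hard to read. The Python sketch below decodes the <code>patt</code> parameter copied from that link; the constant and function names are only illustrative.

```python
from urllib.parse import parse_qs

# The "patt" parameter copied from the example link.
ENCODED_QUERY = (
    "patt=%5Blemma%3D%22hebben%7Czijn%22+%26+pos_head%3D%22ww%22%26pos_wvorm%3D%22pv%22%5D"
    "%5Bpos_head%3D%22ww%22%26pos_wvorm%3D%22vd%22%5D+%7C+"
    "%5Bpos_head%3D%22ww%22%26pos_wvorm%3D%22vd%22%5D"
    "%5Blemma%3D%22hebben%7Czijn%22+%26+pos_head%3D%22ww%22%26pos_wvorm%3D%22pv%22%5D"
)


def decode_pattern(query_string):
    """Return the decoded value of the 'patt' parameter of a query string."""
    return parse_qs(query_string)["patt"][0]


if __name__ == "__main__":
    print(decode_pattern(ENCODED_QUERY))
```

The decoded pattern consists of two alternatives joined by <code>|</code>: a finite form (<code>pv</code>) of ''hebben'' or ''zijn'' followed by a past participle (<code>vd</code>), which is the red order, and the same two tokens in reverse, which is the green order.
</div>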

<div lang="en" dir="ltr" class="mw-content-ltr">
==Can you give advice on setting up a transcription process for a spoken language corpus?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
A meeting was held in which we discussed the use of speech recognition, segmentation, speaker diarisation, and post-editing of speech recognition. We have given the advice to include K-Dutch into the project proposal so that K-Dutch can take care of converting ASR output to ELAN tiers, merging of ELAN tiers etc.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==What are the character n-gram frequencies for Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
We have counted the n-gram frequencies up to trigrams and made them available at [[Character_N-grams]].
</div>
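
<div lang="en" dir="ltr" class="mw-content-ltr">
If you want to compute comparable counts on your own data, a minimal Python sketch is given below. It is not necessarily how the published counts were produced: lower-casing, whitespace tokenisation and counting n-grams only within words are assumptions of this example.

```python
from collections import Counter


def char_ngram_counts(text, max_n=3):
    """Count character n-grams of length 1..max_n within each word.

    Returns a dict that maps n to a Counter of n-grams of that length.
    N-grams never span a word boundary in this sketch.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for word in text.lower().split():
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[n][word[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    counts = char_ngram_counts("de deur van het huis")
    for n, counter in counts.items():
        print(n, counter.most_common(3))
```
</div>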

<div lang="en" dir="ltr" class="mw-content-ltr">
==Is there a user-friendly interface for working with the EMEA part of the Lassy-Large treebank?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Lassy Large is extremely large and is therefore not entirely available through online query tools such as GrETEL and PaQu, although the latter provides access to the newspaper part. One suggestion is to download the data and import it into an XML database engine such as [https://basex.org/ BaseX], which will allow you to query it with XPath and XQuery.
</div>
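
<div lang="en" dir="ltr" class="mw-content-ltr">
To give an idea of what such an XPath query looks like, the Python sketch below runs one on a toy tree. The sample XML is hypothetical and heavily simplified; the attribute names <code>pt</code>, <code>lemma</code>, <code>word</code>, <code>rel</code> and <code>cat</code> are assumed to follow the Alpino/Lassy XML format and should be checked against the downloaded data. Python's ElementTree supports only a small XPath subset, so for the full corpus a database engine remains the better option.

```python
import xml.etree.ElementTree as ET

# A toy tree in the style of Alpino/Lassy XML (hypothetical, simplified).
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pt="vnw" lemma="ik" word="Ik"/>
      <node rel="hd" pt="ww" lemma="nemen" word="neem"/>
      <node cat="np" rel="obj1">
        <node rel="det" pt="lid" lemma="een" word="een"/>
        <node rel="hd" pt="n" lemma="douche" word="douche"/>
      </node>
    </node>
  </node>
  <sentence>Ik neem een douche</sentence>
</alpino_ds>"""


def lemmas_with_pt(xml_text, pt):
    """Return the lemmas of all nodes whose 'pt' attribute equals `pt`."""
    root = ET.fromstring(xml_text)
    return [node.get("lemma") for node in root.iterfind(f".//node[@pt='{pt}']")]


if __name__ == "__main__":
    print(lemmas_with_pt(SAMPLE, "n"))
```

The path <code>.//node[@pt='n']</code> selects every noun node at any depth; similar path expressions can be used in an XML database once the data has been imported.
</div>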

<div lang="en" dir="ltr" class="mw-content-ltr">
==What are good POS taggers for non-standard Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
See the [[Basic_language_processing]] page.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there any lexical profiling tools for Dutch?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The request is whether there are any user friendly and freely available tools with which teachers can assess the lexical profile of a text (to which frequency levels do the words belong, how many of the most frequent words should a reader know to understand 95% etc.)
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
While we are not aware of any tools that do such a thing explicitly, there are a number of tools that go partly that way.
We suggest taking a look at
*[https://lint.gebruikercentraal.nl/over-lint/ LINT] which assesses the readability of a text
*[https://tscan.hum.uu.nl/tscan/ T-scan] which is what LINT is based on, and which is a bit less user friendly
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
There are also a number of tools that can be found at 
[https://ilt.kuleuven.be/aanbod/index.php Instituut voor Levende Talen]
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
== Is there a word list of difficult words for Dutch? ==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
While this question is about ''difficult'' words, we mainly have word lists of easy words at our disposal:
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* There is the BasiLex lexicon that contains words aimed at children up to 12 years old 
* There is the BasiScript lexicon that contains words produced by children up to 12 years old
* There are frequency lists available for a number of different corpora, each listing the 5000 most frequent words: http://hdl.handle.net/10032/tm-a2-f8
* The NT2Lex provides a lexicon with words and how they occur in texts aimed at specific CEFR levels.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
If a word does not occur in a list of easy words, that could be used as a measure of word difficulty. However, difficulty is determined by more than just the words used; syntactic complexity also plays a role.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
You could use e.g. the T-scan tool that measures complexity of texts.
</div>
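
<div lang="en" dir="ltr" class="mw-content-ltr">
The word-list idea mentioned above can be sketched in Python as follows. The function name and the tiny word list are hypothetical; in practice the list would be loaded from one of the lexicons or frequency lists mentioned above. The sketch compares word forms, not lemmas, so inflected forms of easy words may be flagged as well.

```python
import re


def words_not_in_list(text, easy_words):
    """Return the distinct lower-cased words of `text` that are absent from
    `easy_words`, in order of first appearance."""
    easy = {w.lower() for w in easy_words}
    seen = set()
    result = []
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in easy and word not in seen:
            seen.add(word)
            result.append(word)
    return result


if __name__ == "__main__":
    easy = ["de", "kat", "zit", "op", "mat"]  # hypothetical easy-word list
    print(words_not_in_list("De kat bestudeert de morfologie.", easy))
```
</div>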

<div lang="en" dir="ltr" class="mw-content-ltr">
== Is there a text-only version of the Spoken Dutch Corpus? ==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The download files of the Spoken Dutch Corpus (CGN) do not include a text-only version. The <code>ort</code> files contain orthographic transcriptions and timestamps, and the <code>plk</code> files contain part-of-speech and lemma information. The following Perl script takes a list of plk files as input and prints the text. If you run it from the command line and redirect the output, you can create text files.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
<pre>
binmode(STDOUT, ":utf8");
# Assumption: a line starting with "<" marks the start of a new utterance;
# all other lines are tab-separated: word, tag, lemma.
while (my $in = shift(@ARGV)) {
    print STDERR "Processing $in\n";
    open(my $fh, "<", $in) or die "Cannot open $in: $!";
    my @words = ();
    while (<$fh>) {
        chomp;
        if (/^</) {
            print join(" ", @words) . "\n" if (@words > 0);
            @words = ();
        } else {
            my ($word, $tag, $lem) = split(/\t/);
            push(@words, $word) if defined $word;
        }
    }
    print join(" ", @words) . "\n" if (@words > 0);  # flush the last utterance
}
</pre>
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==Where can I find when each recording was made for the Spoken Dutch Corpus?==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The CGN includes a metadata spreadsheet that lists the year in which each recording was made.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==CGN and Frequentielijsten_corpora_4.0.1==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
The Frequentielijsten_corpora documentation states (translated from Dutch):
"The product Frequentielijsten Corpora is a collection of lists of the 5000 most frequent words and their frequencies in a number of corpora that are available from the TST-Centrale."
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Inspecting CGN.woordvorm.txt, I see that the word “is” has two entries:
is 141417
is/uncertain 404
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
“The Spoken Dutch Corpus (CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech from speakers in Flanders and the Netherlands.”
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
How can the word “is” have only a frequency of 141417 in a list of almost 9 million words?
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
In SONAR500.wordfreqlist.1-gram.total.top5000.tsv, I see the following:
is 5736376 122784975 22.9366
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
This seems more realistic than 141417.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Can anyone explain the CGN frequency number?
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
'''Answer''':
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
* I assume that is/uncertain marks cases that were hard to understand (as this is a transcribed spoken corpus) and were therefore transcribed as is/uncertain.
* I’ve double checked the CGN.woordvorm.txt frequency of “is” in the online version of CGN: [https://portal.clarin.ivdnt.org/opensonar_frontend/opensonar/search/hits?filter=Corpus_title%3A%28%22CGN%22%29&first=0&group=hit%3Aword%3Ai&number=20&patt=%5B%5D&interface=%7B%22form%22%3A%22explore%22%2C%22exploreMode%22%3A%22ngram%22%7D Here]. There the number is the same, and it is 1.41% of all tokens.
* The numbers in SONAR500 are the frequency of the word, the cumulative frequency, and the cumulative relative frequency. SONAR500 contains about 500 million tokens. The total frequency of “is” amounts to 1.09% of the total corpus, which is actually quite similar.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
So it seems to me that these numbers are correct and that the word form “is” accounts for 1.41% of the tokens in spoken Dutch and 1.09% in written Dutch.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
==I'm interested in finding out the percentage of common nouns that are het words versus de words.==
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
If possible, I'd like to know what that distribution is for the 1000 most commonly used nouns, the 2000 most commonly used nouns, and the 3000 most commonly used nouns.
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
'''Answer''':
Go to https://opensonar.ivdnt.org, open the Explore tab, and use the following search criteria:
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
N-gram type = Part-of-speech with features
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Part-of-Speech with features: N*
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
Corpus: CGN + SONAR
</div>

<div lang="en" dir="ltr" class="mw-content-ltr">
This will give you the proportion of each POS tag for nouns in the combined CGN and SONAR corpus. You can refine the query to your needs and download the results as a CSV file.
</div>