<languages/>
<div lang="en" dir="ltr" class="mw-content-ltr">
This page lists the questions we received.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I get access to CLARIN tools and resources, without an academic account?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
You can request an account on the [https://idm.clarin.eu/unitygw/pub#!registration-CLARIN%20Identity%20Registration CLARIN Account Registration] page.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Do you have any domain specific corpora?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The [https://kdutch.ivdnt.org/wiki/K-Dutch#Corpora main page] lists the different types of corpora we have. Domain-specific corpora are the [[Parliamentary corpora]] and the [[Corpora of academic texts]]. The [[Parallel corpora]] also include domain-specific corpora.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there literary texts available?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The [https://kdutch.ivdnt.org/wiki/Historical_corpora#Public_Domain_Data_.40_DBNL Public Domain Page] links to the downloadable public domain files in DBNL.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Is there a speech recognition engine available for Belgian Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
Since April 2022, a new ASR engine has been available that is specifically suited to speech recognition for Belgian Dutch.
*[https://www.spraak.org/webservice/dutch_asr/ Online webservice] TEMPORARILY UNAVAILABLE
*[https://clinjournal.org/clinj/article/view/119 Scientific publication about speech recognition engine]
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
See also the page dedicated to [[Speech_recognition]] systems.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Which corpora are available for Automatic Simplification for Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
There are currently no parallel corpora available in which regular Dutch has been simplified, so simplification cannot straightforwardly be treated as a machine translation problem.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
If you are considering developing a form of unsupervised simplification, however, a number of corpora are available that can be considered to be written in a form of easy Dutch. These corpora are the [http://hdl.handle.net/10032/tm-a2-q6 Wablieft-corpus] (Easy Belgian Dutch), the [http://hdl.handle.net/10032/tm-a2-n4 Basilex-corpus] (Texts for children in Dutch primary schools), and [http://hdl.handle.net/10032/tm-a2-t9 WAI-NOT] (Very easy Belgian Dutch).
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there any corpora that contain dialogues between two or more people?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
There are a number of dialog components in CGN (Spoken Dutch Corpus).
* a. Spontane conversaties ('face-to-face') (spontaneous face-to-face conversations)
* c. Telefoondialogen opgenomen m.b.v. platform (telephone dialogues recorded via a platform)
* d. Telefoondialogen opgenomen m.b.v. minidiskrecorder (telephone dialogues recorded with a minidisc recorder)
* e. Zakelijke onderhandelingen (business negotiations)
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
* Download: https://taalmaterialen.ivdnt.org/download/tstc-corpus-gesproken-nederlands/
* Online search: https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
There is also the [https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/ IFA Dialog Video corpus].
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
"A collection of annotated video recordings of friendly Face-to-Face dialogs. It is modelled on the Face-to-Face dialogs in the Spoken Dutch Corpus (CGN). The procedures and design of the corpus were adapted to make this corpus useful for other researchers of Dutch speech. For this corpus 20 dialog conversations of 15 minutes were recorded and annotated, in total 5 hours of speech."
* Download: https://taalmaterialen.ivdnt.org/download/tstc-ifa-dialoog-videocorpus/
* Online data: https://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFADVcorpus/
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Advice for finding financial support for compiling a medical comparable corpus English-Dutch==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We are looking into whether this is fundable by [https://www.clarin.eu/content/clarin-resource-families-project-funding CLARIN Resource Families Project Funding]. The site indicates that it is best to first submit the idea informally to the CLARIN office, so they can advise us ("In view of the flexible nature of this call, applicants are encouraged to send in a project idea beforehand, in order to allow CLARIN Office to give additional guidelines and assess the eligibility of plans.").
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We would need to be clear, though, as to whether this is a parallel corpus, which is one of the categories in the Resource Families, or a comparable corpus, which is not one of the categories.
We might ask the CLARIN office whether they think it would be useful to add such a category.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We would have to identify a number of potential data sources, and make sure we can make the collected data publicly available for research, without GDPR or IP issues.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We should be aware of the [https://opus.nlpl.eu/EMEA.php EMEA corpus in OPUS].
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Is it possible to automate finding of word conversions for specific corpora?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
Do you think it is possible to draw up a list of conversion pairs of Dutch, i.e. words that can be used in more than one part of speech, on the basis of corpora (or possibly treebanks)? I am particularly concerned with the parts of speech noun, adjective, and verb. So, for example, the search algorithm should be able to identify the bold words in the following examples as conversion pairs:
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
* ik '''douche''' / ik neem een '''douche'''
* wij '''geloven''' in iets / zijn '''geloof''' in iets
* de '''crimineel''' zweert zijn '''criminele''' gedrag af
* wij '''onderhielden''' contacten / het '''onderhoud''' van het huis
* wij '''droogden''' het '''droge''' laken
* we '''trommelden''' op de '''trommel'''
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
Answer: The [https://taalmaterialen.ivdnt.org/download/tstc-e-lex/ e-Lex lexicon] allows you to search for word forms with multiple POS tags, as you ask. This lexicon is based on CGN. But your question goes a little further, I think. The verb '''geloven''' has the lemma "geloven", and the conversion to noun has the lemma '''geloof'''. So we should check whether the noun's lemma also occurs as a verb form, and likewise for adjectives.
A Perl script was written that extracts the requested sets from the lexicon file; the results were sent to the requester.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==I am looking for spoken and written corpora for a contrastive study German/Dutch in which I can find actual word forms==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We refer you to [http://opensonar.ivdnt.org/ OpenSonar], which is the only search engine for both the Spoken Dutch Corpus (CGN) and the SoNaR reference corpus and is available with a CLARIN login. An alternative is the [http://chn.ivdnt.org/ Corpus Hedendaags Nederlands (CHN) website], the online search engine for the Corpus of Contemporary Dutch (CHN). If you need more recent data, at INT we have a monitor corpus with weekly newspaper dumps at our disposal in which we can run searches for you. Unfortunately, we cannot make this monitor corpus available due to IP restrictions.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==I want to find all Dutch lemmas in which there is double derivation, can you help me?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The [http://hdl.handle.net/10032/tm-a2-h2 eLex lexicon] contains the morphology of lemmas as its third data field. We extracted all rows in which the sign for derivation (|) occurs twice and provided our user with a detailed list of entries and how often they occur in e-Lex.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<pre>
(ing)[N|V.](s)[N|N.N] 9344
(heid)[N|A.](s)[N|N.N] 1470
(er)[N|V.](s)[N|N.N] 1230
(ig)[A|N.](heid)[N|A.] 937
(atie)[N|V.](ief)[A|N.] 769
(iseer)[V|N.](atie)[N|V.] 603
...
</pre>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Can I get a distribution of the suffixes on Dutch adjectives?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The [http://hdl.handle.net/10032/tm-a2-h2 eLex lexicon] contains the morphology of lemmas as its third data field.
We counted how often each final suffix occurs across the lemma IDs that are adjectives. Lemmas without morphology were assigned category '0'.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<pre>
0 11781
(ig) 1781
(achtig) 507
(baar) 473
(isch) 431
(elijk) 392
(end) 367
(en) 292
(s) 278
(lijk) 237
(erig) 229
(ief) 168
(aal) 155
(loos) 138
(d) 116
...
</pre>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==I want to use the CGN wave files, but found a dead link on the original website==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
There is a permanent link for the CGN wave file download page: [http://hdl.handle.net/10032/tm-a2-k6 http://hdl.handle.net/10032/tm-a2-k6]
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Which treebanks are available for Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We have added the [[Treebanks]] page to this wiki to answer this question.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Is there a corpus with imperative sentences?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
No such dedicated corpus is available. If we provide [https://kdutch.ivdnt.org/wiki/Treebank_querying#GrETEL GrETEL] with an imperative example, we can extract similar sentences, which should be usable as an imperative corpus.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==We will run some topic modeling analyses on some Flemish/Belgian Dutch data we have. Because our data set is relatively small for this kind of task, the idea is to train the topic model on a much larger corpus (e.g. social media posts). Do you know of any such corpus that might be available?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
Take a look at the [https://kdutch.ivdnt.org/wiki/K-Dutch#Corpora overview of corpora on the main page].
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I calculate the readability of Dutch text?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
There is a tool called [https://tscan.hum.uu.nl/tscan/ T-scan] that may be helpful here.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I calculate Flesch-Douma for Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The formula for Flesch-Douma requires two things to be counted: the number of words per sentence and the number of syllables per word. While the number of words in a sentence is easily counted with any scripting language, the number of syllables may seem more difficult. The [http://hdl.handle.net/10032/tm-a2-h2 e-Lex] lexicon contains hyphenation patterns and hence the number of syllables per word. An alternative is to count the number of vowel clusters in each word using regular expressions, which should give you approximately the number of syllables.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==I am looking for a parallel corpus of Dutch-Turkish texts.==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We are comparing the Dutch and Turkish translations of the Linguistic Inquiry and Word Count [LIWC] dictionaries. Do you know of any corpora that would be suitable? I found several candidates on OPUS (https://opus.nlpl.eu/), and downloaded the TED2020 talks. However, these are .xml files with paragraph/line IDs and I need .txt files. Would you have a script or a way to automatically recode them and remove the unnecessary tags?
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We would also refer you to OPUS. You can find parallel text files if you download the Moses format: you then get a zip file containing a .nl and a .tr file that are sentence-aligned, i.e. the same line number in the two files should hold translations of each other.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Do you know, are there any reasonable sentiment analysis algorithms/approaches for Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We have now added a page on [[sentiment analysis]] to this wiki.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there any spoken corpora available of spontaneous speech with time stamped transcriptions, which are freely available?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
On the [https://kdutch.ivdnt.org/wiki/Spoken_corpora Spoken corpora] page we have collected what is available for Dutch.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The Corpus Gesproken Nederlands (CGN) has a section of spontaneous speech, with time-stamped transcriptions, freely available.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==How can I combine search for "green" and "red" word order in OpenSonar?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
In principle it is possible to ask for both orders at the same time; see the example for more info.
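The pattern in the linked example, once URL-decoded, is simply two token sequences joined by <code>|</code>: the finite auxiliary followed by the participle, or the participle followed by the auxiliary. The Python sketch below rebuilds that pattern and URL-encodes it. The token constraints are copied from the example URL; the helper name <code>both_orders</code> is hypothetical and only serves the illustration.

```python
from urllib.parse import quote_plus

# Token constraints copied from the decoded example URL.
FINITE_AUX = '[lemma="hebben|zijn" & pos_head="ww"&pos_wvorm="pv"]'
PARTICIPLE = '[pos_head="ww"&pos_wvorm="vd"]'


def both_orders(first: str, second: str) -> str:
    """Return a pattern that matches 'first second' or 'second first'.

    Hypothetical helper: it only concatenates strings and is not part
    of OpenSonar itself.
    """
    return f"{first}{second} | {second}{first}"


pattern = both_orders(FINITE_AUX, PARTICIPLE)
encoded = quote_plus(pattern)  # value to use for the patt= URL parameter

print(pattern)
print(encoded)
```

The same union can be typed directly into the expert search field; the encoded form is only needed when building a link.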
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
*[https://portal.clarin.inl.nl/opensonar_frontend/opensonar/search/hits?filter=Corpus_title%3A%28%22CGN%22%29&first=0&number=20&patt=%5Blemma%3D%22hebben%7Czijn%22+%26+pos_head%3D%22ww%22%26pos_wvorm%3D%22pv%22%5D%5Bpos_head%3D%22ww%22%26pos_wvorm%3D%22vd%22%5D+%7C+%5Bpos_head%3D%22ww%22%26pos_wvorm%3D%22vd%22%5D%5Blemma%3D%22hebben%7Czijn%22+%26+pos_head%3D%22ww%22%26pos_wvorm%3D%22pv%22%5D&interface=%7B%22form%22%3A%22search%22%2C%22patternMode%22%3A%22expert%22%7D Example]
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Can you give advice on setting up a transcription process for a spoken language corpus?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
A meeting was held in which we discussed the use of speech recognition, segmentation, speaker diarisation, and post-editing of speech recognition output. We advised including K-Dutch in the project proposal so that K-Dutch can take care of converting ASR output to ELAN tiers, merging ELAN tiers, etc.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==What are the character n-gram frequencies for Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
We have counted the n-gram frequencies up to trigrams and made them available at [[Character_N-grams]].
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Is there a user-friendly interface for working with the EMEA part of the Lassy-Large treebank?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
Lassy Large is extremely large, and therefore not entirely available through online query tools such as GrETEL and PaQu, although the latter provides access to the newspaper part. A suggestion is to download the data and then import it into an XML database engine such as [https://basex.org/ BaseX], which will allow you to query it with XPath and XQuery.
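To give an impression of what such an XPath query looks like, here is a minimal sketch that uses Python's standard library instead of BaseX. The sample document is hand-made and only modelled on the general shape of a dependency tree in XML (nested <code>node</code> elements with attributes); the attribute names <code>pos</code>, <code>word</code>, <code>lemma</code>, <code>rel</code> and <code>cat</code> are assumptions here and should be checked against the actual Lassy files.

```python
import xml.etree.ElementTree as ET

# Hand-made sample, NOT taken from Lassy Large; element and attribute
# names are assumptions for illustration only.
SAMPLE = """<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pos="pron" word="wij" lemma="wij"/>
      <node rel="hd" pos="verb" word="droogden" lemma="drogen"/>
      <node cat="np" rel="obj1">
        <node rel="det" pos="det" word="het" lemma="het"/>
        <node rel="mod" pos="adj" word="droge" lemma="droog"/>
        <node rel="hd" pos="noun" word="laken" lemma="laken"/>
      </node>
    </node>
  </node>
  <sentence>wij droogden het droge laken</sentence>
</alpino_ds>"""


def words_with_pos(xml_text: str, pos: str) -> list:
    """Return the word forms of all nodes with the given pos attribute."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a subset of XPath, including attribute predicates.
    return [n.get("word") for n in root.findall(f".//node[@pos='{pos}']")]


print(words_with_pos(SAMPLE, "verb"))
print(words_with_pos(SAMPLE, "noun"))
```

In an XML database the same path expression (<code>//node[@pos='verb']</code>) would be run over the whole imported collection rather than over a single string.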
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==What are good POS taggers for non-standard Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
See the [[Basic_language_processing]] page.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Are there any lexical profiling tools for Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The request is whether there are any user-friendly and freely available tools with which teachers can assess the lexical profile of a text (to which frequency levels do the words belong, how many of the most frequent words should a reader know to understand 95%, etc.).
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
While we are not aware of any tools that do this explicitly, a number of tools go part of the way. We suggest taking a look at
*[https://lint.gebruikercentraal.nl/over-lint/ LINT], which assesses the readability of a text
*[https://tscan.hum.uu.nl/tscan/ T-scan], which is what LINT is based on, and which is a bit less user-friendly
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
A number of other tools can be found at the [https://ilt.kuleuven.be/aanbod/index.php Instituut voor Levende Talen].
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
== Is there a word list of difficult words for Dutch? ==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
While this question is about ''difficult'' words, we mainly have word lists of easy words at our disposal:
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
* There is the BasiLex lexicon that contains words aimed at children up to 12 years old
* There is the BasiScript lexicon that contains words produced by children up to 12 years old
* There are frequency lists available for a number of different corpora, each listing the 5000 most frequent words: http://hdl.handle.net/10032/tm-a2-f8
* The NT2Lex provides a lexicon with words and how they occur in texts aimed at specific CEFR levels.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
If a word does not occur in a list of easy words, that could be used as a measure of word difficulty. But difficulty is determined by more than just the words; there is also syntactic complexity, for example.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
You could use, for example, the T-scan tool, which measures the complexity of texts.
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
== Is there a text only version of the Corpus Spoken Dutch? ==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The download files of the Spoken Dutch Corpus (CGN) do not include a text-only version. The <code>ort</code> files contain orthographic transcriptions and timestamps, and the <code>plk</code> files contain part-of-speech and lemma information. The following Perl script takes a list of plk files as input and prints the text. If you run this script from the command line, you can redirect its output to create text files.
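</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>use strict;</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>use warnings;</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>binmode(STDOUT, ":utf8");</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code># Print the text of each plk file given on the command line, one utterance per line.</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>foreach my $in (@ARGV) {</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  print STDERR "Processing $in\n";</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  open(my $fh, "<", $in) or die "Cannot open $in: $!\n";</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  my @words;</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  while (my $line = <$fh>) {</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>    chomp($line);</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>    if ($line =~ /^<au/) {</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>      # A new utterance starts: print the words collected for the previous one.</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>      print join(" ", @words) . "\n" if @words;</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>      @words = ();</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>    }</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>    else {</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>      # Fields are tab-separated: word form, tag, lemma.</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>      my ($word, $tag, $lem) = split(/\t/, $line);</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>      push(@words, $word) if defined $word;</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>    }</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  }</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  # Print the last utterance, which is not followed by another <au line.</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  print join(" ", @words) . "\n" if @words;</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>  close($fh);</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
<code>}</code>
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
==Where can I find when which recording was made for the Corpus Spoken Dutch?==
</div>
<div lang="en" dir="ltr" class="mw-content-ltr">
The CGN includes a metadata spreadsheet that records the year in which each recording was made.
</div>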
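For the Flesch-Douma question above, the vowel-cluster approach can be sketched in Python as follows. This is a rough illustration, not a validated implementation: the coefficients (206.84, 0.93 and 77) are the ones commonly cited for Douma's adaptation and are not taken from this page, so please check them against your own reference. The sentence splitter is deliberately naive, and counting vowel clusters only approximates the syllable count (adjacent vowels that belong to different syllables are counted once).

```python
import re

# One or more adjacent vowels count as one syllable (an approximation).
VOWEL_CLUSTER = re.compile(r"[aeiouyáàâäéèêëíìîïóòôöúùûü]+", re.IGNORECASE)
# A word is a run of letters, optionally joined by apostrophes or hyphens.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")
# Naive sentence boundary: any run of ., ! or ?
SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word: str) -> int:
    """Approximate the syllable count as the number of vowel clusters."""
    return max(1, len(VOWEL_CLUSTER.findall(word)))


def flesch_douma(text: str) -> float:
    """Flesch-Douma score with commonly cited coefficients (assumption)."""
    sentences = [s for s in SENTENCE_END.split(text) if WORD.search(s)]
    words = WORD.findall(text)
    if not sentences or not words:
        raise ValueError("text contains no words")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.84 - 0.93 * words_per_sentence - 77.0 * syllables_per_word


print(flesch_douma("Wij droogden het droge laken. We trommelden op de trommel."))
```

For production use, the syllable counts derived from the e-Lex hyphenation patterns will be more reliable than the regular expression; the function <code>count_syllables</code> can then be replaced by a lexicon lookup with the regular expression as a fallback for unknown words.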