Corpus querying: Difference between revisions

Latest revision as of 10:04, 19 June 2025

Autosearch

This demonstrator allows users to define one or more corpora and upload data for them, after which the corpora automatically become searchable in a private workspace.

Users can upload text data annotated with lemma and part-of-speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is initially limited (25 MB per uploaded file; 500,000 tokens for an entire corpus), but these limits may be increased later. The search application is powered by the INT BlackLab corpus search engine. The search interface is the same as the one used in, for example, the Corpus of Contemporary Dutch / Corpus Hedendaags Nederlands.
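The upload limits above can be checked locally before submitting data. The following is a minimal sketch, not part of Autosearch itself: it assumes that tokens are encoded as `<w>` elements (as is usual in both TEI and FoLiA), that "25 MB" means 25,000,000 bytes, and that only files ending in `.xml` inside an archive count. The function names `count_tokens`, `xml_documents` and `check_upload` are hypothetical.

```python
import io
import os
import tarfile
import zipfile
import xml.etree.ElementTree as ET

# Limits quoted in the text; the exact byte interpretation is an assumption.
MAX_FILE_BYTES = 25 * 1000 * 1000
MAX_CORPUS_TOKENS = 500_000


def count_tokens(xml_bytes):
    """Count <w> elements in one XML document, ignoring namespaces."""
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes)):
        # Tags look like "{namespace}w" or plain "w".
        if elem.tag.rsplit("}", 1)[-1] == "w":
            count += 1
        elem.clear()  # free memory for large documents
    return count


def xml_documents(path):
    """Yield the raw bytes of every XML document in a file or archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            for name in zf.namelist():
                if name.lower().endswith(".xml"):
                    yield zf.read(name)
    elif tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            for member in tf:
                if member.isfile() and member.name.lower().endswith(".xml"):
                    yield tf.extractfile(member).read()
    else:
        # Anything else is treated as a single XML file.
        with open(path, "rb") as fh:
            yield fh.read()


def check_upload(paths):
    """Return (total token count, list of problems) for a set of uploads."""
    problems = []
    total = 0
    for path in paths:
        size = os.path.getsize(path)
        if size > MAX_FILE_BYTES:
            problems.append(f"{path}: {size} bytes exceeds the per-file limit")
        total += sum(count_tokens(doc) for doc in xml_documents(path))
    if total > MAX_CORPUS_TOKENS:
        problems.append(f"corpus has {total} tokens, above the corpus limit")
    return total, problems
```

Calling `check_upload(["part1.zip", "extra.xml"])` (hypothetical file names) returns the token total together with a list of messages; an empty list means the files fit within the limits as interpreted here. The service may count tokens differently, so the result is only an estimate.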

Demonstrator (Only accessible with a CLARIN account)

OpenSonar

The current application contains two corpora:

The SoNaR corpus (See Reference corpora for more information.)
The Corpus of Spoken Dutch (Corpus Gesproken Nederlands, CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech, originating from Flemish and Dutch speakers. The speech fragments (spontaneous and prepared) are aligned with various transcriptions (including orthographic and phonetic) and annotations (lemma, POS tags). All annotations have been verified manually, except for the phonetic transcription, of which only 11.3% was verified. The corpus data are available for researchers (http://hdl.handle.net/10032/tm-a2-k6).

The application has been developed in the CLARIN-NL and CLARIAH projects by a joint team of the Dutch Language Institute, Tilburg University and Radboud University.

The application is a web-based frontend for the BlackLab search engine for corpora with token-based annotation. The current frontend is a further development of the corpus-frontend application developed by INT (https://github.com/INL/corpus-frontend) and its design is inspired by the first version of the OpenSoNaR user interface by Tilburg and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).
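Token-based annotation means that each token carries named values (such as lemma and part of speech) that a query can constrain. BlackLab accepts such queries in Corpus Query Language (CQL). The sketch below only builds CQL strings; the helper names `token_pattern` and `sequence` are hypothetical, and the annotation names `lemma` and `pos` and the tag value `VERB` are assumptions, since the actual names and tag sets depend on how a corpus was indexed.

```python
def token_pattern(**annotations):
    """Build a CQL pattern for one token from annotation=value pairs.

    Values are regular expressions in CQL, so only double quotes are
    escaped here; with no constraints the pattern matches any token.
    """
    if not annotations:
        return "[]"
    parts = []
    for name, value in annotations.items():
        escaped = value.replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def sequence(*patterns):
    """Join token patterns into a query for consecutive tokens."""
    return " ".join(patterns)


# A form of the lemma "lopen" tagged as a verb, followed by any token.
query = sequence(token_pattern(lemma="lopen", pos="VERB"), token_pattern())
print(query)
```

The resulting string can be pasted into the expert or CQL view of a search interface that offers one; whether the annotation names match has to be checked against the corpus at hand.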

Website (Only accessible with a CLARIN account)

Corpus Analysis Tools

A comprehensive list of tools used in corpus compilation and analysis.

Website

CLARIN Resource Family on Corpus querying

Corpus Query Tools lists corpus query tools for many different languages.

@@ Line 6: / Line 6: @@
 <!--T:2-->
-Users can upload text data annotated with lemma + part of speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is limited to begin with (25 MB limit per uploaded file; 500,000 token limit for an entire corpus), but these limits may be increased at a later point in time. The search application is powered by the INL BlackLab corpus search engine. The search interface is the same as the one used in for example the [https://chn.ivdnt.org Corpus of Contemporary Dutch / Corpus Hedendaags Nederlands].
+Users can upload text data annotated with lemma + part of speech tags in TEI or FoLiA format, either as a single XML file or as an archive (zip or tar.gz) containing several XML files. Corpus size is limited to begin with (25 MB limit per uploaded file; 500,000 token limit for an entire corpus), but these limits may be increased at a later point in time. The search application is powered by the INT BlackLab corpus search engine. The search interface is the same as the one used in for example the [https://chn.ivdnt.org Corpus of Contemporary Dutch / Corpus Hedendaags Nederlands].
 <!--T:3-->
@@ Line 17: / Line 17: @@
 *The Corpus of Spoken Dutch (Corpus Gesproken Nederlands, CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech, originating from Flemish and Dutch speakers. The speech fragments (spontaneous and prepared) are aligned with various transcriptions (including orthographic, phonetic) and annotations (lemma, POS tags). All annotations have been verified manually, except for the phonetic transcription: only 11,3% was verified. The corpus data are available for researchers, see [http://hdl.handle.net/10032/tm-a2-k6 here].
-* [https://opensonar.ivdnt.org/ Website] (Only accessible with a CLARIN account)
 <!--T:5-->
@@ Line 26: / Line 25: @@
 <!--T:7-->
+* [https://opensonar.ivdnt.org/ Website] (Only accessible with a CLARIN account)
+<!--T:9-->
 == Corpus Analysis Tools ==
 A comprehensive list of tools used in corpus compilation and analysis.
@@ Line 32: / Line 34: @@
 * [https://corpus-analysis.com/ Website]
 </translate>
+==CLARIN Resource Family on Corpus querying==
+* [https://www.clarin.eu/resource-families/corpus-query-tools Corpus Query Tools] lists corpus query tools for many different languages.