Newspaper corpora: Difference between revisions

* [https://taalmaterialen.ivdnt.org/download/tstc-sumnl-samenvattingencorpus/ Download page]


==Wablieft corpus: easy language==
The Wablieft corpus contains the digital archive of the Wablieft newspaper for the period 2011-2017, which is also available on the website http://www.wablieft.be/krant/archief.
 
It contains 2 million words of newspaper material in easy-to-read Dutch. Metadata is available for the newspaper section (interior, sport, ...) and the publication date. The corpus covers all material published since the newspaper became fully available digitally and online, from 2011 to December 2017.
 
The data is available in several formats: the original text files; text files with one sentence per line; annotations produced by Frog (POS tagging, lemmatisation, morphology, named entity recognition, chunking, dependency relations) in FoLiA or CoNLL format; and syntactic analyses produced by Alpino, in Alpino-XML.
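
As a rough illustration of how the plain-text and column-based formats might be read, the following is a minimal sketch. It assumes that the one-sentence-per-line files are plain text and that the CoNLL-style files use tab-separated columns with a blank line between sentences. The function names, the sample data and its column layout (index, token, lemma, POS) are illustrative assumptions, not taken from the corpus documentation; the actual column order depends on the tool that wrote the files.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield sentences from one-sentence-per-line text, skipping blank lines."""
    for line in lines:
        sentence = line.strip()
        if sentence:
            yield sentence


def read_conll(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences from CoNLL-style text as lists of token rows.

    Assumes tab-separated columns and a blank line between sentences.
    The meaning of each column is not fixed here.
    """
    rows: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if rows:
                yield rows
                rows = []
            continue
        rows.append(line.split("\t"))
    if rows:
        yield rows


# Hypothetical sample in a simplified layout: index, token, lemma, POS.
SAMPLE = "1\tDe\tde\tLID\n2\tkrant\tkrant\tN\n\n1\tHallo\thallo\tTSW\n"

sentences = list(read_conll(SAMPLE.splitlines()))
# Column 1 holds the token in this assumed layout.
tokens = [[row[1] for row in sent] for sent in sentences]
```

Both readers accept any iterable of lines, so an open file handle can be passed in directly. The FoLiA and Alpino-XML versions are XML and would need an XML parser instead.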
 
Under an agreement with Wablieft, this material may be distributed for non-commercial purposes. Commercial parties can contact Wablieft to obtain a license for the material.
 
* 2011-2017 archive of an easy-language newspaper in Belgian Dutch
* tagged, lemmatized, parsed, available in several file formats
* version 1.2
* [https://limo.libis.be/primo-explore/fulldisplay?docid=LIRIAS2859003&context=L&vid=Lirias&search_scope=Lirias&tab=default_tab&lang=en_US&fromSitemap=1 Vincent Vandeghinste, Bram Bulté & Liesbeth Augustinus (2019). Wablieft: An Easy-to-Read Newspaper corpus for Dutch. In ''CLARIN Annual Conference 2019 Proceedings'', pp. 188-191. Leipzig, Germany.]
* [https://taalmaterialen.ivdnt.org/download/tstc-wablieft-corpus-1-2/ Download page]
 
* [[Corpus VU-DNC (VU University Diachronic News text Corpus)]]

Revision as of 09:59, 2 March 2021

Newspaper corpora are corpora which exclusively consist of newspaper material.

==SumNL: summary-corpus==

The SumNL summary corpus is based on 30 clusters. Each cluster consists of a topic and 5-25 newspaper articles relevant to that topic. For each cluster, two summaries of different lengths were made, as well as extracts consisting of ten sentences taken from the texts.
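
The cluster structure described above can be pictured with a small data model. This is only a sketch: the class name, the field names and the check method are hypothetical and say nothing about how the corpus files are actually organised.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """One cluster as described in the text; all names are illustrative."""
    topic: str
    articles: List[str]
    # Two summaries of different lengths.
    summaries: List[str] = field(default_factory=list)
    # Each extract is a list of ten sentences taken from the articles.
    extracts: List[List[str]] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List the ways this cluster deviates from the description."""
        issues = []
        if not 5 <= len(self.articles) <= 25:
            issues.append("expected 5-25 articles")
        if len(self.summaries) != 2:
            issues.append("expected 2 summaries")
        if any(len(extract) != 10 for extract in self.extracts):
            issues.append("expected 10 sentences per extract")
        return issues


# Hypothetical cluster with placeholder content.
example = Cluster(
    topic="example topic",
    articles=["article text"] * 5,
    summaries=["short summary", "long summary"],
    extracts=[["sentence"] * 10],
)
```

Calling `example.problems()` on a cluster that matches the description returns an empty list.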
