Newspaper corpora
Latest revision as of 12:39, 13 March 2024
SoNaR corpus
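The SoNaR Corpus has a newspaper component (WR-P-P-G) containing nearly 15 million sentences. See also Reference corpora.
- Online search: https://opensonar.ivdnt.org
- Download page: http://hdl.handle.net/10032/tm-a2-h5
- Project page: http://lands.let.ru.nl/projects/SoNaR/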
SumNL: summary-corpus
The SumNL summary corpus is based on 30 clusters. Each cluster consists of a topic and 5-25 newspaper articles relevant to that topic. For each cluster, two summaries of different lengths were written, and extracts consisting of ten sentences from the texts were made.
- version 1.0.1
- data set from 2014
- 1.60 MB
- Download page: http://hdl.handle.net/10032/tm-a2-h7
Wablieft corpus: easy language
The Wablieft corpus contains the digital archive of the Wablieft newspaper (period 2011-2017), which is also available at http://www.wablieft.be/krant/archief.
It contains 2 million words of newspaper material in easy-to-read Dutch. Metadata is available for the newspaper section (domestic news, sport, ...) and the publication date. The corpus covers all material published since the newspaper became fully available digitally and online, from 2011 to December 2017.
The data is available in several formats: original text files; text files with one sentence per line; files annotated with Frog (POS tagging, lemmatisation, morphology, named entity recognition, chunking, dependency relations) in FoLiA or CoNLL format; and syntactic analyses produced by Alpino, in Alpino-XML.
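As an illustration of working with the column-based (CoNLL-style) files, the following is a minimal Python sketch. It assumes only the general CoNLL convention of one token per line, tab-separated columns, and a blank line between sentences. The exact column layout of the Wablieft files is not described here, so the three-column sample data and the function name read_columned_sentences are hypothetical and should be checked against the actual files.

```python
from typing import Iterable, Iterator, List


def read_columned_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences from CoNLL-style input.

    Each sentence is a list of token rows; each row is the list of
    tab-separated column values found on one line. A blank line ends
    the current sentence.
    """
    sentence: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Hypothetical sample: index, token, lemma (the real files have more columns).
SAMPLE = "1\tDe\tde\n2\tkrant\tkrant\n\n1\tHallo\thallo\n"
sentences = list(read_columned_sentences(SAMPLE.splitlines()))
```

For a real file, the same function can be applied to an open file handle, for example read_columned_sentences(open(path, encoding="utf-8")), where path points to one of the downloaded column-format files.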
There is an agreement with Wablieft for the distribution of this material for non-commercial purposes. Commercial parties can contact Wablieft to obtain a license for the material.
- 2011-2017 archive of easy language newspaper in Belgian Dutch
- tagged, lemmatized, parsed, available in several file formats
- version 1.2
- Vincent Vandeghinste, Bram Bulté & Liesbeth Augustinus (2019). Wablieft: An Easy-to-Read Newspaper corpus for Dutch. In CLARIN Annual Conference 2019 Proceedings, pp. 188-191. Leipzig, Germany.
- Download page: http://hdl.handle.net/10032/tm-a2-q6
WAI-NOT Corpus
The WAI-NOT Corpus contains the digital archive of the WAI-NOT newspaper (https://www.wai-not.be/page/10), covering the period 2009-2021. The newspaper articles are written in easy-to-read Dutch.
- 2009-2021 archive of easy language newspaper in Belgian Dutch
- version 1.0
- Download page: http://hdl.handle.net/10032/tm-a2-t9
Corpus VU-DNC (VU University Diachronic News text Corpus)
The VU-DNC Corpus is a diachronic Dutch newspaper corpus (VU Free University Dutch Newspaper Corpus).
The corpus consists of data from five newspapers: Algemeen Dagblad, NRC (Handelsblad), De Telegraaf, Trouw and De Volkskrant. For each newspaper, data from two periods (1950/1951 and 2002) are available. The articles were selected by topic (e.g. headline news, foreign news and sports). A special feature of the corpus is that both the presence of subjective elements in the articles and the presence of direct speech have been annotated. The subjective elements are annotated on the basis of a set of lexical elements (a subjectivity lexicon). As a result, the corpus is useful to linguistically oriented researchers interested in diachrony and/or subjectivity, and to communication scientists and media scholars interested in changing practices in the framing of news coverage.
- Corpus website: https://ivdnt.org/wp-content/apps/vu-dnc/index.html