| ==[[Simplification Data]]==
| ===ASSET Simplification Corpus===
| |
| The ASSET simplification corpus (Alva-Manchego et al., 2020) was automatically translated into Dutch (Seidl et al., 2023) and is freely available.
| |
|
| |
| *[https://github.com/tsei902/simplify_dutch/tree/main/resources/datasets/asset GitHub download]
| |
| * <small>Alva-Manchego, F., Martin, L., Bordes, A., Scarton, C., Sagot, B., & Specia, L. (2020). ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations. arXiv preprint arXiv:2005.00481.</small>
| |
| * <small>Seidl, T., Vandeghinste, V., & Van de Cruys, T. (2023). [https://kuleuven.limo.libis.be/discovery/fulldisplay?docid=alma9993527112601488&context=L&vid=32KUL_KUL:KULeuven&lang=en&search_scope=All_Content&adaptor=Local%20Search%20Engine&tab=all_content_tab&query=any,contains,seidl%20theresa&offset=0 Controllable Sentence Simplification in Dutch]. KU Leuven. Faculteit Ingenieurswetenschappen.</small>
| |
|
| |
| ===Wikilarge Dataset===
| |
| An automatic Dutch translation of the Wikilarge dataset (Seidl et al., 2023), useful for automatic simplification and freely available. The original dataset is from Zhang & Lapata (2017).
| |
|
| |
| *[https://github.com/tsei902/simplify_dutch/tree/main/resources/datasets/wikilarge GitHub download]
| |
| * <small>Seidl, T., Vandeghinste, V., & Van de Cruys, T. (2023). [https://kuleuven.limo.libis.be/discovery/fulldisplay?docid=alma9993527112601488&context=L&vid=32KUL_KUL:KULeuven&lang=en&search_scope=All_Content&adaptor=Local%20Search%20Engine&tab=all_content_tab&query=any,contains,seidl%20theresa&offset=0 Controllable Sentence Simplification in Dutch]. KU Leuven. Faculteit Ingenieurswetenschappen.</small>
| |
| * <small>Zhang, X. & Lapata, M. (2017). Sentence Simplification with Deep Reinforcement Learning. In ''Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing'', pages 584–594, Copenhagen, Denmark. Association for Computational Linguistics.</small>
| |
|
| |
| ===Comparable Corpus Wablieft De Standaard===
| |
| *[https://github.com/nivack/comparable_corpus_Wablieft_deStandaard GitHub]
| |
| *[https://kuleuven.limo.libis.be/discovery/fulldisplay?docid=alma9993153812401488&context=L&vid=32KUL_KUL:KULeuven&lang=en&search_scope=All_Content&adaptor=Local%20Search%20Engine&tab=all_content_tab&query=any,contains,nick%20vanackere&offset=0 Vanackere, N., & Vandeghinste, V. (2022). Building a comparable corpus between easy-to-read Dutch Wablieft and De Standaard. KU Leuven. Faculteit Ingenieurswetenschappen.]
| |
|
| |
| ===UWV Leesplank NL wikipedia===
| |
| The set contains 2,391,206 paragraphs of prompt/result combinations, where the prompt is a paragraph from Dutch Wikipedia and the result is a simplified text, which may span more than one paragraph. The dataset was created by UWV as part of the "Leesplank" project, an effort to generate datasets that are ethically and legally sound.
| |
|
| |
| * [https://huggingface.co/datasets/UWV/Leesplank_NL_wikipedia_simplifications/blob/main/README.md HuggingFace ReadMe file]
| |
|
| |
| *[https://huggingface.co/datasets/UWV/Leesplank_NL_wikipedia_simplifications Dataset]
| |
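The prompt/result structure of the Leesplank set described above can be turned into (complex, simplified) text pairs with a few lines of Python. This is a minimal sketch, not an official loader: the field names <code>prompt</code> and <code>result</code> are assumed from the description above, the split name <code>train</code> is an assumption, and the helper names <code>to_pairs</code> and <code>stream_leesplank</code> are hypothetical. Check the HuggingFace ReadMe file for the actual column and split names.

```python
from itertools import islice
from typing import Iterable, Iterator, Mapping, Tuple


def to_pairs(
    records: Iterable[Mapping[str, object]],
    source_field: str = "prompt",   # assumed column name for the Wikipedia paragraph
    target_field: str = "result",   # assumed column name for the simplified text
) -> Iterator[Tuple[str, str]]:
    """Yield (complex, simplified) pairs, skipping records with an empty side."""
    for record in records:
        source = str(record.get(source_field) or "").strip()
        target = str(record.get(target_field) or "").strip()
        if source and target:
            yield source, target


def stream_leesplank(limit: int = 3):
    """Fetch the first few pairs without downloading the full dataset.

    Requires the third-party `datasets` package and network access.
    The split name "train" is an assumption.
    """
    from datasets import load_dataset  # imported lazily so to_pairs works without it

    ds = load_dataset(
        "UWV/Leesplank_NL_wikipedia_simplifications",
        split="train",
        streaming=True,
    )
    return list(islice(to_pairs(ds), limit))
```

Calling <code>stream_leesplank()</code> streams records rather than fetching all paragraphs up front, which seems a reasonable choice for a set of this size; <code>to_pairs</code> also works on any list of dictionaries with the same two fields.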