WikiReaD (Wikipedia Readability Dataset)

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://zenodo.org/record/11371931
Resource description:
Dataset Description: The dataset contains pairs of encyclopedic articles in 14 languages. Each pair includes the same article at two levels of readability (easy/hard). The pairs are obtained by matching Wikipedia articles (hard) with the corresponding versions from different simplified or children's encyclopedias (easy).

Dataset Details:
Number of languages: 14
Number of files: 19
Use case: Training and evaluating readability scoring models for articles within and outside Wikipedia.
Processing details: Text pairs are created by matching each Wikipedia article with the corresponding article in the simplified or children's encyclopedia, either via the Wikidata item ID or via the page title. The text of each article is extracted directly from its parsed HTML version.
Files: The dataset consists of independent files for each type of children's/simplified encyclopedia and each language (e.g., `-_sentences.bz2`). The dataset also contains train-test split files for simplewiki-en (trainsplit_simplewiki-en_sentences.bz2, testsplit_simplewiki-en_sentences.bz2), which are needed to reproduce the results of the corresponding paper.

Attribution: The dataset was compiled from the following sources. The text of the original articles comes from the corresponding language version of Wikipedia. The text of the simplified articles comes from one of the following encyclopedias: Simple English Wikipedia, Vikidia, Klexikon, Txikipedia, or Wikikids. Below we provide information about the license of the original content as well as the template to generate the link to the original source for a given page () and language (). (For example, https://en.wikipedia.org/wiki/Spain links to the page “Spain” in English Wikipedia.)

Wikipedia
Source: https://.wikipedia.org/wiki/
License: CC BY-SA 4.0, GFDL

Simple English Wikipedia
Source: https://simple.wikipedia.org/wiki/
License: CC BY-SA 4.0, GFDL

Vikidia
Source: https://.vikidia.org/wiki/
License: CC BY-SA 3.0, GFDL

Klexikon
Source: https://klexikon.zum.de/wiki/
License: CC BY-SA 4.0

Txikipedia
Source: https://eu.wikipedia.org/wiki/Txikipedia:
License: CC BY-SA 4.0, GFDL

Wikikids
Source: https://wikikids.nl/
License: CC BY-SA 3.0

Code for data collection: TBD
Related paper citation: TBD
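The link templates and the .bz2 files described above can be explored with a short Python sketch. It rests on several assumptions that the description does not confirm: that the language code fills the subdomain and the page title is appended at the end of each template (inferred only from the https://en.wikipedia.org/wiki/Spain example), that spaces in titles become underscores, and that the .bz2 files are UTF-8 text. The dictionary keys and the function names `source_url` and `peek_lines` are hypothetical, chosen for illustration. The internal layout of each line in the files is not documented here, so the reader only returns raw lines.

```python
import bz2
from itertools import islice
from urllib.parse import quote

# Assumed placement of the language code and page title in each template,
# inferred from the example https://en.wikipedia.org/wiki/Spain.
# The dictionary keys are hypothetical labels, not names used by the dataset.
URL_TEMPLATES = {
    "wikipedia": "https://{lang}.wikipedia.org/wiki/{page}",
    "simplewiki": "https://simple.wikipedia.org/wiki/{page}",
    "vikidia": "https://{lang}.vikidia.org/wiki/{page}",
    "klexikon": "https://klexikon.zum.de/wiki/{page}",
    "txikipedia": "https://eu.wikipedia.org/wiki/Txikipedia:{page}",
    "wikikids": "https://wikikids.nl/{page}",
}


def source_url(source, page, lang=""):
    """Build a link to the original page under the assumptions above."""
    # Assumption: spaces in titles are written as underscores in the URL.
    title = quote(page.replace(" ", "_"), safe="/:")
    return URL_TEMPLATES[source].format(lang=lang, page=title)


def peek_lines(path, n=3):
    """Return the first n raw text lines of a bz2 file (UTF-8 assumed)."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

Under these assumptions, `source_url("wikipedia", "Spain", lang="en")` reproduces the example link given in the attribution text, and `peek_lines` can be pointed at a downloaded file such as testsplit_simplewiki-en_sentences.bz2 to inspect its structure before writing a parser.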
Created:
2024-07-11