EnSiWiki-2020 corpus for textual complexity modelling
Source: Figshare. Updated: 2025-01-30. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/EnSiWiki-2020_corpus_for_textual_complexity_modelling/25676209
Resource description:
The EnSiWiki-2020 Corpus is designed for modeling textual complexity using paired articles from Simple English and standard English Wikipedia. It employs two key sampling methods: age-based ranking and pair-based sampling. Age-based ranking ensures the inclusion of mature, high-quality articles while controlling for the differing growth rates of the two Wikipedias. Pair-based sampling links Simple English articles with their standard English counterparts, enabling resampling and train-test splits at the group (pair) level. This minimizes overfitting to specific topics and helps disentangle complexity from meaning. These characteristics make EnSiWiki-2020 well suited for training models to capture readability and textual complexity.

Description
The EnSiWiki dataset comprises Wikipedia pages sampled from Simple English and standard English Wikipedia as of April 1, 2020. The dataset includes a total of 311,656 pages, with metadata including the age rank of each pair. This metadata facilitates the selection of mature pairs, which are more likely to reflect the intended quality level and writing style of both Wikipedia versions.

Data File and Structure
Data are stored in `.db` (SQLite3) format, facilitating easy access to the text data. The database file ensiwiki2020.db contains two tables: Pages and Pairs. Pages details individual articles; Pairs links Simple English articles with their standard English counterparts.
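As a quick check of the layout described above, the following minimal sketch lists the tables and column names found in the SQLite file. It uses only Python's standard `sqlite3` module. The helper name `describe_tables` is hypothetical and not part of the dataset, and the file path is assumed to point at a locally downloaded copy of ensiwiki2020.db.

```python
import sqlite3


def describe_tables(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()


# Usage, assuming the file has been downloaded to the working directory:
#   for table, columns in describe_tables("ensiwiki2020.db").items():
#       print(table, columns)
```

If the file matches the description, the result should contain the Pages and Pairs tables; comparing the printed column names against the documented ones is a cheap sanity check before writing queries.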
Specific columns are:

Table Pages:
`id`: Article identifier.
`page_id`: Article identifier used by Wikipedia.
`lang`: Either 'english' or 'simple'.
`title`, `text`: Title and full text of the Wikipedia article, parsed using the JWPL MediaWiki parser (see https://dkpro.github.io/dkpro-jwpl/JWPLParser/).
`pairs_id`: Pair identifier.
`simple_agerank`, `english_agerank`: Rank of the articles after sorting them by age, based on their first-revision date.
`agerank`: Mean of the simple and english age ranks.
`is_disambiguation`, `is_discussion`, `redirect_count`, `stub_count`: Metadata columns summarizing the Wikipedia tags for disambiguation, discussion, redirect, and stub pages, aggregated per pair.
`min_word_count`, `min_sentence_count`: The smaller word count and sentence count of the two articles within a pair.

Table Pairs:
`id`: Pair identifier.
`simple_id`, `english_id`: Article identifiers from table Pages.

Purpose
The EnSiWiki-2020 Corpus is designed to detect interactions between various readability features, making it well suited for training textual complexity models. It includes age rank and additional metadata to help ensure the quality and maturity of the articles, supporting its validity as training data.

Licensing
EnSiWiki-2020: Wikipedia articles are shared under the CC-BY 4.0 license, which allows for redistribution and reuse under certain conditions.
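The pair-level train-test split mentioned in the description can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. It assumes the Pages and Pairs tables have the documented columns and that `min_word_count` is stored on each Pages row. The function names, the word-count threshold, the 0/1 labels, and the 20% test fraction are all hypothetical choices.

```python
import random
import sqlite3

# Join each pair to its Simple English and standard English page.
# Assumption: min_word_count is available on Pages rows, as documented.
PAIR_QUERY = """
SELECT pr.id, s.text, e.text
FROM Pairs AS pr
JOIN Pages AS s ON s.id = pr.simple_id
JOIN Pages AS e ON e.id = pr.english_id
WHERE s.min_word_count >= :min_words
ORDER BY pr.id
"""


def split_pairs(pair_ids, test_fraction=0.2, seed=0):
    """Shuffle pair ids reproducibly and return (train_ids, test_ids) as sets."""
    ids = sorted(set(pair_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


def load_split(conn, min_words=50, test_fraction=0.2, seed=0):
    """Return (train, test) lists of (pair_id, text, label) tuples.

    Label 0 marks the Simple English text and 1 the standard English text
    (an arbitrary convention for this sketch). Both members of a pair always
    land in the same split, so no topic appears in both train and test.
    """
    rows = conn.execute(PAIR_QUERY, {"min_words": min_words}).fetchall()
    train_ids, test_ids = split_pairs([r[0] for r in rows], test_fraction, seed)
    train, test = [], []
    for pair_id, simple_text, english_text in rows:
        bucket = test if pair_id in test_ids else train
        bucket.append((pair_id, simple_text, 0))
        bucket.append((pair_id, english_text, 1))
    return train, test


# Usage, assuming a local copy of the database file:
#   conn = sqlite3.connect("ensiwiki2020.db")
#   train, test = load_split(conn, min_words=50)
```

Splitting on the pair identifier rather than on individual pages is what keeps topic information from leaking between the two sets. A filter on `agerank` could be added to the query in the same way, once the direction of the ranking (whether a small rank means an older article) has been checked against the data.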
Created:
2025-01-30



