ARTS Datasets - ARTS94, ARTS300, ARTS3000
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11371689
Description:
Datasets for evaluating readability and text simplicity, in three sizes: 94, 300, and 3000 disjoint data entries. Each data entry contains the following fields:
Text_original: Text from a parallel corpus for text simplification
Text_formatted: Text_original with formatting issues resolved, either manually (ARTS94) or automatically (ARTS300 and ARTS3000)
Dataset: The parallel text simplification corpus from which the original text was extracted
Label: Whether the text comes from the simplified (simp) or source (src) part of the corpus
ID: Unique ID
Score: Simplicity/readability score of the formatted text, between 0 and 1; the higher the score, the more complex and less readable the text
The licenses of the respective source datasets apply to the texts drawn from them.
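The field layout above can be checked programmatically before any scores are used. The following is a minimal sketch, assuming the entries are available as CSV with a header row that uses the six field names exactly as listed; the actual file format of the Zenodo record is not stated here, so the reader may need adapting. The names `Entry`, `parse_entries`, and `mean_score_by_label` are hypothetical helpers, and the sample rows are invented placeholders, not ARTS data.

```python
import csv
import io
from dataclasses import dataclass
from statistics import mean

# Field names as listed in the dataset description.
FIELDS = ["Text_original", "Text_formatted", "Dataset", "Label", "ID", "Score"]
VALID_LABELS = {"simp", "src"}


@dataclass
class Entry:
    text_original: str
    text_formatted: str
    dataset: str
    label: str
    id: str
    score: float


def parse_entries(handle):
    """Read entries from a CSV file object and validate each row."""
    reader = csv.DictReader(handle)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    entries = []
    for row in reader:
        score = float(row["Score"])
        # The description states that scores lie between 0 and 1.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for ID {row['ID']}: {score}")
        if row["Label"] not in VALID_LABELS:
            raise ValueError(f"unknown label for ID {row['ID']}: {row['Label']}")
        entries.append(
            Entry(
                text_original=row["Text_original"],
                text_formatted=row["Text_formatted"],
                dataset=row["Dataset"],
                label=row["Label"],
                id=row["ID"],
                score=score,
            )
        )
    return entries


def mean_score_by_label(entries):
    """Average score per label; higher means more complex, less readable."""
    groups = {}
    for entry in entries:
        groups.setdefault(entry.label, []).append(entry.score)
    return {label: mean(scores) for label, scores in groups.items()}


# Placeholder rows, invented for illustration only; not taken from ARTS.
SAMPLE = """Text_original,Text_formatted,Dataset,Label,ID,Score
"A  complex sentence .",A complex sentence.,corpus_a,src,1,0.75
A simple one.,A simple one.,corpus_a,simp,2,0.25
"""

entries = parse_entries(io.StringIO(SAMPLE))
print(mean_score_by_label(entries))
```

To run this against a downloaded file, an open file handle can be passed in place of the `io.StringIO` wrapper, for example `open(path, newline="", encoding="utf-8")`. Comparing the per-label means is only a quick sanity check that simp texts tend to score lower than src texts; it is not part of the dataset's own evaluation procedure.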
Created:
2024-05-28



