community-datasets/turk
收藏数据集卡片 TURK
数据集描述
数据集摘要
TURK 是一个用于评估英语句子简化任务的多参考数据集。该数据集包含 2,359 个来自 Parallel Wikipedia Simplification (PWKP) corpus 的句子,每个句子有 8 个众包简化版本,仅关注词汇释义(不包括句子拆分或删除)。
支持的任务和排行榜
该数据集不提供排行榜。
语言
TURK 仅包含英语文本(BCP-47: en)。
数据集结构
数据实例
每个实例包含一个原始句子和 8 个可能的参考简化版本,这些简化版本仅关注词汇释义。
json { "original": "one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed, a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat region in sudan.", "simplifications": [ "one side of the armed conflicts is made of sudanese military and the janjaweed, a sudanese militia recruited from the afro-arab abbala tribes of the northern rizeigat region in sudan.", "one side of the armed conflicts consist of the sudanese military and the sudanese militia group janjaweed.", "one side of the armed conflicts is mainly sudanese military and the janjaweed, which recruited from the afro-arab abbala tribes.", "one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed, a sudanese militia group recruited mostly from the afro-arab abbala tribes in sudan.", "one side of the armed conflicts is made up mostly of the sudanese military and the janjaweed, a sudanese militia group whose recruits mostly come from the afro-arab abbala tribes from the northern rizeigat region in sudan.", "the sudanese military and the janjaweed make up one of the armed conflicts, mostly from the afro-arab abbal tribes in sudan.", "one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed, a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat regime in sudan.", "one side of the armed conflicts is composed mainly of the sudanese military and the janjaweed, a sudanese militia group recruited mostly from the afro-arab abbala tribes of the northern rizeigat region in sudan." ] }
数据字段
original: 来自源数据集的原始句子。simplifications: 一组由众包工作者提供的参考简化版本。
数据拆分
TURK 不包含训练集;许多模型使用 WikiLarge 或 Wiki-Auto 进行训练。每个输入句子有 8 个相关的参考简化句子。2,359 个输入句子被随机分为 2,000 个验证句子和 359 个测试句子。
| Dev | Test | Total | |
|---|---|---|---|
| 输入句子 | 2000 | 359 | 2359 |
| 参考简化版本 | 16000 | 2872 | 18872 |
数据集创建
策划理由
TURK 数据集是为了评估文本简化任务而构建的。它包含多个人工编写的参考,仅关注词汇简化。
源数据
初始数据收集和规范化
数据集中的输入句子来自 Parallel Wikipedia Simplification (PWKP) corpus。
源语言生产者
参考来自 Amazon Mechanical Turk 的众包工作者。
标注
标注过程
标注者收到的指示在论文中提供。
标注者
标注者是 Amazon Mechanical Turk 工作者。
个人和敏感信息
由于数据集来自公开的英语维基百科(2009 年 8 月 22 日版本),数据集中的所有信息均已公开。
使用数据的注意事项
数据集的社会影响
该数据集有助于推动文本简化研究,提高书面文档的可访问性。
讨论偏见
数据集可能包含一些社会偏见,因为输入句子基于维基百科。
其他已知限制
由于数据集仅包含 2,359 个来自维基百科的句子,因此仅限于维基百科上的一小部分主题。
附加信息
数据集策展人
TURK 由宾夕法尼亚大学的研究人员开发。
许可信息
GNU General Public License v3.0
引用信息
bibtex @article{Xu-EtAl:2016:TACL, author = {Wei Xu and Courtney Napoles and Ellie Pavlick and Quanze Chen and Chris Callison-Burch}, title = {Optimizing Statistical Machine Translation for Text Simplification}, journal = {Transactions of the Association for Computational Linguistics}, volume = {4}, year = {2016}, url = {https://cocoxu.github.io/publications/tacl2016-smt-simplification.pdf}, pages = {401--415} }
贡献
感谢 @mounicam 添加此数据集。




