Phrase Detectives 3.0 corpus
Source: arXiv | Updated: 2022-10-12 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2210.05581v1
Description:
The Phrase Detectives 3.0 corpus is a large annotated corpus for coreference resolution, created by the University of Essex and partner institutions. It contains 1.4 million tokens and 383,000 markables across two text genres, fiction and Wikipedia, and pays particular attention to singletons and non-referring expressions. It was annotated through a game-with-a-purpose approach that combines crowdsourced and automatic annotation, with the aim of speeding up annotation and improving its quality. The corpus is suited to research on long-distance coreference and to training on long documents, and it addresses the limits of existing datasets in scale, domain diversity, and document length.
Provided by:
University of Essex, UK; Amazon Research, Romania; University of Regensburg, Germany; Queen Mary University of London, UK
Created:
2022-10-12



