coref-data/preco_raw
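PreCo Dataset
Overview
PreCo is a large-scale English dataset for coreference resolution. It is designed to embody the core challenges in coreference, such as entity representation, by alleviating the problem of low overlap between training and test sets and by enabling separate analysis of mention detection and mention clustering.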
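Dataset Details
- Data source: 38K documents and 12.5M words, drawn mostly from the vocabulary of English-speaking preschoolers.
- Experimental results: Compared with OntoNotes, a popular existing dataset, PreCo has higher training-test overlap, which makes error analysis more efficient.
- Singleton mentions: Singleton mentions are annotated, making it possible for the first time to quantify the influence of the mention detector on coreference resolution performance.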
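Data Format
- File type: two JSON Lines files, one for the training set and one for the development set.
- File content: each line is a JSON string that encodes one document.
- Fields:
  - "id": a string identifier for the document.
  - "sentences": the text, as a list of sentences, where each sentence is a list of words or punctuation marks.
  - "mention_clusters": the mention clusters of the document, as a list of clusters, where each cluster is a list of mentions and each mention is a tuple of integers [sentence_idx, begin_idx, end_idx].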
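The field layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the sample document is invented for illustration, the helper names (`read_documents`, `mention_text`, `split_clusters`) are hypothetical, and it assumes indices are 0-based with an exclusive `end_idx`, which the description above does not state and should be checked against the data.

```python
import json

# Hypothetical sample line, invented for illustration (not taken from PreCo).
SAMPLE_LINE = json.dumps({
    "id": "example_0",
    "sentences": [
        ["Tom", "has", "a", "dog", "in", "the", "garden", "."],
        ["He", "loves", "it", "."],
    ],
    "mention_clusters": [
        [[0, 0, 1], [1, 0, 1]],  # "Tom", "He"
        [[0, 2, 4], [1, 2, 3]],  # "a dog", "it"
        [[0, 5, 7]],             # "the garden" (a singleton)
    ],
})


def read_documents(lines):
    """Yield one parsed document per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def mention_text(doc, mention):
    """Return the words covered by a mention, joined by spaces.

    ASSUMPTION: indices are 0-based and end_idx is exclusive.
    """
    sentence_idx, begin_idx, end_idx = mention
    return " ".join(doc["sentences"][sentence_idx][begin_idx:end_idx])


def split_clusters(doc):
    """Separate clusters with several mentions from singleton clusters."""
    clusters = doc["mention_clusters"]
    coreferent = [c for c in clusters if len(c) > 1]
    singletons = [c for c in clusters if len(c) == 1]
    return coreferent, singletons


if __name__ == "__main__":
    for doc in read_documents([SAMPLE_LINE]):
        coreferent, singletons = split_clusters(doc)
        print(doc["id"], len(coreferent), "coreferent,", len(singletons), "singleton")
        for cluster in doc["mention_clusters"]:
            print([mention_text(doc, m) for m in cluster])
```

To read a real file, pass an open file handle for one of the two JSON Lines files to `read_documents`, since iterating a text file yields one line at a time. Keeping singleton clusters separate mirrors the dataset's stated goal of analysing mention detection apart from mention clustering.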
Citation
@inproceedings{chen-etal-2018-preco,
    title = "{P}re{C}o: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution",
    author = "Chen, Hong and Fan, Zhenhua and Lu, Hao and Yuille, Alan and Rong, Shu",
    editor = "Riloff, Ellen and Chiang, David and Hockenmaier, Julia and Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1016",
    doi = "10.18653/v1/D18-1016",
    pages = "172--181",
    abstract = "We introduce PreCo, a large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as entity representation, by alleviating the challenge of low overlap between training and test sets and enabling separated analysis of mention detection and mention clustering. To strengthen the training-test overlap, we collect a large corpus of 38K documents and 12.5M words which are mostly from the vocabulary of English-speaking preschoolers. Experiments show that with higher training-test overlap, error analysis on PreCo is more efficient than the one on OntoNotes, a popular existing dataset. Furthermore, we annotate singleton mentions making it possible for the first time to quantify the influence that a mention detector makes on coreference resolution performance. The dataset is freely available at \url{https://preschool-lab.github.io/PreCo/}.",
}




