DEFT corpus
Dataset overview
Dataset name
DEFT corpus
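Dataset description
The DEFT corpus is the largest expert-annotated corpus for complex definition extraction. The dataset is associated with SemEval 2020 Task 6 (DeftEval). Training and development data are currently available; test data will be released after the SemEval evaluation period ends on February 2, 2020. The data is drawn from textbook text on https://cnx.org.
Dataset version updates
The latest version was updated on January 16, 2020.
Data format
The data uses the CoNLL 2003 format, with the following columns: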
TOKEN TXT_SOURCE_FILE START_CHAR END_CHAR TAG TAG_ID ROOT_ID RELATION
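As a rough illustration of how the column layout above might be read, here is a minimal Python sketch. It assumes that columns are tab-separated and that blank lines separate sentences, neither of which is stated in this description. The names `DeftToken`, `parse_deft`, and `sample_lines` are hypothetical, and the sample rows are invented placeholders, not real corpus data.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Column order as listed in the format description above.
COLUMNS = ("TOKEN", "TXT_SOURCE_FILE", "START_CHAR", "END_CHAR",
           "TAG", "TAG_ID", "ROOT_ID", "RELATION")


@dataclass
class DeftToken:
    """One row of the file (hypothetical container, not part of the corpus)."""
    token: str
    source_file: str
    start_char: int
    end_char: int
    tag: str
    tag_id: str
    root_id: str
    relation: str


def parse_deft(lines: Iterable[str]) -> Iterator[List[DeftToken]]:
    """Yield sentences as lists of DeftToken.

    Assumptions: fields are tab-separated, and a blank line ends a sentence.
    """
    sentence: List[DeftToken] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}"
            )
        token, source, start, end, tag, tag_id, root_id, relation = fields
        sentence.append(
            DeftToken(token, source, int(start), int(end),
                      tag, tag_id, root_id, relation)
        )
    # Flush the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented placeholder rows, for illustration only.
sample_lines = [
    "Cells\texample.deft\t0\t5\tB-Term\tT1\t-1\t0\n",
    "divide\texample.deft\t6\t12\tO\t-1\t-1\t0\n",
    "\n",
    "Next\texample.deft\t14\t18\tO\t-1\t-1\t0\n",
]

sentences = list(parse_deft(sample_lines))
for sent in sentences:
    print([(t.token, t.tag) for t in sent])
```

To read an actual file, an open file handle could be passed in place of `sample_lines`. The column separator may need adjusting if the released files differ from the assumptions above.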
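License information
The dataset is released under the CC BY-NC-SA 4.0 license. For commercial use, contact the authors.
Citation information
If you use this dataset in a publication, please cite the following paper: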
@inproceedings{spala-etal-2019-deft,
    title = "{DEFT}: A corpus for definition extraction in free- and semi-structured text",
    author = "Spala, Sasha and Miller, Nicholas A. and Yang, Yiming and Dernoncourt, Franck and Dockhorn, Carl",
    booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4015",
    pages = "124--131",
    abstract = "Definition extraction has been a popular topic in NLP research for well more than a decade, but has been historically limited to well-defined, structured, and narrow conditions. In reality, natural language is messy, and messy data requires both complex solutions and data that reflects that reality. In this paper, we present a robust English corpus and annotation schema that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.",
}




