DocRED

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/thunlp/docred


Resource overview:

DocRED is a dataset for document-level relation extraction. It is split into training, development, and test sets containing 3,053, 1,000, and 1,000 documents respectively, and it covers 132,375 entities and 96 relation types. Manual analysis shows that roughly 40.7% of relational facts can only be extracted from multiple sentences, and that 61.1% of relation instances require reasoning of various kinds. The dataset was built by generating candidate triples from relation extraction models and from distant supervision based on entity linking; human annotators then labeled these candidates.
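The multi-sentence statistic above can be approximated on annotated documents by counting how many labeled facts list more than one evidence sentence. The following is a minimal sketch, not the authors' analysis script. It assumes each document is a JSON object with a `labels` list whose items carry `h`, `t`, `r`, and `evidence` fields, alongside `sents` and `vertexSet`; this layout is an assumption about the files in the linked repository and is not stated on this page. The function name `multi_sentence_share` and the toy document are hypothetical, and the evidence count is only a proxy for "can only be extracted from multiple sentences".

```python
from typing import Any


def multi_sentence_share(docs: list[dict[str, Any]]) -> float:
    """Return the fraction of labeled facts whose evidence spans more than one sentence."""
    total = 0
    multi = 0
    for doc in docs:
        # "labels" and "evidence" are assumed field names (see lead-in).
        for fact in doc.get("labels", []):
            total += 1
            if len(set(fact.get("evidence", []))) > 1:
                multi += 1
    return multi / total if total else 0.0


# Toy document in the assumed layout; this is illustrative, not real DocRED data.
toy_docs = [
    {
        "title": "Toy",
        "sents": [
            ["Alice", "was", "born", "in", "Paris", "."],
            ["She", "works", "in", "France", "."],
        ],
        "vertexSet": [
            [{"name": "Alice", "sent_id": 0, "pos": [0, 1], "type": "PER"}],
            [{"name": "Paris", "sent_id": 0, "pos": [4, 5], "type": "LOC"}],
            [{"name": "France", "sent_id": 1, "pos": [3, 4], "type": "LOC"}],
        ],
        "labels": [
            {"h": 0, "t": 1, "r": "P19", "evidence": [0]},
            {"h": 0, "t": 2, "r": "P27", "evidence": [0, 1]},
        ],
    }
]

share = multi_sentence_share(toy_docs)
print(f"{share:.1%}")
```

To run this on real data, the list of documents could be read with `json.load` from a downloaded split file, provided the file follows the layout assumed here.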

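Collected summary

Background and challenges

Background overview

DocRED is a large-scale document-level relation extraction dataset built from Wikipedia and Wikidata. It is annotated with named entities and relations, requires combining information from across a document to infer relations, and provides distantly supervised data to support multiple learning settings.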

The content above was collected and summarized by 遇见数据集.
