kinky69/conll04
Hugging Face · updated 2026-04-11 · indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/kinky69/conll04
Resource description:
---
dataset_info:
  features:
  - name: entities
    list:
    - name: end
      dtype: int64
    - name: start
      dtype: int64
    - name: type
      dtype: string
  - name: tokens
    sequence: string
  - name: relations
    list:
    - name: head
      dtype: int64
    - name: tail
      dtype: int64
    - name: type
      dtype: string
  - name: orig_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 358752
    num_examples: 922
  - name: validation
    num_bytes: 94688
    num_examples: 231
  - name: test
    num_bytes: 114248
    num_examples: 288
  download_size: 204955
  dataset_size: 567688
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- token-classification
language:
- en
tags:
- relation-extraction
pretty_name: CoNLL04
size_categories:
- 1K<n<10K
---
# Dataset Card for CoNLL04
## Dataset Description
- **Repository:** https://github.com/lavis-nlp/spert
- **Paper:** https://aclanthology.org/W04-2401/
- **Benchmark:** https://paperswithcode.com/sota/relation-extraction-on-conll04
### Dataset Summary
The CoNLL04 dataset is a benchmark dataset used for relation extraction tasks. It contains 1,437 sentences, each of which has at least one relation. The sentences are annotated with information about entities and their corresponding relation types.
The data in this repository was converted from CoNLL04 format to JSONL format with the script at https://github.com/lavis-nlp/spert/blob/master/scripts/conversion/convert_conll04.py
The original data can be found here: https://cogcomp.seas.upenn.edu/page/resource_view/43
The sentences in this dataset are tokenized and are annotated with entities (`Peop`, `Loc`, `Org`, `Other`) and relations (`Located_In`, `Work_For`, `OrgBased_In`, `Live_In`, `Kill`).
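As a quick sanity check, the split sizes declared in the card metadata can be summed and turned into proportions. The sketch below only uses the `num_examples` values from the metadata. Note that they add up to 1,441 examples, slightly more than the 1,437 sentences quoted in the summary; the card does not explain the difference.

```python
# Split sizes as declared in the card metadata (num_examples per split).
split_sizes = {"train": 922, "validation": 231, "test": 288}

total = sum(split_sizes.values())
# Share of each split, rounded to three decimals.
shares = {name: round(count / total, 3) for name, count in split_sizes.items()}

print(total)   # 1441
print(shares)
```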
### Languages
The language in the dataset is English.
## Dataset Structure
### Dataset Instances
An example from the `train` split looks as follows:
```json
{
  "tokens": ["Newspaper", "`", "Explains", "'", "U.S.", "Interests", "Section", "Events", "FL1402001894", "Havana", "Radio", "Reloj", "Network", "in", "Spanish", "2100", "GMT", "13", "Feb", "94"],
  "entities": [
    {"type": "Loc", "start": 4, "end": 5},
    {"type": "Loc", "start": 9, "end": 10},
    {"type": "Org", "start": 10, "end": 13},
    {"type": "Other", "start": 15, "end": 17},
    {"type": "Other", "start": 17, "end": 20}
  ],
  "relations": [
    {"type": "OrgBased_In", "head": 2, "tail": 1}
  ],
  "orig_id": 3255
}
```
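To make the indexing concrete, here is a minimal sketch that resolves the entity spans and the relation of the example above into readable text. The helper names `entity_text` and `relation_triples` are hypothetical and not part of any library; the record is copied from the card.

```python
# Example record copied from the dataset card.
example = {
    "tokens": ["Newspaper", "`", "Explains", "'", "U.S.", "Interests", "Section",
               "Events", "FL1402001894", "Havana", "Radio", "Reloj", "Network",
               "in", "Spanish", "2100", "GMT", "13", "Feb", "94"],
    "entities": [
        {"type": "Loc", "start": 4, "end": 5},
        {"type": "Loc", "start": 9, "end": 10},
        {"type": "Org", "start": 10, "end": 13},
        {"type": "Other", "start": 15, "end": 17},
        {"type": "Other", "start": 17, "end": 20},
    ],
    "relations": [{"type": "OrgBased_In", "head": 2, "tail": 1}],
    "orig_id": 3255,
}


def entity_text(record, entity_index):
    """Return the surface text of one entity (hypothetical helper)."""
    entity = record["entities"][entity_index]
    # `end` is exclusive, so a plain Python slice selects the span.
    return " ".join(record["tokens"][entity["start"]:entity["end"]])


def relation_triples(record):
    """Return (head text, relation type, tail text) for every relation."""
    return [
        (entity_text(record, rel["head"]), rel["type"], entity_text(record, rel["tail"]))
        for rel in record["relations"]
    ]


print(relation_triples(example))
# [('Radio Reloj Network', 'OrgBased_In', 'Havana')]
```

The `head` and `tail` values are positions in the record's own `entities` list, not token positions, which is why the lookup goes through `record["entities"]` first.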
### Data Fields
- `tokens`: the tokenized text of this example, a sequence of `string` features.
- `entities`: list of entities
  - `type`: entity type, a `string` feature.
  - `start`: start token index of the entity (inclusive), an `int64` feature.
  - `end`: end token index of the entity (exclusive), an `int64` feature.
- `relations`: list of relations
  - `type`: relation type, a `string` feature.
  - `head`: index of the head entity in `entities`, an `int64` feature.
  - `tail`: index of the tail entity in `entities`, an `int64` feature.
- `orig_id`: ID of the example carried over from the source data, an `int64` feature.
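The field descriptions imply a few invariants that can be checked before training. The following is a minimal sketch, assuming the entity and relation label sets listed in the summary; `validate_record` is a hypothetical helper, and the two toy records are made up for illustration rather than taken from the dataset.

```python
ENTITY_TYPES = {"Peop", "Loc", "Org", "Other"}
RELATION_TYPES = {"Located_In", "Work_For", "OrgBased_In", "Live_In", "Kill"}


def validate_record(record):
    """Return a list of problems found in one record (hypothetical helper)."""
    problems = []
    n_tokens = len(record["tokens"])
    n_entities = len(record["entities"])

    for i, ent in enumerate(record["entities"]):
        # Spans are token indices with an exclusive end: 0 <= start < end <= n_tokens.
        if not 0 <= ent["start"] < ent["end"] <= n_tokens:
            problems.append(f"entity {i}: span out of range")
        if ent["type"] not in ENTITY_TYPES:
            problems.append(f"entity {i}: unknown type {ent['type']!r}")

    for i, rel in enumerate(record["relations"]):
        # head and tail index into the record's own entity list.
        for role in ("head", "tail"):
            if not 0 <= rel[role] < n_entities:
                problems.append(f"relation {i}: {role} index out of range")
        if rel["type"] not in RELATION_TYPES:
            problems.append(f"relation {i}: unknown type {rel['type']!r}")

    return problems


# Toy records, invented for illustration.
good = {
    "tokens": ["Radio", "Reloj", "Network", "in", "Havana"],
    "entities": [
        {"type": "Org", "start": 0, "end": 3},
        {"type": "Loc", "start": 4, "end": 5},
    ],
    "relations": [{"type": "OrgBased_In", "head": 0, "tail": 1}],
}
bad = {
    "tokens": ["Havana"],
    "entities": [{"type": "Loc", "start": 0, "end": 2}],
    "relations": [{"type": "OrgBased_In", "head": 0, "tail": 1}],
}

print(validate_record(good))  # []
print(validate_record(bad))
```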
## Citation
**BibTeX:**
```bibtex
@inproceedings{roth-yih-2004-linear,
title = "A Linear Programming Formulation for Global Inference in Natural Language Tasks",
author = "Roth, Dan and
Yih, Wen-tau",
booktitle = "Proceedings of the Eighth Conference on Computational Natural Language Learning ({C}o{NLL}-2004) at {HLT}-{NAACL} 2004",
month = may # " 6 - " # may # " 7",
year = "2004",
address = "Boston, Massachusetts, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W04-2401",
pages = "1--8",
}
@article{eberts-ulges2019spert,
author = {Markus Eberts and
Adrian Ulges},
title = {Span-based Joint Entity and Relation Extraction with Transformer Pre-training},
journal = {CoRR},
volume = {abs/1909.07755},
year = {2019},
url = {http://arxiv.org/abs/1909.07755},
eprinttype = {arXiv},
eprint = {1909.07755},
timestamp = {Mon, 23 Sep 2019 18:07:15 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1909-07755.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
**APA:**
- Roth, D., & Yih, W. (2004). A linear programming formulation for global inference in natural language tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004 (pp. 1-8). Boston, Massachusetts, USA: Association for Computational Linguistics. https://aclanthology.org/W04-2401
- Eberts, M., & Ulges, A. (2019). Span-based joint entity and relation extraction with transformer pre-training. CoRR, abs/1909.07755. http://arxiv.org/abs/1909.07755
## Dataset Card Authors
[@phucdev](https://github.com/phucdev)
Provided by:
kinky69
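Collected summary
Dataset introduction

Construction
In natural language processing, relation extraction depends on high-quality annotated data. The CoNLL04 dataset originates from the Conference on Computational Natural Language Learning (CoNLL) and was built following standard academic annotation practice. The source data was annotated by hand, marking the named entities in news text and the semantic relations between them. The annotation scheme covers four entity types (people, locations, organizations, and other) and five relation types. The data was then converted from the original CoNLL04 format to structured JSONL so that it is machine-readable and experiments are reproducible.
Characteristics
As a classic benchmark for relation extraction, CoNLL04 contains 1,437 English sentences, each with at least one related entity pair, so every sample carries relation annotations. The annotation has two layers: entity boundaries and types, and directed semantic relations between entities. The data is split into training, validation, and test sets, which supports the full cycle of model development and evaluation. Its compact size and fine-grained annotation make it a convenient test bed for relation extraction models.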
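Because the card lists `token-classification` as its task category, the entity layer can also be flattened into one tag per token. The sketch below assumes the BIO tagging scheme, which the card does not prescribe; `to_bio_tags` is a hypothetical helper, and the tokens are an excerpt of the card's example with the span indices re-based to the excerpt.

```python
def to_bio_tags(tokens, entities):
    """Convert span annotations to one BIO tag per token (hypothetical helper)."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        # `start` is inclusive and `end` is exclusive, both counted in tokens.
        tags[ent["start"]] = "B-" + ent["type"]
        for i in range(ent["start"] + 1, ent["end"]):
            tags[i] = "I-" + ent["type"]
    return tags


tokens = ["Havana", "Radio", "Reloj", "Network", "in", "Spanish"]
entities = [
    {"type": "Loc", "start": 0, "end": 1},
    {"type": "Org", "start": 1, "end": 4},
]
print(to_bio_tags(tokens, entities))
# ['B-Loc', 'B-Org', 'I-Org', 'I-Org', 'O', 'O']
```

This flattening assumes entity spans do not overlap; if two spans shared a token, the later one would overwrite the earlier tag.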
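Usage
The dataset mainly serves research on joint entity and relation extraction. After loading the data, researchers get the tokenized text sequence and its annotations directly. Entity annotations give start and end positions as token indices (start inclusive, end exclusive), and relation annotations link entities through head and tail indices into the entity list. Typical uses include training end-to-end relation extraction models and evaluating fine-grained semantic understanding. Use the standard data splits and the evaluation metrics of the original papers so that results are reliable and comparable.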
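For comparable results, relation predictions are usually scored with precision, recall, and F1. The following is a minimal sketch of micro-averaged scoring under strict matching, where a predicted relation counts only if its type and both entity spans and entity types match the gold annotation. This is one common convention and an assumption here, since the card does not specify the metric; `relation_set` and `micro_prf` are hypothetical helpers, and the gold and predicted records are toy data.

```python
def relation_set(record):
    """Represent each relation as a hashable tuple of typed spans and a label."""
    triples = set()
    for rel in record["relations"]:
        head = record["entities"][rel["head"]]
        tail = record["entities"][rel["tail"]]
        triples.add((
            (head["start"], head["end"], head["type"]),
            rel["type"],
            (tail["start"], tail["end"], tail["type"]),
        ))
    return triples


def micro_prf(gold_records, pred_records):
    """Micro-averaged precision, recall and F1 over exact relation matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_records, pred_records):
        g, p = relation_set(gold), relation_set(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Toy gold record, invented for illustration.
gold = [{
    "entities": [
        {"type": "Org", "start": 0, "end": 3},
        {"type": "Loc", "start": 4, "end": 5},
    ],
    "relations": [{"type": "OrgBased_In", "head": 0, "tail": 1}],
}]
# Toy prediction: one correct relation plus one spurious reversed relation.
pred = [{
    "entities": gold[0]["entities"],
    "relations": [
        {"type": "OrgBased_In", "head": 0, "tail": 1},
        {"type": "Located_In", "head": 1, "tail": 0},
    ],
}]
precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)
```

Comparing typed spans rather than entity indices keeps the score independent of the order in which a model emits its entities.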
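Background and challenges
Background overview
CoNLL04 dates from 2004, when Dan Roth and Wen-tau Yih presented it at the Conference on Computational Natural Language Learning (CoNLL) as a standardized evaluation benchmark for relation extraction. It targets the automatic identification of semantic relations between entities in unstructured text, covers people, locations, organizations, and other entity types, and defines five core relations, including `Located_In` and `Work_For`. As an early foundational resource for relation extraction, CoNLL04 advanced the use of global inference and linear programming in natural language processing and provided the data basis for later joint entity and relation extraction models.
Current challenges
The core challenge CoNLL04 addresses is handling complex semantic interactions between entities: several entities in one sentence can take part in multiple relations, and relation types can overlap or be ambiguous. Building the dataset was challenging because of the annotation precision it requires: annotators had to mark entity boundaries exactly and decide each relation category, which makes annotation costly and prone to inconsistency. In addition, the dataset is small, with only 1,437 sentences, which falls short of the training-data needs of modern deep learning models and limits how well they generalize to complex contexts.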
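Common scenarios
Classic use cases
CoNLL04 is widely used as a classic relation extraction benchmark to evaluate how well models identify semantic relations between entities in text. It contains sentences annotated with entity types such as people, locations, and organizations, and with relations between them such as `Located_In` and `Work_For`, giving researchers structured training and test samples. The dataset makes it possible to examine systematically how well a model captures entity interaction patterns in complex contexts, and it has advanced information extraction research.
Practical applications
In practice, CoNLL04 supports scenarios such as news analysis, intelligence mining, and automatic knowledge base expansion. In media monitoring, for example, a model trained on it can learn to identify affiliations between people and organizations, or the locations of events, in news reports. This lets automated systems extract structured information from large volumes of text to assist decision support or summary generation, improving the efficiency and accuracy of information processing.
Derived related work
CoNLL04 has given rise to a series of well-known studies. The linear programming approach to global inference proposed by Roth and Yih laid a theoretical foundation for joint optimization in relation extraction. Later work, such as the span-based Transformer model (SpERT) developed by Eberts and Ulges, advanced the use of pre-trained language models for joint entity and relation extraction. These studies made better use of the dataset and moved information extraction toward more efficient and accurate methods.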
The content above was collected and summarized by 遇见数据集.



