extract-0
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-03.
Download link: https://modelscope.cn/datasets/HenriqueGodoy/extract-0
# Extract-0 Document Information Extraction Dataset

This dataset contains 280,128 synthetic training examples for document information extraction. It was used to train Extract-0, a specialized 7B-parameter language model reported to outperform GPT-4 and other larger models on extraction tasks.
## Dataset Description
The Extract-0 dataset is a collection of document extraction examples generated from arXiv papers, PubMed Central articles, Wikipedia content, and FDA regulatory documents. Each example pairs a document chunk with a schema-based extraction task and the corresponding structured output.
### Dataset Statistics
- **Total extraction examples**: 280,128
- **Source documents**: 34,761 text chunks
- **Document sources**: arXiv, PubMed Central, Wikipedia, FDA databases
- **Average tokens per example**: 532-1900 tokens
- **Schema types**: Varied (objects, arrays, strings, dates, numbers)
## Files
- `train.csv`: Training examples with input schemas, expected outputs, and reference text IDs
- `documents.csv`: Source document chunks used for generating extraction examples
## Dataset Structure
### train.csv
Each row contains:
- `input`: JSON schema defining the extraction requirements
- `output`: Expected extraction result in JSON format
- `reference_text`: ID of the source document chunk (presumably the `chunk_id` in `documents.csv`)
### documents.csv
Each row contains:
- `chunk_id`: Unique identifier for the document chunk
- `text`: Raw text content (up to 2000 characters per chunk)
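The two files can be combined by looking up each training row's source chunk. The following is a minimal sketch using only the standard library. It assumes that `reference_text` holds the same value as `chunk_id`, which the column descriptions suggest but do not state outright, and it uses small in-memory stand-ins (hypothetical rows) instead of the real CSV files.

```python
import csv
import io
import json

# Hypothetical miniature versions of train.csv and documents.csv.
train_csv = io.StringIO(
    'input,output,reference_text\n'
    '"{""title"": {""type"": ""string""}}","{""title"": ""A Paper""}",5_0\n'
)
documents_csv = io.StringIO(
    'chunk_id,text\n'
    '5_0,"A Paper. This chunk holds the title."\n'
    '5_1,"A chunk that no training row refers to."\n'
)

# Index the document chunks by their identifier.
chunks = {row["chunk_id"]: row["text"] for row in csv.DictReader(documents_csv)}

# Assumption: reference_text matches chunk_id in documents.csv.
examples = []
for row in csv.DictReader(train_csv):
    examples.append(
        {
            "schema": json.loads(row["input"]),     # extraction schema
            "expected": json.loads(row["output"]),  # expected extraction
            "text": chunks.get(row["reference_text"]),  # None if no match
        }
    )

print(examples[0]["text"])
```

With the real files, the `io.StringIO` objects would be replaced by `open("train.csv", newline="", encoding="utf-8")` and the equivalent call for `documents.csv`.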
## Model Performance
Extract-0, trained with part of this dataset, achieves:
- **Mean reward**: 0.573 (vs GPT-4: 0.457)
- **JSON validity**: 89.0%
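The card does not say how JSON validity is measured. One plausible reading is the share of model outputs that parse as JSON, which could be computed as in the sketch below. The function name `json_validity_rate` and the sample outputs are illustrative, not taken from the Extract-0 evaluation code.

```python
import json


def json_validity_rate(outputs):
    """Return the fraction of strings in `outputs` that parse as JSON."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue  # not parseable, so it does not count as valid
        valid += 1
    return valid / len(outputs)


# Hypothetical model outputs: two parse, two do not.
sample_outputs = ['{"a": 1}', '{"a": ', "[]", "not json"]
rate = json_validity_rate(sample_outputs)
print(f"{rate:.1%}")
```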
## Usage
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub and select the training split
dataset = load_dataset("HenriqueGodoy/extract-0")
train_data = dataset["train"]
```
## Example
```json
{
  "input": "{\"title\": {\"type\": \"string\", \"extraction_instruction\": \"Extract the full paper title exactly as it appears.\"}}",
  "output": "{\"title\": \"Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models\"}",
  "reference_text": "5_0"
}
```
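Note that `input` and `output` are themselves JSON-encoded strings, so each needs its own decoding step. The sketch below decodes the record above and compares the field names in the output against those in the schema. The key comparison is only an illustrative sanity check, not the authors' validation procedure.

```python
import json

record = {
    "input": "{\"title\": {\"type\": \"string\", \"extraction_instruction\": \"Extract the full paper title exactly as it appears.\"}}",
    "output": "{\"title\": \"Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models\"}",
    "reference_text": "5_0",
}

# Decode the nested JSON strings into Python dictionaries.
schema = json.loads(record["input"])
output = json.loads(record["output"])

# Illustrative check: every schema field appears in the output and vice versa.
missing = set(schema) - set(output)
extra = set(output) - set(schema)

print("missing fields:", missing)
print("unexpected fields:", extra)
```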
## Methodology
The dataset was created with a memory-preserving synthetic data generation pipeline consisting of four stages:
1. **Document Processing**: Documents are chunked into 2000-character segments with a 200-character overlap.
2. **Sequential Extraction**: Chunks are processed sequentially to keep context consistent.
3. **Augmentation**: Multi-field combinations are generated with controlled token counts.
4. **Validation**: All examples are validated for JSON compliance and schema adherence.
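The chunking in the first stage can be sketched as a sliding window. This is a minimal illustration of 2000-character segments with a 200-character overlap. The function name `chunk_text` and the handling of the final, shorter segment are assumptions, since the card does not describe the original implementation.

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split `text` into segments of up to `size` characters.

    Consecutive segments share `overlap` characters.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # each window starts `step` characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 5000-character document.
document = "".join(str(i % 10) for i in range(5000))
segments = chunk_text(document)
print([len(segment) for segment in segments])
```

With the default settings each window starts 1800 characters after the previous one, so the last 200 characters of one segment reappear at the start of the next.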
Formally, for document chunks {c₁, c₂, ..., cₙ}, the extraction function E is applied sequentially: E(cᵢ) = f(cᵢ, Mᵢ₋₁), where Mᵢ₋₁ is the memory accumulated from the preceding chunks.
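The sequential formulation amounts to a fold over the chunks, with the memory passed from one step to the next. The sketch below shows that control flow only. The memory update rule (merging each extraction into a dictionary) and the toy extractor `toy_f` are assumptions made for illustration; in the actual pipeline `f` would be a model call, and the card does not specify how M is updated.

```python
def extract_sequentially(chunks, f, initial_memory=None):
    """Apply f to each chunk in order, passing along the accumulated memory."""
    memory = dict(initial_memory or {})
    results = []
    for chunk in chunks:
        extraction = f(chunk, memory)  # E(c_i) = f(c_i, M_{i-1})
        results.append(extraction)
        # Assumed update rule: M_i is M_{i-1} merged with the new extraction.
        memory = {**memory, **extraction}
    return results, memory


def toy_f(chunk, memory):
    """Hypothetical extractor: collect key=value tokens not yet in memory."""
    found = {}
    for token in chunk.split():
        if "=" in token:
            key, value = token.split("=", 1)
            if key not in memory:
                found[key] = value
    return found


results, final_memory = extract_sequentially(
    ["title=A year=2025", "year=2025 venue=arXiv"], toy_f
)
print(results)
print(final_memory)
```

In this toy run the second chunk repeats `year=2025`, and the memory from the first chunk prevents it from being extracted twice.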
## Training Configuration
Models trained on this dataset used:
- **Base model**: DeepSeek-R1-Distill-Qwen-7B
- **Fine-tuning**: LoRA (rank=16, α=32) modifying 0.53% of parameters
- **Learning rate**: 1e-4 (SFT), 5e-5 (GRPO)
- **Batch size**: 16 (SFT), 64 effective (GRPO)
- **Max sequence length**: 2048 tokens
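To make the LoRA settings concrete: in the standard LoRA formulation, a frozen weight matrix is augmented with a low-rank update scaled by α/rank, and adapting a matrix of shape d_out × d_in adds rank × (d_in + d_out) trainable parameters. The sketch below works through that arithmetic. The class `LoraSettings` and the 4096 × 4096 layer shape are hypothetical; the card does not list which modules were adapted, so the 0.53% figure cannot be reproduced from this alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    rank: int = 16   # from the training configuration above
    alpha: int = 32  # from the training configuration above

    @property
    def scaling(self) -> float:
        """Scaling applied to the low-rank update in standard LoRA."""
        return self.alpha / self.rank

    def added_params(self, d_in: int, d_out: int) -> int:
        """Trainable parameters added for one adapted d_out x d_in matrix."""
        # A has shape (rank, d_in) and B has shape (d_out, rank).
        return self.rank * (d_in + d_out)


settings = LoraSettings()

# Hypothetical layer shape, chosen only to illustrate the arithmetic.
d_in, d_out = 4096, 4096
added = settings.added_params(d_in, d_out)
fraction = added / (d_in * d_out)

print(settings.scaling)
print(added, f"{fraction:.4%}")
```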
## Citation
If you use this dataset, please cite:
```bibtex
@misc{godoy2025extract0specializedlanguagemodel,
title={Extract-0: A Specialized Language Model for Document Information Extraction},
author={Henrique Godoy},
year={2025},
eprint={2509.22906},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22906},
}
```
## License
Apache-2.0
## Contact
For questions or issues with the dataset, please open an issue in this repository.
Provider: maas
Created: 2025-10-09



