extract-0

Name: extract-0
Creator: maas
Published: 2025-12-05 16:53:46
License: No description available

ModelScope Community · Updated 2025-12-05 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/HenriqueGodoy/extract-0

Resource description:

# Extract-0 Document Information Extraction Dataset

![Extract-0](extract-zero.png)

This dataset contains 280,128 synthetic training examples for document information extraction, used to train Extract-0, a specialized 7B parameter language model that outperforms GPT-4 and other larger models on extraction tasks.

## Dataset Description

The Extract-0 dataset represents a comprehensive collection of document extraction examples generated from diverse sources including arXiv papers, PubMed Central articles, Wikipedia content, and FDA regulatory documents. Each example pairs a document chunk with a schema-based extraction task and its corresponding structured output.

### Dataset Statistics

- **Total extraction examples**: 280,128
- **Source documents**: 34,761 text chunks
- **Document sources**: arXiv, PubMed Central, Wikipedia, FDA databases
- **Average tokens per example**: 532-1900 tokens
- **Schema types**: Varied (objects, arrays, strings, dates, numbers)

## Files

- `train.csv`: Training examples with input schemas, expected outputs, and reference text IDs
- `documents.csv`: Source document chunks used for generating extraction examples

## Dataset Structure

### train.csv

Each row contains:

- `input`: JSON schema defining the extraction requirements
- `output`: Expected extraction result in JSON format
- `reference_text`: ID linking to the source document chunk

### documents.csv

Each row contains:

- `chunk_id`: Unique identifier for the document chunk
- `text`: Raw text content (up to 2000 characters per chunk)

## Model Performance

Extract-0, trained with part of this dataset, achieves:

- **Mean reward**: 0.573 (vs GPT-4: 0.457)
- **JSON validity**: 89.0%

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("HenriqueGodoy/extract-0")
train_data = dataset["train"]
```

## Example

```python
{
    "input": "{\"title\": {\"type\": \"string\", \"extraction_instruction\": \"Extract the full paper title exactly as it appears.\"}}",
    "output": "{\"title\": \"Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models\"}",
    "reference_text": "5_0"
}
```

## Methodology

The dataset was created using a memory-preserving synthetic data generation pipeline that:

1. **Document Processing**: Documents are chunked into 2000-character segments with 200-character overlap
2. **Sequential Extraction**: Chunks processed sequentially to maintain context consistency
3. **Augmentation**: Multi-field combinations generated with controlled token counts
4. **Validation**: All examples validated for JSON compliance and schema adherence

The generation process employs a mathematical formulation where for document chunks {c₁, c₂, ..., cₙ}, the extraction function E operates sequentially: E(cᵢ) = f(cᵢ, Mᵢ₋₁), maintaining accumulated memory M across chunks.

## Training Configuration

Models trained on this dataset used:

- **Base model**: DeepSeek-R1-Distill-Qwen-7B
- **Fine-tuning**: LoRA (rank=16, α=32) modifying 0.53% of parameters
- **Learning rate**: 1e-4 (SFT), 5e-5 (GRPO)
- **Batch size**: 16 (SFT), 64 effective (GRPO)
- **Max sequence length**: 2048 tokens

## Citation

If you use this dataset, please cite:

```bibtex
@misc{godoy2025extract0specializedlanguagemodel,
  title={Extract-0: A Specialized Language Model for Document Information Extraction},
  author={Henrique Godoy},
  year={2025},
  eprint={2509.22906},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.22906},
}
```

## License

Apache-2.0

## Contact

For questions or issues with the dataset, please open an issue in this repository.
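
The `reference_text` column in `train.csv` points at a `chunk_id` in `documents.csv`, and both `input` and `output` are JSON strings. The following is a minimal sketch of how the two files might be joined and decoded, not the authors' loader. The helper names `join_examples` and `unexpected_fields` are hypothetical, and the sample document text is invented for illustration; only the column names and the training row come from the README.

```python
import json


def join_examples(train_rows, document_rows):
    """Attach the source chunk text to each training row.

    Both arguments are iterables of dicts keyed by column name,
    e.g. what csv.DictReader yields for train.csv and documents.csv.
    """
    text_by_chunk = {row["chunk_id"]: row["text"] for row in document_rows}
    joined = []
    for row in train_rows:
        joined.append({
            "schema": json.loads(row["input"]),     # extraction schema
            "expected": json.loads(row["output"]),  # expected extraction
            # None if the referenced chunk is missing from documents.csv
            "text": text_by_chunk.get(row["reference_text"]),
        })
    return joined


def unexpected_fields(schema, expected):
    """Return output keys that the schema does not declare (top level only)."""
    return sorted(set(expected) - set(schema))


# The training row mirrors the README example; the document text is made up.
train_rows = [{
    "input": json.dumps({"title": {
        "type": "string",
        "extraction_instruction": "Extract the full paper title exactly as it appears.",
    }}),
    "output": json.dumps({
        "title": "Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models",
    }),
    "reference_text": "5_0",
}]
document_rows = [{"chunk_id": "5_0", "text": "(placeholder chunk text)"}]

examples = join_examples(train_rows, document_rows)
print(examples[0]["schema"]["title"]["type"])
print(unexpected_fields(examples[0]["schema"], examples[0]["expected"]))
```

With the real files, the same functions can be fed `csv.DictReader(open("train.csv", newline="", encoding="utf-8"))` and the equivalent reader for `documents.csv`. The key check is a rough, top-level sanity test, not full schema validation.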

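The Document Processing step in the Methodology describes 2000-character chunks with a 200-character overlap. The sketch below shows one plain sliding-window reading of that description. It is an assumption, not the authors' code: the README does not say how chunk boundaries are chosen (for example, whether they respect sentence boundaries), and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=2000, overlap=200):
    """Split text into fixed-size character windows that overlap.

    Each window starts (chunk_size - overlap) characters after the
    previous one, so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so no
        # trailing chunk consists only of already-covered characters.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Here a 5000-character input yields windows starting at offsets 0, 1800, and 3600, and the last window is shorter than the others.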
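
The Methodology gives the sequential formulation E(cᵢ) = f(cᵢ, Mᵢ₋₁), where M is memory accumulated across chunks. The README does not define f or how M is updated, so the following is only an illustrative sketch of the control flow. `extract_sequentially` and `toy_extract` are hypothetical names, and the "key: value" line rule stands in for whatever extractor the actual pipeline uses.

```python
from typing import Callable, Dict, List, Tuple

Memory = Dict[str, str]
ExtractFn = Callable[[str, Memory], Dict[str, str]]


def extract_sequentially(
    chunks: List[str], extract_fn: ExtractFn
) -> Tuple[List[Dict[str, str]], Memory]:
    """Apply extract_fn to each chunk, passing memory from earlier chunks."""
    memory: Memory = {}
    results: List[Dict[str, str]] = []
    for chunk in chunks:
        # E(c_i) = f(c_i, M_{i-1}): the extractor sees a copy of the memory
        # built from all previous chunks.
        extracted = extract_fn(chunk, dict(memory))
        results.append(extracted)
        # Assumed update rule: M_i is M_{i-1} plus the new extractions.
        memory.update(extracted)
    return results, memory


def toy_extract(chunk: str, memory: Memory) -> Dict[str, str]:
    """Toy stand-in for f: read 'key: value' lines, skipping known keys."""
    found: Dict[str, str] = {}
    for line in chunk.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key and key not in memory:
            found[key] = value.strip()
    return found


results, memory = extract_sequentially(
    ["Title: A\nAuthor: B", "Author: B\nYear: 2025"], toy_extract
)
print(results)
print(memory)
```

In this toy run the second chunk repeats the author line (as overlapping chunks can), and the memory lets the extractor skip it instead of emitting a duplicate field.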

Provider: maas

Created: 2025-10-09
