
extract-0

ModelScope Community. Updated 2025-12-05; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/HenriqueGodoy/extract-0
Resource description:
# Extract-0 Document Information Extraction Dataset

![Extract-0](extract-zero.png)

This dataset contains 280,128 synthetic training examples for document information extraction, used to train Extract-0, a specialized 7B parameter language model that outperforms GPT-4 and other larger models on extraction tasks.

## Dataset Description

The Extract-0 dataset represents a comprehensive collection of document extraction examples generated from diverse sources including arXiv papers, PubMed Central articles, Wikipedia content, and FDA regulatory documents. Each example pairs a document chunk with a schema-based extraction task and its corresponding structured output.

### Dataset Statistics

- **Total extraction examples**: 280,128
- **Source documents**: 34,761 text chunks
- **Document sources**: arXiv, PubMed Central, Wikipedia, FDA databases
- **Average tokens per example**: 532-1900 tokens
- **Schema types**: Varied (objects, arrays, strings, dates, numbers)

## Files

- `train.csv`: Training examples with input schemas, expected outputs, and reference text IDs
- `documents.csv`: Source document chunks used for generating extraction examples

## Dataset Structure

### train.csv

Each row contains:

- `input`: JSON schema defining the extraction requirements
- `output`: Expected extraction result in JSON format
- `reference_text`: ID linking to the source document chunk

### documents.csv

Each row contains:

- `chunk_id`: Unique identifier for the document chunk
- `text`: Raw text content (up to 2000 characters per chunk)

## Model Performance

Extract-0, trained with part of this dataset, achieves:

- **Mean reward**: 0.573 (vs GPT-4: 0.457)
- **JSON validity**: 89.0%

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("HenriqueGodoy/extract-0")
train_data = dataset["train"]
```

## Example

```python
{
    "input": "{\"title\": {\"type\": \"string\", \"extraction_instruction\": \"Extract the full paper title exactly as it appears.\"}}",
    "output": "{\"title\": \"Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models\"}",
    "reference_text": "5_0"
}
```

## Methodology

The dataset was created using a memory-preserving synthetic data generation pipeline that:

1. **Document Processing**: Documents are chunked into 2000-character segments with 200-character overlap
2. **Sequential Extraction**: Chunks processed sequentially to maintain context consistency
3. **Augmentation**: Multi-field combinations generated with controlled token counts
4. **Validation**: All examples validated for JSON compliance and schema adherence

The generation process employs a mathematical formulation where for document chunks {c₁, c₂, ..., cₙ}, the extraction function E operates sequentially: E(cᵢ) = f(cᵢ, Mᵢ₋₁), maintaining accumulated memory M across chunks.

## Training Configuration

Models trained on this dataset used:

- **Base model**: DeepSeek-R1-Distill-Qwen-7B
- **Fine-tuning**: LoRA (rank=16, α=32) modifying 0.53% of parameters
- **Learning rate**: 1e-4 (SFT), 5e-5 (GRPO)
- **Batch size**: 16 (SFT), 64 effective (GRPO)
- **Max sequence length**: 2048 tokens

## Citation

If you use this dataset, please cite:

```bibtex
@misc{godoy2025extract0specializedlanguagemodel,
  title={Extract-0: A Specialized Language Model for Document Information Extraction},
  author={Henrique Godoy},
  year={2025},
  eprint={2509.22906},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.22906},
}
```

## License

Apache-2.0

## Contact

For questions or issues with the dataset, please open an issue in this repository.

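The `train.csv` and `documents.csv` layout described under Dataset Structure implies a simple join: `reference_text` in one file points at `chunk_id` in the other, and the `input` and `output` columns hold JSON encoded as strings. The sketch below shows one way to resolve that link using only the standard library. The two in-memory CSV strings are illustrative stand-ins for the real files; the chunk text is a placeholder, and the helper names `load_chunks` and `iter_examples` are hypothetical, not part of any published tooling.

```python
import csv
import io
import json

# Stand-ins for the real files. The train row mirrors the README example;
# the document text is a placeholder, not real dataset content.
TRAIN_CSV = '''input,output,reference_text
"{""title"": {""type"": ""string"", ""extraction_instruction"": ""Extract the full paper title exactly as it appears.""}}","{""title"": ""Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models""}",5_0
'''
DOCUMENTS_CSV = '''chunk_id,text
5_0,"Placeholder text standing in for the real chunk."
'''


def load_chunks(handle):
    """Map chunk_id -> text from a documents.csv-style file."""
    return {row["chunk_id"]: row["text"] for row in csv.DictReader(handle)}


def iter_examples(handle, chunks):
    """Yield (schema, expected, source_text) with the JSON columns decoded."""
    for row in csv.DictReader(handle):
        schema = json.loads(row["input"])      # extraction schema
        expected = json.loads(row["output"])   # target extraction result
        # None if the referenced chunk is missing from documents.csv
        source_text = chunks.get(row["reference_text"])
        yield schema, expected, source_text


chunks = load_chunks(io.StringIO(DOCUMENTS_CSV))
examples = list(iter_examples(io.StringIO(TRAIN_CSV), chunks))
schema, expected, source_text = examples[0]
print(sorted(schema), expected["title"], source_text is not None)
```

For the real files, the `io.StringIO(...)` arguments would be replaced by open file handles, for example `open("train.csv", newline="", encoding="utf-8")`. Building the `chunk_id` dictionary once keeps the lookup cheap when many training rows share the same source chunk.
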
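
The first two Methodology steps (2000-character chunks with 200-character overlap, then sequential extraction E(cᵢ) = f(cᵢ, Mᵢ₋₁) with accumulated memory M) can be sketched as follows. This is a minimal illustration, not the author's pipeline: the README does not say how M is updated, so merging each result into a dictionary is an assumption, and `toy_extract` is a hypothetical stand-in for the model call.

```python
from typing import Callable, Dict, List

Memory = Dict[str, str]


def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> List[str]:
    """Split text into windows of `size` characters that share `overlap`
    characters with the preceding window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


def extract_sequentially(
    chunks: List[str], extract: Callable[[str, Memory], Memory]
) -> List[Memory]:
    """Apply E(c_i) = f(c_i, M_{i-1}) chunk by chunk."""
    memory: Memory = {}
    results = []
    for chunk in chunks:
        found = extract(chunk, dict(memory))  # f sees a copy of M_{i-1}
        results.append(found)
        memory.update(found)  # assumed update rule: M_i = M_{i-1} merged with E(c_i)
    return results


def toy_extract(chunk: str, memory: Memory) -> Memory:
    """Hypothetical extractor: report the title once, then stay silent."""
    if "title" in memory or "Title:" not in chunk:
        return {}
    line = chunk.split("Title:", 1)[1].splitlines()[0]
    return {"title": line.strip()}


# The title is placed inside the overlap region, so chunks 0 and 1 both see it.
document = "x" * 1850 + "\nTitle: A Toy Paper\n" + "x" * 3000
chunks = chunk_text(document)
results = extract_sequentially(chunks, toy_extract)
print(len(chunks), results)
```

Because the title falls in the 200 characters shared by the first two chunks, an extractor without memory would report it twice. Passing the accumulated memory lets the second call recognize that the field has already been captured, which is the consistency property the sequential formulation is meant to provide.
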
Provider:
maas
Created:
2025-10-09