
olmOCR-mix-1025

ModelScope community · Updated 2025-12-05 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/allenai/olmOCR-mix-1025
Description:
# olmOCR-mix-1025

olmOCR-mix-1025 is a dataset of ~270,000 PDF pages that have been OCRed into plain text in natural reading order, using gpt-4.1 and a special prompting strategy that preserves any born-digital content on each page.

This dataset can be used to train, fine-tune, or evaluate your own OCR document pipeline, and **all PDF pages used are included for download**.

Compared to [olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225/), this dataset includes:

- Cleaner outputs processed with gpt-4.1
- More consistent equation formatting, with `\[` and `\(` for block and inline math
- Tables in HTML format instead of Markdown
- Basic alt text for images
- More handwriting and historical documents

## Dataset Statistics

| Subset | Train | Eval | Total |
|--------|------:|-----:|------:|
| 00_documents | 231,668 | 1,122 | 232,790 |
| 01_books | 16,575 | 899 | 17,474 |
| 02_loc_transcripts | 9,891 | 98 | 9,989 |
| 03_national_archives | 9,828 | 169 | 9,997 |
| **Total** | **267,962** | **2,288** | **270,250** |

### Language Distribution

| Subset | 1st | 2nd | 3rd | 4th | 5th |
|--------|-----|-----|-----|-----|-----|
| 00_documents | en (94.46%) | es (0.58%) | fr (0.46%) | id (0.45%) | de (0.42%) |
| 01_books | en (91.28%) | fr (0.54%) | la (0.31%) | de (0.27%) | hi (0.12%) |
| 02_loc_transcripts | en (98.21%) | es (0.59%) | fr (0.46%) | de (0.45%) | it (0.11%) |
| 03_national_archives | en (99.82%) | es (0.12%) | fr (0.02%) | sv (0.01%) | de (0.01%) |

## How to use this dataset

On Hugging Face, this dataset consists of a set of .tar.gz files, around 1 GB each, which contain single-page PDF documents extracted from various sources. Parquet files are stored alongside them; these contain all of the metadata and the natural-text transcriptions that we consider to be the ground truth for this dataset.

We package the data this way so that the HF dataset does not hold millions of individual files, and so that you can analyze the data using the Dataset Viewer.
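If you have downloaded one of the .tar.gz archives by hand and only want to see which single-page PDFs it holds, the Python standard library is enough. The sketch below is a minimal illustration, not part of the olmocr toolkit: the function names, the demo archive, and the member paths inside it are all hypothetical, and the real archives may be laid out differently.

```python
import io
import os
import tarfile
import tempfile


def list_pdf_members(archive_path):
    """Return the sorted names of all regular .pdf files inside a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".pdf")
        )


def make_demo_archive(archive_path):
    """Write a tiny stand-in archive so the sketch runs without any download."""
    # Hypothetical member names; the real archives may use another layout.
    fake_files = {
        "pages/doc_a_page1.pdf": b"%PDF-1.4 demo",
        "pages/doc_b_page7.pdf": b"%PDF-1.4 demo",
        "pages/notes.txt": b"not a pdf",
    }
    with tarfile.open(archive_path, mode="w:gz") as archive:
        for name, payload in fake_files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "demo.tar.gz")
        make_demo_archive(demo_path)
        for pdf_name in list_pdf_members(demo_path):
            print(pdf_name)
```

To inspect a real archive, pass its local path to `list_pdf_members` instead of the demo file. Listing members reads only the tar headers, so nothing is written to disk.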
However, when you go to train a model on this data, you may want to pre-extract all the PDFs and documents into a local folder structure. To do that, install the olmocr toolkit and run the following:

```bash
pip install olmocr

# You can pick a specific split and subset to download, or just run all these commands in order to get everything
python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 00_documents --split train
python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 00_documents --split eval

python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 01_books --split train
python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 01_books --split eval

python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 02_loc_transcripts --split train
python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 02_loc_transcripts --split eval

python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 03_national_archives --split train
python -m olmocr.data.prepare_olmocrmix --dataset-path allenai/olmOCR-mix-1025 --destination ~/olmOCR-mix-1025-extracted --subset 03_national_archives --split eval
```

## How this dataset was made

In summary, for the `00_documents` and `01_books` subsets, PDF pages are rendered and passed to GPT-4.1, which is prompted for a high-quality direct transcription of the page into natural text.

For the `02_loc_transcripts` and `03_national_archives` subsets, we downloaded historical documents from the Library of Congress and the National Archives that come with known, high-quality, human-made transcriptions. We then prompted ChatGPT to clean up the transcriptions and remove any spurious text.

Scripts used to produce this dataset are primarily located here: https://github.com/allenai/olmocr/tree/main/olmocr/data

## License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).

Provider: maas
Created: 2025-10-22