olmOCR-synthmix-1025
Source: ModelScope community (魔搭社区). Updated 2025-12-04; indexed 2025-11-29.
Download link: https://modelscope.cn/datasets/allenai/olmOCR-synthmix-1025
Description:
# olmOCR-synthmix-1025
olmOCR-synthmix-1025 is a dataset of 2,186 individual PDF pages that have been synthetically rerendered into HTML by `claude-sonnet-4-20250514`.
In total, across these PDF pages, 30,381 synthetic benchmark cases have been created, following the format of [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench).
These documents do not overlap with the original olmOCR-bench documents, so they can be used as RLVR (reinforcement learning with verifiable rewards) training data to improve the performance of OCR engines.
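
As a quick sanity check on the case counts, the benchmark files can be tallied per subset and per case type. The following is a minimal sketch, assuming each line of a `bench_data/*.jsonl` file is one JSON object and that each object carries a `type` field as in olmOCR-bench; the field name is an assumption here, so the code falls back to `"unknown"` when it is missing. The helper name `count_cases` is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def count_cases(bench_dir):
    """Tally benchmark cases per JSONL file and per case type."""
    per_file = {}
    per_type = Counter()
    for path in sorted(Path(bench_dir).glob("*.jsonl")):
        n = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                case = json.loads(line)
                n += 1
                # "type" is an assumed field name; use a fallback if absent.
                per_type[case.get("type", "unknown")] += 1
        per_file[path.name] = n
    return per_file, per_type
```

Calling `count_cases("olmocr-synthmix-1025/bench_data")` on a local copy returns the per-file counts and a `Counter` of case types; summing the per-file counts gives a total to compare against the figure stated above.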
## Directory Structure
```
olmocr-synthmix-1025/
├── bench_data/                  # olmOCR-bench format benchmark data
│   ├── *.jsonl                  # olmOCR-bench test cases (5 files)
│   │   ├── arxiv_cs.jsonl
│   │   ├── arxiv_math.jsonl
│   │   ├── arxiv_physics.jsonl
│   │   ├── general.jsonl
│   │   └── tables.jsonl
│   ├── pdfs/                    # olmOCR-bench PDF files (2,187 files)
│   │   ├── arxiv_cs/            (337 files)
│   │   ├── arxiv_math/          (342 files)
│   │   ├── arxiv_physics/       (102 files)
│   │   ├── general/             (952 files)
│   │   └── tables/              (454 files)
│   └── claude_original/         # Original Claude OCR outputs (2,197 files)
│       ├── arxiv_cs/            (337 files)
│       ├── arxiv_math/          (342 files)
│       ├── arxiv_physics/       (102 files)
│       ├── general/             (956 files)
│       └── tables/              (460 files)
│
├── html/                        # HTML renders of each PDF page from Claude Sonnet (2,197 files)
│   ├── arxiv_cs/                (337 files)
│   ├── arxiv_math/              (342 files)
│   ├── arxiv_physics/           (102 files)
│   ├── general/                 (956 files)
│   └── tables/                  (460 files)
│
├── metadata/                    # Additional metadata for each PDF, contains original URLs (5 files)
│   ├── arxiv_cs.jsonl
│   ├── arxiv_math.jsonl
│   ├── arxiv_physics.jsonl
│   ├── general.jsonl
│   └── tables.jsonl
│
├── pdfs/                        # Side-by-side original and HTML-rerendered PDFs (4,394 files)
│   ├── arxiv_cs/                (674 files)
│   ├── arxiv_math/              (684 files)
│   ├── arxiv_physics/           (204 files)
│   ├── general/                 (1,912 files)
│   └── tables/                  (920 files)
│
└── training/                    # Training data with markdown + PDFs (4,394 files)
    ├── arxiv_cs/                (674 files: .md + .pdf pairs)
    ├── arxiv_math/              (684 files: .md + .pdf pairs)
    ├── arxiv_physics/           (204 files: .md + .pdf pairs)
    ├── general/                 (1,912 files: .md + .pdf pairs)
    └── tables/                  (920 files: .md + .pdf pairs)
```
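
After downloading, the layout above can be checked against a local copy. The sketch below counts files per subset directory and lists any `training/` entries that are missing one half of their `.md` + `.pdf` pair. It assumes the two files in a pair share the same file stem (for example `foo.md` and `foo.pdf`), which the listing does not state explicitly; the function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def count_files(root):
    """Return {subset_name: number_of_files} for each subdirectory of root."""
    return {
        sub.name: sum(1 for p in sub.iterdir() if p.is_file())
        for sub in sorted(Path(root).iterdir())
        if sub.is_dir()
    }


def unpaired_training_files(training_dir):
    """List path stems under training_dir that lack the .md or the .pdf file."""
    seen = defaultdict(set)
    for path in Path(training_dir).rglob("*"):
        if path.is_file() and path.suffix in {".md", ".pdf"}:
            # Assumption: a pair is two files that differ only in suffix.
            seen[path.with_suffix("")].add(path.suffix)
    return sorted(
        str(stem) for stem, exts in seen.items() if exts != {".md", ".pdf"}
    )
```

For example, `count_files("olmocr-synthmix-1025/html")` can be compared with the per-subset counts in the listing, and an empty result from `unpaired_training_files("olmocr-synthmix-1025/training")` indicates that every training document has both files.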
### How to use this dataset
1. You may test your own OCR model's performance on this dataset. It is a valid olmOCR-bench style benchmark that can be run using the standard olmOCR-bench tools located [here](https://github.com/allenai/olmocr/tree/main/olmocr/bench).
2. You may also choose to train your model using GRPO or similar techniques on this data. See the [olmOCR trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train) for more details.
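
To illustrate how benchmark cases can act as a verifiable reward for GRPO-style training, here is a minimal sketch that scores an OCR output by the fraction of test cases it passes. It is not the olmOCR trainer's implementation. The case types (`present`, `absent`, `order`) and field names (`text`, `before`, `after`) are assumptions modeled on olmOCR-bench, and the exact-substring matching is a deliberate simplification; the real benchmark tools should be used for actual scoring.

```python
def case_passes(case, text):
    """Check one simplified test case against OCR output text."""
    kind = case["type"]  # assumed field name
    if kind == "present":
        return case["text"] in text
    if kind == "absent":
        return case["text"] not in text
    if kind == "order":
        first = text.find(case["before"])
        second = text.find(case["after"])
        return first != -1 and second != -1 and first < second
    raise ValueError(f"unsupported case type: {kind}")


def reward(cases, text):
    """Fraction of cases passed, usable as a scalar reward in [0, 1]."""
    if not cases:
        return 0.0
    return sum(case_passes(c, text) for c in cases) / len(cases)
```

A pass-rate reward of this kind is checkable without a learned judge, which is what makes the dataset suitable for RLVR: each sampled model output for a page can be scored directly against that page's cases.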
### How this dataset was made
Please see the [mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/synth/mine_html_templates.py) script in the olmOCR repo.
This script was run against 5 different subsets of PDFs:
- `arxiv_cs`, `arxiv_math`, and `arxiv_physics` are each samples of recent arXiv papers from those subject areas.
- `general` consists of files sampled from the same internal crawl of web PDFs that olmOCR-mix-0225 uses.
- `tables` consists of files sampled from the same internal crawl of web PDFs that olmOCR-mix-0225 uses, filtered to pages that include a table by a script that prompts gpt-4o.
## License
This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
Provider: maas
Created: 2025-10-22



