Download link:

https://modelscope.cn/datasets/allenai/olmOCR-synthmix-1025

Resource description:

# olmOCR-synthmix-1025

olmOCR-synthmix-1025 is a dataset of 2,186 single PDF pages that have been synthetically rerendered into HTML by `claude-sonnet-4-20250514`. Across these PDF pages, 30,381 synthetic benchmark cases have been created, following the format of [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench). These documents do not overlap with the original olmOCR-bench documents, so they can be used as RLVR training data to improve the performance of OCR engines.

## Directory Structure

```
olmocr-synthmix-1025/
├── bench_data/                  # olmOCR-bench format benchmark data
│   ├── *.jsonl                  # olmOCR-bench test cases (5 files)
│   │   ├── arxiv_cs.jsonl
│   │   ├── arxiv_math.jsonl
│   │   ├── arxiv_physics.jsonl
│   │   ├── general.jsonl
│   │   └── tables.jsonl
│   ├── pdfs/                    # olmOCR-bench PDF files (2,187 files)
│   │   ├── arxiv_cs/ (337 files)
│   │   ├── arxiv_math/ (342 files)
│   │   ├── arxiv_physics/ (102 files)
│   │   ├── general/ (952 files)
│   │   └── tables/ (454 files)
│   └── claude_original/         # Original Claude OCR outputs (2,197 files)
│       ├── arxiv_cs/ (337 files)
│       ├── arxiv_math/ (342 files)
│       ├── arxiv_physics/ (102 files)
│       ├── general/ (956 files)
│       └── tables/ (460 files)
│
├── html/                        # HTML renders of each PDF page from Claude Sonnet (2,197 files)
│   ├── arxiv_cs/ (337 files)
│   ├── arxiv_math/ (342 files)
│   ├── arxiv_physics/ (102 files)
│   ├── general/ (956 files)
│   └── tables/ (460 files)
│
├── metadata/                    # Additional metadata for each PDF, contains original URLs (5 files)
│   ├── arxiv_cs.jsonl
│   ├── arxiv_math.jsonl
│   ├── arxiv_physics.jsonl
│   ├── general.jsonl
│   └── tables.jsonl
│
├── pdfs/                        # Side-by-side Original and HTML-rerendered PDFs (4,394 files)
│   ├── arxiv_cs/ (674 files)
│   ├── arxiv_math/ (684 files)
│   ├── arxiv_physics/ (204 files)
│   ├── general/ (1,912 files)
│   └── tables/ (920 files)
│
└── training/                    # Training data with markdown + PDFs (4,394 files)
    ├── arxiv_cs/ (674 files: .md + .pdf pairs)
    ├── arxiv_math/ (684 files: .md + .pdf pairs)
    ├── arxiv_physics/ (204 files: .md + .pdf pairs)
    ├── general/ (1,912 files: .md + .pdf pairs)
    └── tables/ (920 files: .md + .pdf pairs)
```

### How to use this dataset

1. You may test your own OCR model's performance on this dataset. It is a perfectly valid olmOCR-bench style benchmark that can be run using the standard olmOCR-bench tools located [here](https://github.com/allenai/olmocr/tree/main/olmocr/bench).
2. You may also choose to train your model on this data using GRPO or similar techniques. See the [olmOCR trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train) for more details.

### How this dataset was made

Please see the [mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/synth/mine_html_templates.py) script in the olmOCR repo. This script was run against 5 different subsets of PDFs:

- `arxiv_cs`, `arxiv_math`, and `arxiv_physics` are each samples of recent arXiv papers from those subsets.
- `general` consists of files sampled from the same internal crawl of web PDFs that olmOCR-mix-0225 uses.
- `tables` consists of files sampled from the same internal crawl of web PDFs that olmOCR-mix-0225 uses, but filtered to pages that include a table using a script that prompts gpt-4o.

# License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
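
As a quick sanity check on a local copy laid out as in the directory structure above, the following Python sketch counts the benchmark cases in each `bench_data/*.jsonl` file and pairs up the `.md` and `.pdf` files under `training/`. It is only an illustration and rests on assumptions not stated in the dataset card: that each non-empty JSONL line is one benchmark case, that a training pair shares a file stem (for example `x.md` and `x.pdf`), and that the dataset was downloaded to a hypothetical local folder named `olmocr-synthmix-1025`. The helper names `count_bench_cases` and `pair_training_files` are made up for this example.

```python
import json
from pathlib import Path

# Subset names taken from the directory listing in the dataset card.
SUBSETS = ["arxiv_cs", "arxiv_math", "arxiv_physics", "general", "tables"]


def count_bench_cases(root: Path) -> dict[str, int]:
    """Count non-empty JSON lines in each bench_data/<subset>.jsonl file.

    Assumption: one JSON object per line corresponds to one benchmark case.
    Missing files are reported as 0 rather than raising.
    """
    counts: dict[str, int] = {}
    for subset in SUBSETS:
        path = root / "bench_data" / f"{subset}.jsonl"
        total = 0
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    if line.strip():
                        json.loads(line)  # fail loudly on a malformed record
                        total += 1
        counts[subset] = total
    return counts


def pair_training_files(root: Path, subset: str) -> list[tuple[Path, Path]]:
    """Return (markdown, pdf) pairs in training/<subset>/.

    Assumption: the two files of a pair share the same stem.
    Markdown files without a matching PDF are skipped.
    """
    folder = root / "training" / subset
    if not folder.is_dir():
        return []
    pairs = []
    for md_path in sorted(folder.glob("*.md")):
        pdf_path = md_path.with_suffix(".pdf")
        if pdf_path.is_file():
            pairs.append((md_path, pdf_path))
    return pairs


if __name__ == "__main__":
    root = Path("olmocr-synthmix-1025")  # hypothetical local download location
    counts = count_bench_cases(root)
    for subset in SUBSETS:
        n_pairs = len(pair_training_files(root, subset))
        print(f"{subset}: {counts[subset]} bench cases, {n_pairs} training pairs")
    print("total bench cases:", sum(counts.values()))
```

If the assumptions hold, the printed total can be compared against the 30,381 cases stated in the card, and the per-subset pair counts against half of the file counts listed under `training/`.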

Use cases: