
olmOCR-synthmix-1025

Source: ModelScope community. Updated 2025-12-04; indexed 2025-11-29.
Download link:
https://modelscope.cn/datasets/allenai/olmOCR-synthmix-1025
Resource overview:
# olmOCR-synthmix-1025

olmOCR-synthmix-1025 is a dataset of 2,186 single PDF pages that have been synthetically rerendered into HTML by `claude-sonnet-4-20250514`.

In total, across these PDF pages, 30,381 synthetic benchmark cases have been created, following the format of [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench).

These documents contain no overlap with the original olmOCR-bench documents, and thus can be used as RLVR training data to improve the performance of OCR engines.

## Directory Structure

```
olmocr-synthmix-1025/
├── bench_data/                # olmOCR-bench format benchmark data
│   ├── *.jsonl                # olmOCR-bench test cases (5 files)
│   │   ├── arxiv_cs.jsonl
│   │   ├── arxiv_math.jsonl
│   │   ├── arxiv_physics.jsonl
│   │   ├── general.jsonl
│   │   └── tables.jsonl
│   ├── pdfs/                  # olmOCR-bench PDF files (2,187 files)
│   │   ├── arxiv_cs/          (337 files)
│   │   ├── arxiv_math/        (342 files)
│   │   ├── arxiv_physics/     (102 files)
│   │   ├── general/           (952 files)
│   │   └── tables/            (454 files)
│   └── claude_original/       # Original Claude OCR outputs (2,197 files)
│       ├── arxiv_cs/          (337 files)
│       ├── arxiv_math/        (342 files)
│       ├── arxiv_physics/     (102 files)
│       ├── general/           (956 files)
│       └── tables/            (460 files)
│
├── html/                      # HTML renders of each PDF page from Claude Sonnet (2,197 files)
│   ├── arxiv_cs/              (337 files)
│   ├── arxiv_math/            (342 files)
│   ├── arxiv_physics/         (102 files)
│   ├── general/               (956 files)
│   └── tables/                (460 files)
│
├── metadata/                  # Additional metadata for each PDF, contains original URLs (5 files)
│   ├── arxiv_cs.jsonl
│   ├── arxiv_math.jsonl
│   ├── arxiv_physics.jsonl
│   ├── general.jsonl
│   └── tables.jsonl
│
├── pdfs/                      # Side-by-side original and HTML-rerendered PDFs (4,394 files)
│   ├── arxiv_cs/              (674 files)
│   ├── arxiv_math/            (684 files)
│   ├── arxiv_physics/         (204 files)
│   ├── general/               (1,912 files)
│   └── tables/                (920 files)
│
└── training/                  # Training data with markdown + PDFs (4,394 files)
    ├── arxiv_cs/              (674 files: .md + .pdf pairs)
    ├── arxiv_math/            (684 files: .md + .pdf pairs)
    ├── arxiv_physics/         (204 files: .md + .pdf pairs)
    ├── general/               (1,912 files: .md + .pdf pairs)
    └── tables/                (920 files: .md + .pdf pairs)
```

### How to use this dataset

1. You may test your own OCR model's performance on this dataset. It is a perfectly valid olmOCR-bench style benchmark that can be run using the standard olmOCR-bench tools located [here](https://github.com/allenai/olmocr/tree/main/olmocr/bench).
2. You may also choose to train your model using GRPO or similar techniques on this data. See the [olmOCR trainer code](https://github.com/allenai/olmocr/tree/main/olmocr/train) for more details.

### How this dataset was made

Please see the [mine_html_templates.py](https://github.com/allenai/olmocr/blob/main/olmocr/bench/synth/mine_html_templates.py) script in the olmOCR repo.

This script was run against 5 different subsets of PDFs.

`arxiv_cs`, `arxiv_math`, and `arxiv_physics` are each samples of recent arXiv papers from those subsets.

`general` consists of files sampled from the same internal crawl of web PDFs that olmOCR-mix-0225 uses.

`tables` consists of files sampled from the same internal crawl of web PDFs that olmOCR-mix-0225 uses, but filtered to pages that include a table, using a script that prompts gpt-4o.

# License

This dataset is licensed under ODC-BY. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
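
The file counts given in the directory structure above can be checked against a local copy of the dataset. The following is a minimal sketch, not part of the olmOCR tooling: it assumes the dataset has been downloaded to a local folder named `olmocr-synthmix-1025` (a hypothetical path), and it assumes each line of the `bench_data/*.jsonl` files is a JSON object with a `type` field, as in olmOCR-bench test cases. Lines without that field are tallied as `unknown`.

```python
import json
from collections import Counter
from pathlib import Path

# Subset names taken from the directory listing in this README.
SUBSETS = ["arxiv_cs", "arxiv_math", "arxiv_physics", "general", "tables"]


def count_files(root: Path, folder: str) -> dict[str, int]:
    """Count regular files under root/folder/<subset>/ for each subset.

    A subset directory that does not exist is reported as 0 files.
    """
    counts = {}
    for subset in SUBSETS:
        subset_dir = root / folder / subset
        if subset_dir.is_dir():
            counts[subset] = sum(1 for p in subset_dir.rglob("*") if p.is_file())
        else:
            counts[subset] = 0
    return counts


def count_cases(jsonl_path: Path) -> Counter:
    """Tally the test cases in one JSONL file by their "type" field.

    The "type" key is an assumption about the record schema; records
    without it are counted under "unknown". Blank lines are skipped.
    """
    tally = Counter()
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                tally[json.loads(line).get("type", "unknown")] += 1
    return tally


if __name__ == "__main__":
    root = Path("olmocr-synthmix-1025")  # hypothetical local download path
    folders = ["bench_data/pdfs", "bench_data/claude_original", "html", "pdfs", "training"]
    for folder in folders:
        counts = count_files(root, folder)
        print(folder, counts, "total:", sum(counts.values()))
    for subset in SUBSETS:
        path = root / "bench_data" / f"{subset}.jsonl"
        if path.exists():
            print(path.name, dict(count_cases(path)))
```

Comparing the printed totals with the listing is a quick way to spot an incomplete download before running a benchmark or training job.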

Provider: maas
Created: 2025-10-22