olmOCR-mix-0225
Source: ModelScope community (updated 2026-04-28, indexed 2025-03-01)
Download link: https://modelscope.cn/datasets/allenai/olmOCR-mix-0225
# olmOCR-mix-0225
olmOCR-mix-0225 is a dataset of ~250,000 PDF pages that have been OCRed into plain text in natural reading order using `gpt-4o-2024-08-06` and a special prompting strategy that preserves any born-digital content from each page.
This dataset can be used to train, fine-tune, or evaluate your own OCR document pipeline.
Quick links:
- 📃 [Paper](https://olmocr.allenai.org/papers/olmocr.pdf)
- 🤗 [Model](https://huggingface.co/allenai/olmOCR-7B-0225-preview)
- 🛠️ [Code](https://github.com/allenai/olmocr)
- 🎮 [Demo](https://olmocr.allenai.org/)
## Data Mix
### Table 1: Training set composition by source
| Source | Unique docs | Total pages |
|--------|-------------|-------------|
| Web crawled PDFs | 99,903 | 249,332 |
| Internet Archive books | 5,601 | 16,803 |
| **Total** | **105,504** | **266,135** |
Web crawled PDFs are sampled from a set of over 240 million documents crawled from public websites. Books in the Internet Archive set are in the public domain.
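As a quick sanity check on Table 1, the sketch below recomputes the totals and the average number of pages per document for each source. The figures are copied from the table; the variable and function names are illustrative only.

```python
# Figures copied from Table 1: (unique docs, total pages) per source.
sources = {
    "Web crawled PDFs": (99_903, 249_332),
    "Internet Archive books": (5_601, 16_803),
}

total_docs = sum(docs for docs, _ in sources.values())
total_pages = sum(pages for _, pages in sources.values())


def pages_per_doc(docs: int, pages: int) -> float:
    """Average number of extracted pages per unique document."""
    return pages / docs


for name, (docs, pages) in sources.items():
    print(f"{name}: {pages_per_doc(docs, pages):.2f} pages per document")
print(f"Total: {total_docs:,} docs, {total_pages:,} pages")
```

Both sources average at most 3 pages per document, with the Internet Archive books at exactly 3.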
### Table 2: Web PDFs breakdown by document type
| Document type | Fraction |
|---------------|----------|
| Academic | 60% |
| Brochure | 12% |
| Legal | 11% |
| Table | 6% |
| Diagram | 5% |
| Slideshow | 2% |
| Other | 4% |
The distribution is estimated from a sample of 707 pages, which were classified using *gpt-4o-2024-11-20*.
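Because the fractions in Table 2 come from a sample of 707 pages, each carries some sampling uncertainty. The sketch below gives a rough 95% margin of error per category using the normal approximation to the binomial. It assumes the 707 pages are a simple random sample and ignores any classifier error; neither assumption is stated in the source, and the approximation is coarse for the smallest categories.

```python
import math

N_SAMPLED = 707  # pages sampled, as stated above

# Fractions copied from Table 2.
fractions = {
    "Academic": 0.60,
    "Brochure": 0.12,
    "Legal": 0.11,
    "Table": 0.06,
    "Diagram": 0.05,
    "Slideshow": 0.02,
    "Other": 0.04,
}


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


for doc_type, p in fractions.items():
    moe = margin_of_error(p, N_SAMPLED)
    print(f"{doc_type}: {p:.0%} +/- {moe:.1%}")
```

Under these assumptions, the 60% figure for academic documents is uncertain by roughly 3.6 percentage points in either direction.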
## Data Format
Each row in the dataset corresponds to a single page, extracted at random, from a source PDF and transformed into plain text.
No source PDF has had more than 3 random pages extracted from it.
Each extracted page is available as a standalone .pdf file, under the `pdf_tarballs/` directory.
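The per-document sampling rule described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the function name, the use of Python's `random` module, and the choice to sample without replacement are all assumptions. Page numbers are 1-indexed, matching the `page_number` field.

```python
import random

MAX_PAGES_PER_DOC = 3  # cap stated in the text above


def sample_page_numbers(num_pages: int, rng: random.Random) -> list[int]:
    """Return up to MAX_PAGES_PER_DOC distinct, 1-indexed page numbers."""
    if num_pages < 1:
        return []
    k = min(MAX_PAGES_PER_DOC, num_pages)
    # Sample without replacement so no page is picked twice.
    return sorted(rng.sample(range(1, num_pages + 1), k))


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(sample_page_numbers(10, rng))  # three distinct pages from a 10-page PDF
print(sample_page_numbers(2, rng))   # a 2-page PDF yields both pages
```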
### Features:
```python
{
    'url': str,                      # Original URL of the PDF document
    'page_number': int,              # Page number within the document, 1-indexed
    'id': str,                       # ID into /pdfs files folder
    'response': {                    # OCRed page information as a JSON blob
        'primary_language': str,     # Primary language of the page
        'is_rotation_valid': bool,   # Whether the page rotation is valid
        'rotation_correction': int,  # Rotation correction angle
        'is_table': bool,            # Whether the page is a table
        'is_diagram': bool,          # Whether the page is a diagram
        'natural_text': str          # The actual text of the PDF is here
    }
}
```
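A minimal sketch of reading the OCRed text out of one row, using only the standard library. The row below is invented for illustration and is not taken from the dataset. Whether `response` arrives as a JSON string or as an already-parsed dictionary depends on how the data is loaded, so the hypothetical helper `page_text` accepts both.

```python
import json
from typing import Optional

# Hypothetical row, invented for illustration; field names follow the schema above.
row = {
    "url": "https://example.com/report.pdf",
    "page_number": 2,
    "id": "example-id",
    "response": json.dumps({
        "primary_language": "en",
        "is_rotation_valid": True,
        "rotation_correction": 0,
        "is_table": False,
        "is_diagram": False,
        "natural_text": "Example page text.",
    }),
}


def page_text(row: dict) -> Optional[str]:
    """Return the OCRed text of a row, accepting `response` as str or dict."""
    response = row["response"]
    if isinstance(response, str):
        response = json.loads(response)  # decode the JSON blob
    return response.get("natural_text")


print(page_text(row))
```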
## License
This dataset is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with AI2's [Responsible Use Guidelines](https://allenai.org/responsible-use).
The responses were generated by GPT-4o, which is subject to OpenAI's [terms of use](https://openai.com/policies/row-terms-of-use).
Provider: maas
Created: 2025-05-29



