ocr-annotations
Source: ModelScope community. Updated 2026-01-07; indexed 2025-11-03.
Download link:
https://modelscope.cn/datasets/HuggingFaceFW/ocr-annotations
Resource description:

# PDF OCR Classification Dataset
This dataset contains PDF documents with annotations for OCR classification tasks.
## Dataset Description
- **Total samples**: 1620
- **Classes**: OCR (requires OCR processing), NOCR (no OCR needed)
## Dataset Structure
Each row contains:
- `filename`: Original PDF filename
- `pdf`: PDF file as binary data (using Pdf feature type)
- `class`: Binary classification label (OCR/NOCR)
- `truncation_type`: Whether the PDF is truncated or non-truncated
- `pdf_size_bytes`: Size of the PDF file in bytes
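The field list above can be turned into a lightweight per-row check. The following is a minimal sketch, assuming each row arrives as a plain dict and that `pdf` holds raw bytes as described. The helper name `check_row` and the `%PDF-` header test are illustrative additions, not part of the dataset's own tooling.

```python
EXPECTED_FIELDS = ("filename", "pdf", "class", "truncation_type", "pdf_size_bytes")
VALID_LABELS = {"OCR", "NOCR"}


def check_row(row):
    """Return a list of problems found in one row (hypothetical helper)."""
    problems = []
    # Every documented field should be present.
    for field in EXPECTED_FIELDS:
        if field not in row:
            problems.append(f"missing field: {field}")
    # The label should be one of the two documented classes.
    if row.get("class") not in VALID_LABELS:
        problems.append(f"unexpected label: {row.get('class')!r}")
    # Assumption: `pdf` is raw bytes; PDF files normally begin with "%PDF-".
    pdf = row.get("pdf")
    if isinstance(pdf, (bytes, bytearray)) and not bytes(pdf).startswith(b"%PDF-"):
        problems.append("pdf bytes do not start with %PDF-")
    return problems
```

An empty list means the row matches the documented structure. The accepted values of `truncation_type` are not listed in the source, so the sketch only checks that the field exists.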
## Class Distribution
| class | count |
| --- | --- |
| NOCR | 1393 |
| OCR | 227 |
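The distribution is imbalanced: roughly 86% of samples are NOCR and 14% are OCR. As a worked example, the sketch below computes the class shares and one common set of inverse-frequency weights from the counts above. The "balanced" weighting formula is an assumption chosen for illustration; the dataset itself does not prescribe any weighting.

```python
counts = {"NOCR": 1393, "OCR": 227}  # taken from the class distribution above

total = sum(counts.values())

# Share of each class in the dataset.
proportions = {label: n / total for label, n in counts.items()}

# "Balanced" weights: total / (number_of_classes * class_count).
# Illustrative assumption, not something the dataset specifies.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={proportions[label]:.3f}, weight={weights[label]:.3f}")
```

With these counts the minority OCR class receives the larger weight, which is the usual intent when training a classifier on skewed labels.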
## Usage
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("HuggingFaceFW/ocr-annotations")
# Access train split
train_data = dataset['train']
# Access a sample
sample = train_data[0]
pdf_bytes = sample['pdf'] # This will be bytes
label = sample['class']
```
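To inspect a document outside Python, a sample can be written to disk. This is a minimal sketch, assuming `sample['pdf']` really is raw bytes as the snippet above states; if the installed `datasets` version decodes the `Pdf` feature into a parsed object instead, the write step would need adapting. `save_pdf` is a hypothetical helper name.

```python
from pathlib import Path


def save_pdf(sample, out_dir):
    """Write one sample's PDF bytes into out_dir and return the path (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a stored filename cannot escape out_dir.
    target = out_dir / Path(sample["filename"]).name
    # Assumption: sample["pdf"] is raw bytes, as the usage snippet states.
    target.write_bytes(sample["pdf"])
    return target
```

For example, `save_pdf(train_data[0], "pdfs")` would place the first document under a local `pdfs` directory.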
## License
Please check the original data source for licensing information.

Provider:
maas
Created:
2025-10-15



