OpenDoc-Null-6K
Source: ModelScope community (updated 2025-12-03, indexed 2025-11-03)
Download link:
https://modelscope.cn/datasets/prithivMLmods/OpenDoc-Null-6K
Description:
# OpenDoc-Null-6K
The **OpenDoc-Null-6K** dataset is curated for image-to-text tasks, particularly OCR (Optical Character Recognition) on scanned document images. It contains 6,910 images in a structured `imagefolder` format suitable for training models on document parsing, PDF image understanding, and layout/text extraction.
| **Attribute** | **Value** |
|---------------|------------------------|
| Task | Image-to-Text |
| Modality | Image |
| Format | ImageFolder |
| Language | English |
| License | Apache 2.0 |
| Size | 1K - 10K samples |
| Split | train (6,910 samples) |
### Key Features
* Contains **6.91k** training samples of document-style images.
* Each sample is an **image**, with no associated text or label (raw OCR input).
* Dataset is auto-converted to **Parquet** format by Hugging Face for efficient streaming and processing.
* Suitable for OCR research, PDF document parsing, and code/text recognition tasks.
## Usage
You can load the dataset using the Hugging Face `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/OpenDoc-Null-6K")
```
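Because the full download is several gigabytes, it can be convenient to stream the split and inspect a few samples first. The following is a minimal sketch, not part of the original card: the helper names `open_stream` and `describe_samples` are hypothetical, and it assumes the image column is named `image` and decodes to an object with a `.size` attribute, as is usual for `imagefolder` datasets.

```python
from itertools import islice


def open_stream(repo_id="prithivMLmods/OpenDoc-Null-6K", split="train"):
    """Open the split lazily instead of downloading everything up front."""
    # Imported here so describe_samples stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def describe_samples(samples, limit=3):
    """Return the (width, height) of the first `limit` samples.

    Assumes each sample is a dict whose "image" entry exposes `.size`
    (a PIL image in the usual imagefolder decoding).
    """
    return [tuple(sample["image"].size) for sample in islice(samples, limit)]
```

Calling `describe_samples(open_stream())` would then fetch only the first three images. Since the samples carry no text or labels, the image is the only field this sketch reads.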
## File Size
* **Total download size**: \~2.72 GB
* **Auto-converted Parquet size**: \~2.71 GB
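
For planning storage or bandwidth, the figures above imply an average size per image. This is a rough back-of-the-envelope sketch that assumes "GB" means 10^9 bytes and uses the 6,910-sample count from the table; individual images may differ considerably from the average.

```python
# Figures quoted in the card; "GB" is assumed to mean 10**9 bytes.
TOTAL_DOWNLOAD_GB = 2.72
NUM_SAMPLES = 6910


def average_sample_mb(total_gb, num_samples):
    """Average size per sample in MB (10**6 bytes)."""
    return total_gb * 1000 / num_samples


avg_mb = average_sample_mb(TOTAL_DOWNLOAD_GB, NUM_SAMPLES)
print(f"~{avg_mb:.2f} MB per image on average")
```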
## License
This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
Provider:
maas
Created:
2025-09-11



