OpenDoc-Null-6K

Name: OpenDoc-Null-6K
Creator: maas
Published: 2025-12-03 17:17:25
License: No description provided

ModelScope community · Updated 2025-12-03 · Indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/prithivMLmods/OpenDoc-Null-6K


Description:

# OpenDoc-Null-6K

The **OpenDoc-Null-6K** dataset is curated for tasks related to image-to-text recognition, particularly for scanned document images and OCR (Optical Character Recognition) use cases. It contains over 6,900 images in a structured `imagefolder` format suitable for training models on document parsing, PDF image understanding, and layout/text extraction tasks.

| **Attribute** | **Value**              |
|---------------|------------------------|
| Task          | Image-to-Text          |
| Modality      | Image                  |
| Format        | ImageFolder            |
| Language      | English                |
| License       | Apache 2.0             |
| Size          | 1K - 10K samples       |
| Split         | train (6,910 samples)  |

### Key Features

* Contains **6.91k** training samples of document-style images.
* Each sample is an **image**, with no associated text or label (raw OCR input).
* Dataset is auto-converted to **Parquet** format by Hugging Face for efficient streaming and processing.
* Suitable for OCR research, PDF document parsing, and code/text recognition tasks.

## Usage

You can load the dataset using the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("prithivMLmods/OpenDoc-Null-6K")
```

## File Size

* **Total download size**: ~2.72 GB
* **Auto-converted Parquet size**: ~2.71 GB

## License

This dataset is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
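Since the full download is roughly 2.7 GB, it can help to inspect a few samples before fetching everything. The sketch below is one possible way to do that with the `streaming=True` option of `load_dataset`. It assumes the image column is named `image` (the usual name for `imagefolder` datasets, but not confirmed by the card) and that each sample decodes to a PIL image with a `.size` attribute. The helper names `summarize_sizes` and `preview_stream` are illustrative, not part of the dataset or the library.

```python
from collections import Counter
from itertools import islice


def summarize_sizes(samples, limit=100):
    """Count (width, height) pairs over the first `limit` samples.

    Assumes each sample is a dict with an "image" entry exposing `.size`,
    as PIL images do. The column name is an assumption.
    """
    counts = Counter()
    for sample in islice(samples, limit):
        counts[sample["image"].size] += 1
    return counts


def preview_stream(limit=20):
    """Stream the first `limit` training samples and print common image sizes."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "prithivMLmods/OpenDoc-Null-6K",
        split="train",
        streaming=True,  # iterate without downloading the whole dataset first
    )
    for size, count in summarize_sizes(stream, limit=limit).most_common(5):
        print(size, count)
```

Calling `preview_stream()` requires network access and the `datasets` and Pillow packages. Because the samples carry no text or labels, any OCR targets would have to come from a separate annotation step.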


Provider:

maas

Created:

2025-09-11
