Corvus-OCR-Caption-Mini-Mix

Name: Corvus-OCR-Caption-Mini-Mix
Creator: maas
Published: 2025-12-03 17:17:25
License: no description provided

Source: ModelScope community. Updated: 2025-12-03. Indexed: 2025-07-19.

Download link:

https://modelscope.cn/datasets/prithivMLmods/Corvus-OCR-Caption-Mini-Mix

Resource description:

# **Corvus-OCR-Caption-Mini-Mix**

**Corvus-OCR-Caption-Mini-Mix** is a high-quality, compact image-caption dataset designed for training and evaluating image-to-text models. It is a carefully curated subset of the larger [`BLIP3o/BLIP3o-Pretrain-Long-Caption`](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), optimized for mixed OCR and long-form captioning tasks.

## Dataset Summary

This dataset contains a balanced mix of:

* Long-form natural language captions
* OCR-heavy samples with scientific, mathematical, and document-style content

It is especially suitable for models that require both visual understanding and textual reasoning, such as those used in document intelligence, scientific paper analysis, and complex scene captioning.

## Features

* **Images:** Diverse scenes including real-world, street views, documents, and mathematical notations
* **Captions:** Textual descriptions or OCR content, often in LaTeX or natural language
* **Languages:** Primarily English and Chinese
* **Data Format:** Arrow
* **License:** Apache 2.0

## Dataset Details

* **Split:** `train` only
* **Rows:** 78,964
* **Size:** 850 MB (raw), 847 MB (Parquet)

| Column | Type | Description |
| ------ | ------ | ----------------------------------- |
| image | image | Input image |
| text | string | Corresponding caption or OCR output |

## Use Cases

* Image-to-text pretraining
* OCR-based captioning
* Scientific document modeling
* Evaluation of multimodal reasoning

## Citation

If you use this dataset, please cite the original dataset:

> **BLIP3o/BLIP3o-Pretrain-Long-Caption**
> [https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)

And reference this curated derivative:

> **Corvus-OCR-Caption-Mini-Mix by prithivMLmods**

## Related Collection

This dataset is part of the [`Corvus-OCR-Caption-Mix`](https://huggingface.co/collections/prithivMLmods/Corvus-OCR-Caption-Mix) collection, which includes multiple variants supporting variable-dimensional image-text pretraining.
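
## Loading Sketch

The card does not include loading code, so the following is only a minimal sketch of how the `image` and `text` columns described under Dataset Details might be consumed. It assumes the dataset is also published on the Hugging Face Hub under the repository ID `prithivMLmods/Corvus-OCR-Caption-Mini-Mix` (inferred from the ModelScope path, not confirmed by the card) and that the `datasets` library is installed. The helpers `load_corvus`, `is_valid_row`, and `looks_like_latex` are hypothetical names introduced for illustration, and the LaTeX check is a crude heuristic rather than anything the dataset authors specify.

```python
from typing import Any, Dict

# Column names taken from the dataset card.
EXPECTED_COLUMNS = ("image", "text")

# Assumption: the Hub repository ID mirrors the ModelScope path.
ASSUMED_REPO_ID = "prithivMLmods/Corvus-OCR-Caption-Mini-Mix"


def load_corvus(split: str = "train"):
    """Load the dataset via Hugging Face `datasets` (only `train` is listed)."""
    # Imported lazily so the helpers below work without the library installed.
    from datasets import load_dataset

    return load_dataset(ASSUMED_REPO_ID, split=split)


def is_valid_row(row: Dict[str, Any]) -> bool:
    """Return True if the row has an image and a non-empty string caption."""
    return (
        all(col in row for col in EXPECTED_COLUMNS)
        and row["image"] is not None
        and isinstance(row["text"], str)
        and bool(row["text"].strip())
    )


def looks_like_latex(text: str) -> bool:
    """Crude heuristic (an assumption, not part of the dataset) for LaTeX-style OCR text."""
    markers = ("\\frac", "\\begin{", "\\sum", "\\int", "$")
    return any(marker in text for marker in markers)


if __name__ == "__main__":
    dataset = load_corvus()
    for row in dataset.select(range(5)):
        kind = "latex-like" if looks_like_latex(row["text"]) else "natural language"
        print(is_valid_row(row), kind, row["text"][:80])
```

Splitting captions by a heuristic like `looks_like_latex` is one way to inspect the mix of OCR-heavy and long-form samples before training; a stricter filter would be needed for anything beyond a quick look.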

Provider:

maas

Created:

2025-07-13
