Corvus-OCR-Caption-Mini-Mix
Source: ModelScope community. Updated 2025-12-03; indexed 2025-07-19.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Corvus-OCR-Caption-Mini-Mix
Resource description:
# **Corvus-OCR-Caption-Mini-Mix**
**Corvus-OCR-Caption-Mini-Mix** is a high-quality, compact image-caption dataset designed for training and evaluating image-to-text models. It is a carefully curated subset of the larger [`BLIP3o/BLIP3o-Pretrain-Long-Caption`](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), optimized for mixed OCR and long-form captioning tasks.
## Dataset Summary
This dataset contains a balanced mix of:
* Long-form natural language captions
* OCR-heavy samples with scientific, mathematical, and document-style content
It is especially suitable for models that require both visual understanding and textual reasoning, such as those used in document intelligence, scientific paper analysis, and complex scene captioning.
## Features
* **Images:** Diverse content, including real-world scenes, street views, documents, and mathematical notation
* **Captions:** Textual descriptions or OCR content, often in LaTeX or natural language
* **Languages:** Primarily English and Chinese
* **Data Format:** Arrow
* **License:** Apache 2.0
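Because captions may be LaTeX, natural language, English, or Chinese, it can help to tag each caption before mixing or filtering. The following is a minimal sketch, not part of the dataset's tooling: the function names and the regular expressions are illustrative assumptions, and the LaTeX check is a rough heuristic (plain text with two dollar signs, for instance, would be flagged).

```python
import re

# Heuristic markers only; the dataset card does not define how captions are typed.
# Matches a LaTeX command (\frac), inline math ($...$), or \( / \[ delimiters.
_LATEX_PATTERN = re.compile(r"\\[a-zA-Z]+|\$[^$]+\$|\\\(|\\\[")
# CJK Unified Ideographs block, used here as a proxy for Chinese text.
_CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def looks_like_latex(text: str) -> bool:
    """Return True if the caption contains a LaTeX command or math delimiters."""
    return bool(_LATEX_PATTERN.search(text))


def contains_chinese(text: str) -> bool:
    """Return True if the caption contains at least one CJK Unified Ideograph."""
    return bool(_CJK_PATTERN.search(text))


def describe_caption(text: str) -> dict:
    """Summarize a caption with the two heuristic flags and its length."""
    return {
        "latex": looks_like_latex(text),
        "chinese": contains_chinese(text),
        "n_chars": len(text),
    }
```

Applied to the `text` field of each row, `describe_caption` gives a quick way to estimate how many samples are OCR-style versus descriptive before committing to a training mix.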
## Dataset Details
* **Split:** `train` only
* **Rows:** 78,964
* **Size:** 850 MB (raw), 847 MB (Parquet)
| Column | Type | Description |
| ------ | ------ | ----------------------------------- |
| image | image | Input image |
| text | string | Corresponding caption or OCR output |
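Since the dataset ships only a `train` split, evaluation needs a held-out portion carved out by the user. The sketch below shows one possible approach under stated assumptions: it assumes a Hugging Face repository exists under the same id as the ModelScope path, uses the `datasets` library's streaming mode to avoid downloading all 78,964 rows up front, and assigns rows to a hold-out set by hashing the caption. The helper names and the 2% default are hypothetical choices, not something the dataset card prescribes.

```python
import hashlib
from typing import Iterable, Iterator

# Assumption: the Hugging Face repo id mirrors the ModelScope path on the card.
REPO_ID = "prithivMLmods/Corvus-OCR-Caption-Mini-Mix"


def stream_rows(repo_id: str = REPO_ID) -> Iterator[dict]:
    """Yield rows as dicts with 'image' and 'text' keys, without a full download."""
    from datasets import load_dataset  # third-party; imported lazily

    yield from load_dataset(repo_id, split="train", streaming=True)


def is_holdout(text: str, holdout_fraction: float = 0.02) -> bool:
    """Deterministically assign a row to the hold-out set by hashing its caption."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < holdout_fraction


def split_rows(rows: Iterable[dict], holdout_fraction: float = 0.02):
    """Partition rows into (train, holdout) lists; intended for modest samples."""
    train, holdout = [], []
    for row in rows:
        target = holdout if is_holdout(row["text"], holdout_fraction) else train
        target.append(row)
    return train, holdout
```

Hashing the caption rather than drawing random numbers keeps the assignment stable across runs, and rows with identical captions always land on the same side of the split. For example, `split_rows(itertools.islice(stream_rows(), 1000))` would partition the first thousand streamed rows.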
## Use Cases
* Image-to-text pretraining
* OCR-based captioning
* Scientific document modeling
* Evaluation of multimodal reasoning
## Citation
If you use this dataset, please cite the original dataset:
> **BLIP3o/BLIP3o-Pretrain-Long-Caption**
> [https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)
And reference this curated derivative:
> **Corvus-OCR-Caption-Mini-Mix by prithivMLmods**
## Related Collection
This dataset is part of the [`Corvus-OCR-Caption-Mix`](https://huggingface.co/collections/prithivMLmods/Corvus-OCR-Caption-Mix) collection, which includes multiple variants supporting variable-dimensional image-text pretraining.
Provider:
maas
Created:
2025-07-13



