Corvus-OCR-Caption-Mix
ModelScope community. Updated 2026-01-06; indexed 2025-07-19.
Download link:
https://modelscope.cn/datasets/prithivMLmods/Corvus-OCR-Caption-Mix
Resource description:
# **Corvus-OCR-Caption-Mix**
**Corvus-OCR-Caption-Mix** is a compact, high-quality image-caption dataset for training and evaluating image-to-text models. It is a curated derivative of the larger [`BLIP3o/BLIP3o-Pretrain-Long-Caption`](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), focused on long-form captions and mixed OCR tasks across a variety of image types.
## Dataset Summary
The dataset contains more than 229,000 image-caption pairs, balanced across:
* OCR-rich documents featuring mathematical expressions, LaTeX, and technical notations
* Descriptive natural scenes and artistic visuals with long-form captions
* Bilingual content in English and Chinese
* Multi-domain examples suitable for both document and scene understanding
## Features
* **Images:** A diverse selection including handwritten formulas, printed text, documents, scenic photos, and more
* **Captions:** Long-form textual content with natural language, LaTeX, mathematical expressions, or technical information
* **Languages:** English and Chinese
* **Modality:** Image-to-Text (OCR + Caption)
* **Data Format:** Apache Arrow
* **License:** Apache 2.0
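Because captions come in both English and Chinese, it can help to bucket rows by language before training or evaluation. The sketch below is a rough heuristic, not part of the dataset tooling: the function names and the 0.3 threshold are assumptions chosen for illustration, and it only checks the CJK Unified Ideographs block.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def guess_language(text: str, threshold: float = 0.3) -> str:
    """Label a caption 'zh' when enough of its characters are CJK, else 'en'.

    The threshold is an illustrative assumption; captions that mix LaTeX with
    Chinese prose may need a lower value.
    """
    return "zh" if cjk_ratio(text) >= threshold else "en"
```

A caption made up mostly of LaTeX has few CJK characters even when the surrounding prose is Chinese, so treat the label as a first-pass filter only.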
## Dataset Details
* **Split:** `train`
* **Rows:** ~230,000
* **Storage Size:** 4.68 GB (first 49.9k rows)
* **Estimated Total Size:** more than 20 GB for all rows
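The total-size estimate can be checked by scaling the measured sample linearly. This is a minimal sketch that assumes the first 49.9k rows have a representative average row size; the constant and function names are illustrative.

```python
# Figures taken from the dataset details above.
SAMPLE_SIZE_GB = 4.68   # storage size of the measured sample
SAMPLE_ROWS = 49_900    # rows covered by that measurement
TOTAL_ROWS = 230_000    # approximate row count of the train split


def estimate_total_gb(sample_gb: float, sample_rows: int, total_rows: int) -> float:
    """Scale the sample size linearly to the full row count."""
    return sample_gb / sample_rows * total_rows


estimate = estimate_total_gb(SAMPLE_SIZE_GB, SAMPLE_ROWS, TOTAL_ROWS)
print(f"Estimated total size: {estimate:.1f} GB")  # prints "Estimated total size: 21.6 GB"
```

The result, about 21.6 GB, is consistent with the stated figure of more than 20 GB.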
| Column | Type | Description |
| ------ | ------ | --------------------------------------- |
| image | image | Input image (scene/document/screenshot) |
| text | string | Caption text or OCR-extracted content |
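With only two columns, rows are simple to consume. The sketch below streams the `train` split with the Hugging Face `datasets` library so that the full download is not required up front. The repo id is an assumption based on the ModelScope path, and `stream_rows` and `summarize_row` are hypothetical helper names, not part of any published API.

```python
from typing import Any, Dict, Iterator

# Assumed Hugging Face repo id, mirroring the ModelScope path; adjust if it differs.
DATASET_ID = "prithivMLmods/Corvus-OCR-Caption-Mix"


def stream_rows(dataset_id: str = DATASET_ID) -> Iterator[Dict[str, Any]]:
    """Yield rows lazily from the train split instead of downloading everything."""
    # Imported lazily so the helper below works without the library installed.
    from datasets import load_dataset

    yield from load_dataset(dataset_id, split="train", streaming=True)


def summarize_row(row: Dict[str, Any], preview_chars: int = 60) -> Dict[str, Any]:
    """Return the image size (if the image object exposes one) and a caption preview."""
    image = row["image"]
    text = row["text"]
    return {
        "image_size": getattr(image, "size", None),
        "text_length": len(text),
        "text_preview": text[:preview_chars],
    }
```

To inspect a first row, iterate once over `stream_rows()` and pass the row to `summarize_row`. This requires network access and `pip install datasets`.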
## Use Cases
* OCR pretraining and evaluation
* Vision-language modeling on complex document layouts
* LaTeX and mathematical expression extraction
* Long-form captioning across real-world and academic content
* Cross-lingual captioning and translation modeling
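For the LaTeX and mathematical expression use case, a cheap first step is to separate rows whose `text` looks like LaTeX from plain captions. This is a heuristic sketch: the regular expression and function names are assumptions for illustration, and the pattern can produce false positives (for example, prices written with two dollar signs).

```python
import re
from typing import Any, Dict, Iterable, List, Tuple

# Matches LaTeX commands such as \frac or \alpha, or inline math delimited by $...$.
_LATEX_PATTERN = re.compile(r"\\[A-Za-z]+|\$[^$\n]+\$")


def looks_like_latex(text: str) -> bool:
    """Heuristically flag text that contains LaTeX commands or inline math."""
    return _LATEX_PATTERN.search(text) is not None


def split_by_latex(
    rows: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Partition rows into (math-like, plain) based on their `text` column."""
    math_rows: List[Dict[str, Any]] = []
    plain_rows: List[Dict[str, Any]] = []
    for row in rows:
        target = math_rows if looks_like_latex(row["text"]) else plain_rows
        target.append(row)
    return math_rows, plain_rows
```

Only the `text` column is inspected, so the same function works on streamed rows or on any list of dictionaries with that key.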
## Related Models
Models fine-tuned on this dataset:
* [Pollux-Caption-VL-2B](https://huggingface.co/prithivMLmods/Pollux-Caption-VL-2B) – A Qwen2.5-VL-based model optimized for OCR captioning tasks
## Citation
If you use this dataset, please cite the original dataset:
> **BLIP3o/BLIP3o-Pretrain-Long-Caption**
> [https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)
And reference this curated derivative:
> **Corvus-OCR-Caption-Mix by prithivMLmods**
## Related Collections
This dataset is part of the [`Corvus-OCR-Caption-Mix`](https://huggingface.co/collections/prithivMLmods/Corvus-OCR-Caption-Mix) collection, which includes multiple subsets optimized for variable-dimension OCR captioning use cases.
Provider:
maas
Created:
2025-07-13



