Corvus-OCR-Caption-Mix

Name: Corvus-OCR-Caption-Mix
Creator: maas
Published: 2026-01-06 16:38:41
License: No description provided

ModelScope community; updated 2026-01-06; indexed 2025-07-19

Download link:

https://modelscope.cn/datasets/prithivMLmods/Corvus-OCR-Caption-Mix

Description:

# **Corvus-OCR-Caption-Mix**

**Corvus-OCR-Caption-Mix** is a high-quality, compact image-caption dataset designed for training and evaluating image-to-text models. This collection is derived and optimized from the larger [`BLIP3o/BLIP3o-Pretrain-Long-Caption`](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), with a focus on long-form captions and mixed OCR tasks across a variety of image types.

## Dataset Summary

The dataset spans over 229,000 image-caption pairs and provides a balanced blend of:

* OCR-rich documents featuring mathematical expressions, LaTeX, and technical notations
* Descriptive natural scenes and artistic visuals with long-form captions
* Bilingual content in English and Chinese
* Multi-domain examples suitable for both document and scene understanding

## Features

* **Images:** A diverse selection including handwritten formulas, printed text, documents, scenic photos, and more
* **Captions:** Long-form textual content with natural language, LaTeX, mathematical expressions, or technical information
* **Languages:** English and Chinese
* **Modality:** Image-to-Text (OCR + Caption)
* **Data Format:** Apache Arrow
* **License:** Apache 2.0

## Dataset Details

* **Split:** `train`
* **Rows:** ~230,000
* **Storage Size:** 4.68 GB (first 49.9k rows)
* **Estimated Total Size:** ~20+ GB for all rows

| Column | Type   | Description                             |
| ------ | ------ | --------------------------------------- |
| image  | image  | Input image (scene/document/screenshot) |
| text   | string | Caption text or OCR-extracted content   |

## Use Cases

* OCR pretraining and evaluation
* Vision-language modeling on complex document layouts
* LaTeX and mathematical expression extraction
* Long-form captioning across real-world and academic content
* Cross-lingual captioning and translation modeling

## Related Models

Models fine-tuned on this dataset:

* [Pollux-Caption-VL-2B](https://huggingface.co/prithivMLmods/Pollux-Caption-VL-2B): A Qwen2.5-VL-based model optimized for OCR captioning tasks

## Citation

If you use this dataset, please cite the original dataset:

> **BLIP3o/BLIP3o-Pretrain-Long-Caption**
> [https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)

And reference this curated derivative:

> **Corvus-OCR-Caption-Mix by prithivMLmods**

## Related Collections

This dataset is part of the [`Corvus-OCR-Caption-Mix`](https://huggingface.co/collections/prithivMLmods/Corvus-OCR-Caption-Mix) collection, which includes multiple subsets optimized for variable-dimension OCR captioning use cases.
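
## Example: Size Estimate and Streaming Access

The card reports 4.68 GB for the first 49.9k rows and roughly 230,000 rows in total, so streaming may be preferable to a full download. The following is a minimal sketch, not an official loader. It assumes the dataset is available on the Hugging Face Hub under the repository id `prithivMLmods/Corvus-OCR-Caption-Mix` (an assumption inferred from the collection link; the page itself only links the ModelScope copy) and that the `datasets` library is installed. The helper names `estimate_total_gb` and `stream_samples` are hypothetical.

```python
from typing import Iterator

# Figures quoted in the dataset card.
SAMPLE_ROWS = 49_900   # rows covered by the reported storage size
SAMPLE_GB = 4.68       # reported storage size for those rows
TOTAL_ROWS = 230_000   # approximate total row count


def estimate_total_gb(sample_gb: float = SAMPLE_GB,
                      sample_rows: int = SAMPLE_ROWS,
                      total_rows: int = TOTAL_ROWS) -> float:
    """Linearly extrapolate storage size from a sample to the full split.

    This assumes the average row size of the sample holds for all rows.
    """
    return sample_gb / sample_rows * total_rows


def stream_samples(repo_id: str = "prithivMLmods/Corvus-OCR-Caption-Mix",
                   limit: int = 3) -> Iterator[dict]:
    """Yield the first `limit` rows without downloading the whole split.

    `repo_id` is an assumed Hugging Face Hub id, not confirmed by the card.
    """
    # Imported lazily so the size estimate works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split="train", streaming=True)
    for i, row in enumerate(stream):
        if i >= limit:
            break
        # Per the card, each row should carry an "image" and a "text" column.
        yield row


# Example usage (requires network access and the `datasets` package):
# for row in stream_samples(limit=2):
#     print(row["text"][:80])
```

Under the linear assumption, `estimate_total_gb()` comes out at about 21.6 GB, which is consistent with the card's "~20+ GB" figure for the full split.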

Provider: maas

Created: 2025-07-13
