
Corvus-OCR-Caption-Mini-Mix

Source: ModelScope community (updated 2025-12-03, indexed 2025-07-19)
Download link: https://modelscope.cn/datasets/prithivMLmods/Corvus-OCR-Caption-Mini-Mix
Overview:
# **Corvus-OCR-Caption-Mini-Mix**

**Corvus-OCR-Caption-Mini-Mix** is a high-quality, compact image-caption dataset designed for training and evaluating image-to-text models. It is a carefully curated subset of the larger [`BLIP3o/BLIP3o-Pretrain-Long-Caption`](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), optimized for mixed OCR and long-form captioning tasks.

## Dataset Summary

This dataset contains a balanced mix of:

* Long-form natural language captions
* OCR-heavy samples with scientific, mathematical, and document-style content

It is especially suitable for models that require both visual understanding and textual reasoning, such as those used in document intelligence, scientific paper analysis, and complex scene captioning.

## Features

* **Images:** Diverse scenes including real-world, street views, documents, and mathematical notations
* **Captions:** Textual descriptions or OCR content, often in LaTeX or natural language
* **Languages:** Primarily English and Chinese
* **Data Format:** Arrow
* **License:** Apache 2.0

## Dataset Details

* **Split:** `train` only
* **Rows:** 78,964
* **Size:** 850 MB (raw), 847 MB (Parquet)

| Column | Type   | Description                         |
| ------ | ------ | ----------------------------------- |
| image  | image  | Input image                         |
| text   | string | Corresponding caption or OCR output |

## Use Cases

* Image-to-text pretraining
* OCR-based captioning
* Scientific document modeling
* Evaluation of multimodal reasoning

## Citation

If you use this dataset, please cite the original dataset:

> **BLIP3o/BLIP3o-Pretrain-Long-Caption**
> [https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)

And reference this curated derivative:

> **Corvus-OCR-Caption-Mini-Mix by prithivMLmods**

## Related Collection

This dataset is part of the [`Corvus-OCR-Caption-Mix`](https://huggingface.co/collections/prithivMLmods/Corvus-OCR-Caption-Mix) collection, which includes multiple variants supporting variable-dimensional image-text pretraining.
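
The two-column schema (`image`, `text`) and the mix of LaTeX-style OCR output with natural-language captions can be explored with a short script. The following is only a minimal sketch under stated assumptions: the Hugging Face repository ID is assumed to match the ModelScope path shown on this page, the helper names (`looks_like_latex`, `tally_caption_styles`, `stream_train_split`) are hypothetical, and the regular expression is a crude heuristic rather than anything the dataset authors define.

```python
import re
from collections import Counter

# Assumption: the Hugging Face repo ID mirrors the ModelScope path on this page.
HUB_ID = "prithivMLmods/Corvus-OCR-Caption-Mini-Mix"

# Crude heuristic (not from the dataset card): a LaTeX command such as \frac,
# an inline $...$ span, or a braced sub/superscript such as x^{2}.
_LATEX_HINT = re.compile(r"\\[A-Za-z]+|\$[^$]+\$|[_^]\{")


def looks_like_latex(text: str) -> bool:
    """Return True if the caption contains LaTeX-like markup."""
    return bool(_LATEX_HINT.search(text))


def tally_caption_styles(samples) -> Counter:
    """Count LaTeX-like vs. natural-language captions.

    `samples` is any iterable of mappings with a "text" key, which is the
    caption column named in the Dataset Details table.
    """
    counts = Counter()
    for sample in samples:
        label = "latex_like" if looks_like_latex(sample["text"]) else "natural_language"
        counts[label] += 1
    return counts


def stream_train_split():
    """Stream the single `train` split without downloading it in full.

    Requires the third-party `datasets` package and network access.
    """
    from datasets import load_dataset

    return load_dataset(HUB_ID, split="train", streaming=True)
```

With `datasets` installed, something like `tally_caption_styles(itertools.islice(stream_train_split(), 200))` would tally the first 200 streamed rows. The heuristic will misclassify some captions (for example, prose that mentions two dollar amounts), so treat the counts as a rough indication only.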

Provider: maas
Created: 2025-07-13