Corvus-OCR-Caption-Mix

Source: ModelScope community · Updated 2026-01-06 · Indexed 2025-07-19
Download link:
https://modelscope.cn/datasets/prithivMLmods/Corvus-OCR-Caption-Mix
Resource overview:
# **Corvus-OCR-Caption-Mix**

**Corvus-OCR-Caption-Mix** is a high-quality, compact image-caption dataset designed for training and evaluating image-to-text models. This collection is derived and optimized from the larger [`BLIP3o/BLIP3o-Pretrain-Long-Caption`](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption), with a focus on long-form captions and mixed OCR tasks across a variety of image types.

## Dataset Summary

The dataset spans over 229,000 image-caption pairs and provides a balanced blend of:

* OCR-rich documents featuring mathematical expressions, LaTeX, and technical notations
* Descriptive natural scenes and artistic visuals with long-form captions
* Bilingual content in English and Chinese
* Multi-domain examples suitable for both document and scene understanding

## Features

* **Images:** A diverse selection including handwritten formulas, printed text, documents, scenic photos, and more
* **Captions:** Long-form textual content with natural language, LaTeX, mathematical expressions, or technical information
* **Languages:** English and Chinese
* **Modality:** Image-to-Text (OCR + Caption)
* **Data Format:** Apache Arrow
* **License:** Apache 2.0

## Dataset Details

* **Split:** `train`
* **Rows:** ~230,000
* **Storage Size:** 4.68 GB (first 49.9k rows)
* **Estimated Total Size:** ~20+ GB for all rows

| Column | Type   | Description                             |
| ------ | ------ | --------------------------------------- |
| image  | image  | Input image (scene/document/screenshot) |
| text   | string | Caption text or OCR-extracted content   |

## Use Cases

* OCR pretraining and evaluation
* Vision-language modeling on complex document layouts
* LaTeX and mathematical expression extraction
* Long-form captioning across real-world and academic content
* Cross-lingual captioning and translation modeling

## Related Models

Models fine-tuned on this dataset:

* [Pollux-Caption-VL-2B](https://huggingface.co/prithivMLmods/Pollux-Caption-VL-2B): a Qwen2.5-VL-based model optimized for OCR captioning tasks

## Citation

If you use this dataset, please cite the original dataset:

> **BLIP3o/BLIP3o-Pretrain-Long-Caption**
> [https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption](https://huggingface.co/datasets/BLIP3o/BLIP3o-Pretrain-Long-Caption)

And reference this curated derivative:

> **Corvus-OCR-Caption-Mix by prithivMLmods**

## Related Collections

This dataset is part of the [`Corvus-OCR-Caption-Mix`](https://huggingface.co/collections/prithivMLmods/Corvus-OCR-Caption-Mix) collection, which includes multiple subsets optimized for variable-dimension OCR captioning use cases.
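
The estimated total size quoted under Dataset Details can be roughly reproduced from the other two figures on the card (4.68 GB for the first 49.9k rows, about 230,000 rows overall). The sketch below is only a back-of-the-envelope check: it assumes the average row size in the measured sample is representative of the whole dataset, which the card does not state, and the function name `estimate_total_gb` is a hypothetical helper introduced here for illustration.

```python
def estimate_total_gb(sample_gb: float, sample_rows: int, total_rows: int) -> float:
    """Linearly extrapolate storage size from a measured sample of rows.

    Assumption (not from the dataset card): rows outside the sample have
    the same average size as rows inside it.
    """
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_gb / sample_rows * total_rows


# Figures quoted on the dataset card.
SAMPLE_GB = 4.68       # storage size of the first 49.9k rows
SAMPLE_ROWS = 49_900
TOTAL_ROWS = 230_000   # approximate row count of the train split

if __name__ == "__main__":
    total = estimate_total_gb(SAMPLE_GB, SAMPLE_ROWS, TOTAL_ROWS)
    print(f"Estimated total size: {total:.1f} GB")  # prints "Estimated total size: 21.6 GB"
```

Under that uniform-row-size assumption the extrapolation gives about 21.6 GB, which is consistent with the card's "~20+ GB" figure; the real total may differ if document images and scene photos are unevenly distributed across the split.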
Provider:
maas
Created:
2025-07-13