rainxx/Corvus-OCR-Caption-Mix
Source: Hugging Face. Updated 2025-12-14; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/rainxx/Corvus-OCR-Caption-Mix
Description:
Corvus-OCR-Caption-Mix is a high-quality, compact image-caption dataset for training and evaluating image-to-text models. It is derived and optimized from the larger BLIP3o/BLIP3o-Pretrain-Long-Caption dataset, with a focus on long-form captions and mixed OCR tasks across a variety of image types. The dataset contains over 229,000 image-caption pairs in a balanced mix of: OCR-rich documents featuring mathematical expressions, LaTeX, and technical notation; natural scenes and artistic visuals with long-form descriptive captions; bilingual content in English and Chinese; and multi-domain examples suited to both document and scene understanding. Images include handwritten formulas, printed text, documents, and scenic photos; caption text may be natural language, LaTeX, mathematical expressions, or technical information. The modality is image-to-text (OCR + caption). The data format is Apache Arrow, and the license is Apache 2.0.
Provider:
rainxx
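
Since the listing names only the repository and its Arrow format, the following is a minimal sketch of how a few records might be inspected without downloading all 229,000+ pairs. It assumes the third-party `datasets` package is installed and that the repository exposes a `train` split; neither is stated on this page. The helper names `dataset_url` and `stream_samples` are hypothetical, and column names are not documented here, so the sketch prints whatever keys the records carry rather than guessing them.

```python
import os

DATASET_ID = "rainxx/Corvus-OCR-Caption-Mix"
MIRROR_ENDPOINT = "https://hf-mirror.com"  # host taken from the download link above


def dataset_url(endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build the dataset page URL for a given Hub endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{DATASET_ID}"


def stream_samples(n: int = 3, split: str = "train", use_mirror: bool = False):
    """Return the first n records using streaming mode.

    Assumptions: `datasets` is installed and a split named `split` exists.
    The split name "train" is a guess, not something this page confirms.
    """
    if use_mirror:
        # HF_ENDPOINT is read when huggingface_hub is first imported, so it
        # only takes effect if set before that import happens.
        os.environ["HF_ENDPOINT"] = MIRROR_ENDPOINT
    from datasets import load_dataset  # lazy import of the third-party package

    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(stream.take(n))


# Example usage (requires network access):
# for record in stream_samples():
#     print(sorted(record.keys()))
```

Streaming is used here so that a quick look at the schema does not require fetching the full dataset; for training, a regular (non-streaming) load may be more convenient.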



