VillanovaAI/Multi-CoSyn-400K

Name: VillanovaAI/Multi-CoSyn-400K
Creator: VillanovaAI
Published: 2026-01-23 16:12:17
License: No description available

Source: Hugging Face | Updated: 2026-01-23 | Indexed: 2026-02-07

Download link:

https://hf-mirror.com/datasets/VillanovaAI/Multi-CoSyn-400K

Overview:

Multi-CoSyn-400K is a multilingual extension of the original CoSyn-400K dataset. It consists of question-answer pairs, with reasoning, about text-rich images. The images and QA pairs were generated with code-based rendering tools and text-only LLMs to provide new training material for vision LLMs, with the goal of improving their reasoning on text-rich images. The dataset includes ten subsets (chart, chemical, circuit, diagram, document, graphic, math, music, nutrition, and tables) totaling 365,615 examples. The QA pairs and reasoning were regenerated with the Qwen3-VL-235B-A22B-Instruct model so that all annotations are fully open for reuse. The dataset covers five languages (English, Italian, French, Spanish, and German) and is suitable for training multimodal LLMs and for benchmarking model performance across languages.
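
As a rough illustration of how one subset might be read, the sketch below uses `load_dataset` from the Hugging Face `datasets` library. The subset and language names come from the description above. Treating those subset names as configuration names, and the `train` split, are assumptions that this listing does not confirm, and `stream_subset` is a hypothetical helper.

```python
# Subset and language names as listed in the overview above. Whether the Hub
# uses these exact strings as configuration names is an assumption.
SUBSETS = ["chart", "chemical", "circuit", "diagram", "document",
           "graphic", "math", "music", "nutrition", "tables"]
LANGUAGES = ["English", "Italian", "French", "Spanish", "German"]


def stream_subset(subset, split="train"):
    """Hypothetical helper: stream one subset without downloading it fully.

    The split name "train" is an assumption, not taken from the listing.
    """
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    # Imported lazily so the constants above are usable without the library.
    from datasets import load_dataset  # pip install datasets
    return load_dataset(
        "VillanovaAI/Multi-CoSyn-400K",
        name=subset,
        split=split,
        streaming=True,
    )
```

Streaming avoids fetching all 365,615 examples up front. Check the dataset card for the actual configuration, split, and column names before relying on this.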

Provider:

VillanovaAI
