VillanovaAI/Multi-CoSyn-400K
Source: Hugging Face. Updated 2026-01-23; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/VillanovaAI/Multi-CoSyn-400K
Overview:
Multi-CoSyn-400K is a multilingual extension of the original CoSyn-400K dataset. It consists of question-answer pairs, with reasoning, about text-rich images. The images and QA pairs were generated with code-based rendering tools and text-only LLMs, with the aim of providing new training material that improves how vision LLMs reason over text-rich images. The dataset has ten subsets (chart, chemical, circuit, diagram, document, graphic, math, music, nutrition, and tables), totaling 365,615 examples. The QA pairs and reasoning were regenerated with the Qwen3-VL-235B-A22B-Instruct model so that all annotations are fully open for reuse. Five languages are supported: English, Italian, French, Spanish, and German. The dataset is suitable for training multimodal LLMs and for benchmarking model performance across languages.
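A minimal sketch of how one subset might be loaded with the Hugging Face `datasets` library. The helper name `load_subset` is hypothetical. The configuration names are assumed to match the subset names listed in the description, and the `train` split is also an assumption; the actual names on the Hub may differ (for example singular versus plural), so check the dataset page before relying on them.

```python
# Subset and language lists are copied from the description above.
# ASSUMPTION: each subset name is also a valid configuration name on the Hub.
SUBSETS = [
    "chart", "chemical", "circuit", "diagram", "document",
    "graphic", "math", "music", "nutrition", "tables",
]
LANGUAGES = ["English", "Italian", "French", "Spanish", "German"]

REPO_ID = "VillanovaAI/Multi-CoSyn-400K"


def load_subset(subset, streaming=True):
    """Load one subset of the dataset (hypothetical helper).

    Streaming is on by default so that examples are fetched lazily
    instead of downloading the whole subset up front.
    """
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported here so the lists above can be used without the library installed.
    from datasets import load_dataset

    # ASSUMPTION: a "train" split exists for every configuration.
    return load_dataset(REPO_ID, subset, split="train", streaming=streaming)
```

Under these assumptions, `load_subset("chart")` would return an iterable dataset whose records can be inspected with `next(iter(...))` to see the available fields before training or evaluation.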
Provider:
VillanovaAI



