VillanovaAI/Multi-CoSyn-400K

Source: Hugging Face | Updated: 2026-01-23 | Indexed: 2026-02-07
Download link:
https://hf-mirror.com/datasets/VillanovaAI/Multi-CoSyn-400K
Overview:
Multi-CoSyn-400K is a multilingual extension of the original CoSyn-400K dataset. It consists of question-answer pairs, with accompanying reasoning, about text-rich images. The data were generated with code-based rendering tools and text-only LLMs to provide new training material for vision LLMs, with the aim of improving their reasoning on text-rich images. The dataset has ten subsets (chart, chemical, circuit, diagram, document, graphic, math, music, nutrition, and table) totaling 365,615 examples. The QA pairs and reasoning were regenerated with the Qwen3-VL-235B-A22B-Instruct model so that all annotations are freely reusable. Five languages are covered: English, Italian, French, Spanish, and German. The dataset is suited to training multimodal LLMs and to benchmarking model performance across languages.
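
As a rough illustration of how one subset might be pulled from the Hugging Face Hub, the sketch below uses the `load_dataset` function of the `datasets` library in streaming mode. The repository ID, the ten subset names, and the five languages come from the overview above. Everything else is an assumption: the helper `load_subset` is hypothetical, and the listing does not confirm that each subset is exposed as a config under these exact names or that a "train" split exists.

```python
"""Sketch: stream one subset of Multi-CoSyn-400K from the Hugging Face Hub."""

DATASET_ID = "VillanovaAI/Multi-CoSyn-400K"  # repository named in this listing

# The ten subsets and five languages named in the overview.
SUBSETS = ("chart", "chemical", "circuit", "diagram", "document",
           "graphic", "math", "music", "nutrition", "table")
LANGUAGES = ("English", "Italian", "French", "Spanish", "German")


def load_subset(subset, split="train"):
    """Return a streaming dataset for one subset (hypothetical helper).

    ASSUMPTIONS: each subset is exposed as a config under the name used
    here, and a "train" split exists. Neither is confirmed by the listing.
    """
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset {subset!r}; expected one of {SUBSETS}")
    # Imported lazily so the constants above are usable without the library.
    from datasets import load_dataset
    # streaming=True avoids downloading the whole subset up front.
    return load_dataset(DATASET_ID, subset, split=split, streaming=True)
```

With network access, `next(iter(load_subset("chart")))` should yield a single record whose keys can be inspected to discover the actual column names, including how the language of each example is encoded. If the config or split names differ from the assumptions above, the dataset page on the Hub is the place to check.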
Provided by:
VillanovaAI