
CoSyn-400K

Source: ModelScope community (魔搭社区). Updated: 2025-11-27. Indexed: 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/CoSyn-400K
Resource description:
# CoSyn-400k

CoSyn-400k is a collection of synthetic question-answer pairs about a very diverse range of computer-generated images. The data was created by using the [Claude large language model](https://claude.ai/) to generate code that can be executed to render an image, and using [GPT-4o mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) to generate Q/A pairs based on the code (without using the rendered image). The code used to generate this data is [open source](https://github.com/allenai/pixmo-docs). Synthetic pointing data is available in a [separate repo](https://huggingface.co/datasets/allenai/CoSyn-point).

Quick links:
- 📃 [CoSyn Paper](https://arxiv.org/pdf/2502.14846)
- 📃 [Molmo Paper](https://molmo.allenai.org/paper.pdf)

## Loading

The dataset has several subsets:
- chart
- chemical
- circuit
- diagram
- document
- graphic
- math
- music
- nutrition
- table

Use `config_name` to specify which one to load; by default, `chart` is loaded. For example:

```python
import datasets

table_dataset = datasets.load_dataset("allenai/CoSyn-400K", "table", split="train")
```

## Data Format

The rendered image is included in the dataset directly:

```python
print(table_dataset[0]["image"])
# >>> <PIL.PngImagePlugin.PngImageFile image mode=RGB size=2400x1200 at 0x7F362070CEB0>
```

Each image is matched with multiple question-answer pairs:

```python
for q, a in zip(table_dataset[0]["questions"]["question"], table_dataset[0]["questions"]["answer"]):
    print(q, a)
# >>>
# What is the waist circumference range for adult females? 64-88 cm
# What is the weight range for children aged 2-12 years? 10-45 kg
# Is the BMI range for infants provided in the table? No
# Which age group has the highest resting heart rate range? Infants (0-1 year)
# What is the difference in lung capacity range between adolescents and elderly? Maximum difference: 0.5 L, Minimum difference: 1.5 L
# Do adult males have a higher blood pressure range than adolescents? Yes
# What is the average height of elderly females compared to male adolescents? Male adolescents are taller by 10 cm
# Does the table provide a consistent BMI range across all groups for females? Yes
# Which gender has a lower average hip circumference range among the elderly? Females have a lower average hip circumference
```

## Splits

The data is divided into validation and train splits. These splits are "unofficial" because we do not generally use this data for evaluation. However, they reflect the splits we used when training.

## License

This dataset is licensed under ODC-BY-1.0. It is intended for research and educational use in accordance with Ai2's [Responsible Use Guidelines](https://allenai.org/responsible-use). This dataset includes output images derived from code generated by Claude, which are subject to Anthropic's [terms of service](https://www.anthropic.com/legal/commercial-terms) and [usage policy](https://www.anthropic.com/legal/aup). The questions were generated by GPT-4o mini and are subject to [separate terms](https://openai.com/policies/row-terms-of-use) governing their use.

## Citation

Please cite the following papers if you use this dataset in your work.

```bibtex
@article{yang2025scaling,
  title={Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation},
  author={Yang, Yue and Patel, Ajay and Deitke, Matt and Gupta, Tanmay and Weihs, Luca and Head, Andrew and Yatskar, Mark and Callison-Burch, Chris and Krishna, Ranjay and Kembhavi, Aniruddha and others},
  journal={arXiv preprint arXiv:2502.14846},
  year={2025}
}
```

```bibtex
@article{deitke2024molmo,
  title={Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models},
  author={Deitke, Matt and Clark, Christopher and Lee, Sangho and Tripathi, Rohun and Yang, Yue and Park, Jae Sung and Salehi, Mohammadreza and Muennighoff, Niklas and Lo, Kyle and Soldaini, Luca and others},
  journal={arXiv preprint arXiv:2409.17146},
  year={2024}
}
```
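The Data Format section shows that each record stores its questions and answers as two parallel lists under `record["questions"]`. As a minimal sketch of how those lists might be flattened into individual pairs, the helpers below operate on a plain dictionary shaped like that example. The names `iter_qa_pairs`, `flatten_records`, and `example_record` are hypothetical and not part of the dataset or the `datasets` library, and the stand-in record contains only two of the pairs printed above.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def iter_qa_pairs(record: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) tuples from one record.

    Assumes the layout shown in the Data Format section:
    record["questions"] holds parallel "question" and "answer" lists.
    """
    questions = record["questions"]
    yield from zip(questions["question"], questions["answer"])


def flatten_records(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn records into one row per question-answer pair.

    Each row keeps the index of the record (and hence the image) it came from.
    """
    rows = []
    for index, record in enumerate(records):
        for question, answer in iter_qa_pairs(record):
            rows.append({"record": index, "question": question, "answer": answer})
    return rows


# Hand-made stand-in for one dataset record (the image is omitted here).
example_record = {
    "image": None,
    "questions": {
        "question": [
            "What is the waist circumference range for adult females?",
            "Is the BMI range for infants provided in the table?",
        ],
        "answer": ["64-88 cm", "No"],
    },
}

pairs = list(iter_qa_pairs(example_record))
rows = flatten_records([example_record])
for question, answer in pairs:
    print(question, answer)
```

With the real dataset loaded as in the Loading section, the same helper could presumably be called as `iter_qa_pairs(table_dataset[0])`, since that record exposes the same `questions` keys.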

Provider:
maas
Created:
2025-05-27
Curated summary
Dataset introduction
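Background and challenges
Background overview
CoSyn-400K is a synthetic dataset released by AllenAI that contains roughly 400,000 question-answer pairs about computer-generated images. Claude generates the code that renders each image, and GPT-4o mini generates the question-answer pairs from that code. The dataset covers multiple subsets, such as chart and chemical, and is suited to multimodal research. It is released under the ODC-BY-1.0 license and is intended for research and educational use.
The summary above was collected and generated by 遇见数据集.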