AnyInstruct
Source: ModelScope community. Updated 2025-12-05; indexed 2025-11-15.
Download link:
https://modelscope.cn/datasets/openmoss/AnyInstruct
# Dataset details
## Dataset type
AnyInstruct is a dataset of 108k multimodal instruction-following examples that interleave multiple modalities: text, speech, images, and music.
## Data Construction
We first synthesize textual multimodal dialogues using GPT-4, and then generate images, music, and voices using DALL-E 3, MusicGen, and the Azure Text-to-Speech API, respectively. The voice component comprises 39 different timbres, with the speech rate randomly sampled within a certain range.
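The per-utterance voice settings described above could be drawn as in the following minimal sketch. It is not the authors' pipeline: the function name, the uniform distribution, and the rate bounds are all assumptions made for illustration, since the card only states that there are 39 timbres and that the rate is sampled "within a certain range".

```python
import random

NUM_TIMBRES = 39  # number of timbres stated in the dataset card


def sample_voice_settings(rng, min_rate, max_rate, num_timbres=NUM_TIMBRES):
    """Pick a timbre index and a speech-rate multiplier for one utterance.

    Hypothetical helper: uniform sampling is an assumption, not a
    documented detail of how AnyInstruct was built.
    """
    if min_rate > max_rate:
        raise ValueError("min_rate must not exceed max_rate")
    timbre_id = rng.randrange(num_timbres)  # uniform over 0 .. num_timbres - 1
    rate = rng.uniform(min_rate, max_rate)  # uniform within the given bounds
    return timbre_id, rate


rng = random.Random(0)  # seeded so a run is reproducible
# 0.8 and 1.2 are placeholder bounds, not values from the dataset card.
timbre_id, rate = sample_voice_settings(rng, 0.8, 1.2)
print(timbre_id, rate)
```

The sampled pair would then be passed to whichever text-to-speech call is in use; that step is left out here because the card does not describe it.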
## File organization
The data in **part1 and part2** contain all modalities, totaling 108k high-quality multimodal dialogues, featuring a variety of multimodal combinations. This dataset includes around 205k images, 503k voice recordings, and 113k music tracks.
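To put these totals in perspective, the short calculation below derives the average number of assets per dialogue. It is only a back-of-the-envelope sketch: the inputs are the rounded figures quoted above, and the variable and function names are illustrative.

```python
DIALOGUES = 108_000  # "108k" dialogues, as quoted above

# Approximate asset totals, as quoted above.
COUNTS = {
    "images": 205_000,
    "voice recordings": 503_000,
    "music tracks": 113_000,
}


def per_dialogue_averages(counts, dialogues):
    """Return the mean number of assets of each kind per dialogue."""
    return {name: total / dialogues for name, total in counts.items()}


averages = per_dialogue_averages(COUNTS, DIALOGUES)
for name, avg in averages.items():
    print(f"{name}: about {avg:.1f} per dialogue")
```

With the rounded totals, this works out to roughly 1.9 images, 4.7 voice recordings, and 1.0 music tracks per dialogue on average.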
The **data_construction** folder contains intermediate content from the data construction process, such as topics, scenarios, and captions for images and voices.
The **speech_conv** directory consists of spoken dialogues. We selected and cleaned dialogues suitable for vocalization from existing textual instruction datasets and synthesized the voices, totaling 108k entries.
The images generated by DALL-E 3 have a resolution of 1024×1024. To reduce storage requirements, this repository uses images with a resolution of 224×224. If high-resolution images are needed, please download them from https://huggingface.co/datasets/fnlp/AnyInstruct-resolution-1024.
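As a rough illustration of the size difference and of one way to fetch the high-resolution copy, here is a minimal sketch. The helper names are hypothetical, the pixel ratio is only a proxy for storage (actual file sizes depend on image compression), and the download step assumes the third-party `huggingface_hub` package is installed.

```python
from urllib.parse import urlparse

HIGH_RES_URL = "https://huggingface.co/datasets/fnlp/AnyInstruct-resolution-1024"


def repo_id_from_url(url):
    """Extract 'owner/name' from a huggingface.co/datasets/<owner>/<name> URL."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def pixel_ratio(high, low):
    """Ratio of pixel counts between two square image resolutions."""
    return (high * high) / (low * low)


def download_high_res(local_dir):
    """Download the 1024x1024 variant (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download  # imported lazily: third-party

    return snapshot_download(
        repo_id=repo_id_from_url(HIGH_RES_URL),
        repo_type="dataset",
        local_dir=local_dir,
    )


print(repo_id_from_url(HIGH_RES_URL))    # fnlp/AnyInstruct-resolution-1024
print(round(pixel_ratio(1024, 224), 1))  # 20.9
```

Each 1024×1024 image has about 20.9 times as many pixels as its 224×224 counterpart, which is why the low-resolution copy is the default. Calling `download_high_res("some/local/dir")` would fetch the full-resolution repository.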
**Paper or resources for more information:** https://junzhan2000.github.io/AnyGPT.github.io/
**Where to send questions or comments about the dataset:** https://github.com/OpenMOSS/AnyGPT/issues
## Citation
If you find AnyInstruct useful in your research or applications, please kindly cite:
```
@article{zhan2024anygpt,
title={AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling},
author={Zhan, Jun and Dai, Junqi and Ye, Jiasheng and Zhou, Yunhua and Zhang, Dong and Liu, Zhigeng and Zhang, Xin and Yuan, Ruibin and Zhang, Ge and Li, Linyang and others},
journal={arXiv preprint arXiv:2402.12226},
year={2024}
}
```
Provider: maas
Created: 2025-10-23



