SUNO-1M
ModelScope community · updated 2025-12-05 · indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/sleeping-ai/SUNO-1M
Description:
<h1 align='center'> SUNO 1 Million</h1>
<div align="center">
<img src="SUNO.gif" alt="loading" width="100">
</div>
**Update 04/07/2025**
We are excited to share the SUNO dataset, the largest known collection of artificially generated songs from SUNO, the most popular AI music generator. It is part of an ongoing effort by Sleeping AI to provide groundbreaking and impactful datasets for creating and researching state-of-the-art generative music models. That effort is a collection of datasets that includes Sleeping-DISCO, SUNO and many more.
### Why SUNO?
SUNO is one of the three key players in the market, backed by a wonderful research team and a community that loves to make songs. Artificially generated songs are still heavily debated, and the majority of the research comes from Chinese universities and labs. Their quality is debatable, and I myself never believed this could become a business so quickly.
> Hey! We are here now and I am sharing this dataset with everyone.
### How was the dataset sourced and collected?
I won't disclose the actual mechanism behind the collection, but we did not scrape the website, because Cloudflare and SUNO's paid Clerk tokens are notoriously obnoxious to get past.
To quote:
> It was sourced responsibly, and maybe if the SUNO team had done a better job with their engineering, I would not have been able to download it.
### Data fields
`uuid`: unique identifier for each audio track

`user_id`: identifies the user each song comes from, so the data can be traced to its source

`audio_url`: link to the audio file

`video_url`: link to the video file

`model_name`: model used to create the song

`major_model_version`: major model version, for tracking versions across songs
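
As a rough illustration of how these fields might be consumed, the sketch below parses rows into a typed record, skips rows with a missing field, and drops repeated `uuid` values. The card does not state the field types or the file format, so treating every field as a string, the `SunoRecord` and `parse_records` names, and the sample values are all assumptions made for this example; it is not the authors' validation pipeline.

```python
from collections import Counter
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SunoRecord:
    # Field names come from the card; the str types are an assumption.
    uuid: str
    user_id: str
    audio_url: str
    video_url: str
    model_name: str
    major_model_version: str


REQUIRED = tuple(f.name for f in fields(SunoRecord))


def parse_records(rows):
    """Keep rows where every field is present and non-empty, one per uuid."""
    seen = set()
    records = []
    for row in rows:
        if any(not row.get(name) for name in REQUIRED):
            continue  # incomplete row
        if row["uuid"] in seen:
            continue  # duplicate uuid
        seen.add(row["uuid"])
        records.append(SunoRecord(**{name: str(row[name]) for name in REQUIRED}))
    return records


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {"uuid": "a1", "user_id": "u1", "audio_url": "https://example.com/a1.mp3",
     "video_url": "https://example.com/a1.mp4", "model_name": "model-x",
     "major_model_version": "v3"},
    {"uuid": "a1", "user_id": "u1", "audio_url": "https://example.com/a1.mp3",
     "video_url": "https://example.com/a1.mp4", "model_name": "model-x",
     "major_model_version": "v3"},  # repeated uuid, dropped
    {"uuid": "b2", "user_id": "u2", "audio_url": "",
     "video_url": "https://example.com/b2.mp4", "model_name": "model-x",
     "major_model_version": "v3"},  # empty audio_url, dropped
    {"uuid": "c3", "user_id": "u2", "audio_url": "https://example.com/c3.mp3",
     "video_url": "https://example.com/c3.mp4", "model_name": "model-y",
     "major_model_version": "v4"},
]

records = parse_records(sample_rows)
songs_per_version = Counter(r.major_model_version for r in records)
```

Grouping by `major_model_version` (or `user_id`) afterwards is then a one-line `Counter` over the parsed records.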
### How about a DMCA takedown?
Yes, it is possible, but many other versions of SUNO's database are already up on Huggingface and Kaggle, so I am obviously not the first. Besides, if you think a truly original song is present in this dataset, I will take it down. Just point me to its `uuid`.

By the way, I followed all the standard data collection rules that apply in the EU: I only scraped data that was publicly available, and I did it purely for scientific research. Besides, SUNO probably trained its models on someone else's data as well.
I will also highlight that there is a paper that has used SUNO songs before:
1. [SUNOCAPs](https://www.sciencedirect.com/science/article/pii/S2352340924007078)

I don't believe my dataset does anything different from what many others have done before.
### Fine, why is it better?
The stark difference between me and my competition is that I have deduplicated and validated every data point I have uploaded. Previously, [nyuuzyou](https://huggingface.co/nyuuzyou) made a nice contribution of 655K songs, but it was riddled with JSONL garbage that exposed personal information and unnecessary context. It also did not provide an estimate of the entire SUNO database and failed to provide a valid user list. This dataset offers the following:
1. We are the first high-quality SUNO dataset to reach 1M (million) songs.
2. We offer both audio and video in a nicely done format.
3. We have plans to provide additional data by burning GPU hours for the sake of research and a nice arXiv paper.
4. We estimate that SUNO has 1.5-2M songs. Taking 1.5M as the database size, we have a **33% error rate** (1M songs out of an assumed 1.5M leaves about a third uncollected).
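
The 33% figure in the last item can be reproduced with simple arithmetic, assuming "error rate" means the share of the estimated database that the 1M collected songs do not cover. That reading is an assumption (the card does not define the term), and `missing_fraction` is a name invented for this sketch.

```python
def missing_fraction(collected: int, estimated_total: int) -> float:
    """Share of the estimated database that is not in the collection."""
    if estimated_total <= 0:
        raise ValueError("estimated_total must be positive")
    return 1 - collected / estimated_total


collected = 1_000_000  # the 1M songs claimed by the card

# The card's lower and upper estimates of SUNO's database size.
for estimated_total in (1_500_000, 2_000_000):
    share = missing_fraction(collected, estimated_total)
    print(f"{estimated_total:,} songs: {share:.1%} missing")
# 1,500,000 songs: 33.3% missing
# 2,000,000 songs: 50.0% missing
```

Under the same reading, the upper estimate of 2M songs would put the uncovered share at one half rather than one third.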
### Ethics statement
We compiled the data in good faith, made sure that we respected individual privacy, and downloaded the data in accordance with our local and European law.
### LICENCE
We offer this dataset under CC BY-NC-ND 4.0, which prohibits the following:
1. The data can't be shared, and nobody aside from the original author is allowed to create derivatives.
2. You have to seek permission from Sleeping AI before modifying the data and sharing it online.
Feel free to reach out on X (formerly Twitter): https://x.com/sleeping4cat
Provider:
maas
Created:
2025-07-07



