MusicPile

Name: MusicPile
Creator: maas
Published: 2025-12-05 16:14:41
License: not specified

ModelScope Community, updated 2025-12-05, indexed 2024-05-15

Download link:

https://modelscope.cn/datasets/m-a-p/MusicPile

Description:

[**🌐 DemoPage**](https://ezmonyi.github.io/ChatMusician/) | [**🤗 SFT Dataset**](https://huggingface.co/datasets/m-a-p/MusicPile-sft) | [**🤗 Benchmark**](https://huggingface.co/datasets/m-a-p/MusicTheoryBench) | [**📖 arXiv**](http://arxiv.org/abs/2402.16153) | [💻 **Code**](https://github.com/hf-lin/ChatMusician) | [**🤖 Chat Model**](https://huggingface.co/m-a-p/ChatMusician) | [**🤖 Base Model**](https://huggingface.co/m-a-p/ChatMusician-Base)

# Dataset Card for MusicPile

*MusicPile* is the first pretraining corpus for **developing musical abilities** in large language models.

It has **5.17M** samples and approximately **4.16B** tokens, including web-crawled corpora, encyclopedias, music books, YouTube music captions, musical pieces in ABC notation, math content, and code.

You can easily load it:

```python
from datasets import load_dataset

ds = load_dataset("m-a-p/MusicPile")
```

## Dataset Details

### Dataset Description

*MusicPile* was built on top of open-source datasets and high-quality data handcrafted by members of [MAP](https://m-a-p.ai/).
Its sources are as follows:

| Datasets | Sourced from | Tokens | # Samples | Category | Format |
| --- | --- | --- | --- | --- | --- |
| [pile](https://pile.eleuther.ai/) | public dataset | 0.83B | 18K | general | article |
| [Falcon-RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | public dataset | 0.80B | 101K | general | article |
| [Wikipedia](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | public dataset | 0.39B | 588K | general | article |
| [OpenChat](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/tree/main) | public dataset | 62.44M | 43K | general | chat |
| [LinkSoul](https://huggingface.co/datasets/LinkSoul/instruction_merge_set) | public dataset | 0.6B | 1.5M | general | chat |
| [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data.json) | public dataset | 9.77M | 49K | general | chat |
| [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) | public dataset | 3.12M | 14K | general | chat |
| [IrishMAN](https://huggingface.co/datasets/sander-wood/irishman) | public dataset + Human-written Instructions | 0.23B | 868K | music score | chat |
| [KernScores](http://kern.ccarh.org) | public dataset + Human-written Instructions | 2.76M | 10K | music score | chat |
| [JSB Chorales](https://github.com/sander-wood/deepchoir) | public dataset + Human-written Instructions | 0.44M | 349 | music score | chat |
| synthetic music chat\* | public dataset + Human-written Instructions | 0.54B | 50K | music score | chat |
| music knowledge\*\* | Generated with GPT-4 | 0.22B | 255K | music verbal | chat |
| music summary\*\* | Generated with GPT-4 | 0.21B | 500K | music verbal | chat |
| [GSM8k](https://huggingface.co/datasets/gsm8k) | public dataset | 1.68M | 7K | math | chat |
| [math](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) | public dataset | 7.03M | 37K | math | chat |
| [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct) | public dataset | 55.50M | 188K | math | chat |
| [Camel-Math](https://huggingface.co/datasets/camel-ai/math) | public dataset | 27.76M | 50K | math | chat |
| [arxiv-math-instruct-50k](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k) | public dataset | 9.06M | 50K | math | chat |
| [Camel-Code](https://huggingface.co/datasets/camel-ai/code) | public dataset | 0.13B | 366K | code | chat |
| [OpenCoder](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/tree/main) | public dataset | 36.99M | 28K | code | chat |
| Total | - | 4.16B | 5.17M | - | - |

```
* means synthesis from music score data and general data.
** means with NEW rationales curated by us by prompting GPT-4.
chat format refers to style as `Human: {...} </s> Assistant: {...} </s> `
```

#### Language Corpora Curation

**General corpora.** Representative public datasets, including [pile](https://pile.eleuther.ai/), [Falcon-RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), and [Wikipedia](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), are used. To curate a musically relevant corpus, we list a set of music-related words as a criterion to filter Pile, based on [music terminologies](https://en.m.wikipedia.org/wiki/Glossary_of_music_terminology). We only include music terminology words that appear more than 10 times and account for over 0.5% of domain agreement.

**Instruction and chat data.** The instruction datasets [LinkSoul](https://huggingface.co/datasets/LinkSoul/instruction_merge_set), [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data.json), and [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) are diverse and representative enough to adapt the LLM to potential downstream usage.
To enable multiple rounds of conversation, the chat corpus [OpenChat](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/tree/main) is included.

**Music knowledge and music summary.** We crawl the metadata of 2 million music tracks from YouTube, including song title, description, album, artist, lyrics, playlist, etc. 500k of them are extracted. We generate summaries of these metadata using GPT-4. We generate music knowledge QA pairs following [Self-instruct](https://arxiv.org/abs/2212.10560). According to our topic outline in the [ChatMusician paper](http://arxiv.org/abs/2402.16153), 255k instructions are generated, with corresponding answers generated with GPT-4.

**Math and code data.** The computational music community lacks symbolic music datasets, and we hypothesize that including math and code may enhance the reasoning power of symbolic music. [GSM8k](https://huggingface.co/datasets/gsm8k), [MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct), [Camel-Math](https://huggingface.co/datasets/camel-ai/math), [arxiv-math-instruct-50k](https://huggingface.co/datasets/ArtifactAI/arxiv-math-instruct-50k), [Camel-Code](https://huggingface.co/datasets/camel-ai/code), and [OpenCoder](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/tree/main) are included. Empirically, we find this helps to improve the performance of music LLMs.

#### Music Score Corpora Curation

Although symbolic music datasets are scarce in the computational music community, we have made an effort to include music from various regions of the world. Our music scores showcase significant regional diversity. We designed a total of eight representative musical tasks on the collected corpora, including six for generating music scores and two for music understanding. The generative tasks involve generating music scores conditioned on the chord, melody, motifs, musical form, and style.
The understanding tasks involve extracting motifs and forms from the user input scores. The process of curating music instructions and algorithms is described in detail in the [ChatMusician paper](http://arxiv.org/abs/2402.16153).

Except for the general corpora, all the other datasets were constructed in conversation form with one or more rounds. The percentages of music verbal, code, music score, math, and general data are 10.42%, 2.43%, 18.43%, 4.05%, and 64.68%, respectively. The above table shows an overview of all data.

### Languages

*MusicPile* primarily contains English.

## Dataset Structure

*MusicPile* has 3 fields: `id`, `text`, and `src`. Each text contains no more than 2048 tokens (counted by LlamaTokenizer).

## Citation

If you find our work helpful, please cite our paper.

```
@misc{yuan2024chatmusician,
      title={ChatMusician: Understanding and Generating Music Intrinsically with LLM},
      author={Ruibin Yuan and Hanfeng Lin and Yi Wang and Zeyue Tian and Shangda Wu and Tianhao Shen and Ge Zhang and Yuhang Wu and Cong Liu and Ziya Zhou and Ziyang Ma and Liumeng Xue and Ziyu Wang and Qin Liu and Tianyu Zheng and Yizhi Li and Yinghao Ma and Yiming Liang and Xiaowei Chi and Ruibo Liu and Zili Wang and Pengfei Li and Jingcheng Wu and Chenghua Lin and Qifeng Liu and Tao Jiang and Wenhao Huang and Wenhu Chen and Emmanouil Benetos and Jie Fu and Gus Xia and Roger Dannenberg and Wei Xue and Shiyin Kang and Yike Guo},
      year={2024},
      eprint={2402.16153},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```

## Dataset Card Contact

Authors of ChatMusician.
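
As a usage note on the chat format given with the source table (`Human: {...} </s> Assistant: {...} </s> `), the following is a minimal sketch of how a `text` value in that style could be split into turns. It assumes that `</s>` always closes a turn and that each turn starts with a `Human:` or `Assistant:` prefix; real records may deviate from this. The function name `split_chat_turns` and the sample string are hypothetical and are not taken from the dataset or its tooling.

```python
import re

# Turn terminator, as shown in the card's chat format string.
TURN_END = "</s>"
# Role prefixes, as shown in the card's chat format string.
ROLE_PATTERN = re.compile(r"^(Human|Assistant):\s*(.*)$", re.DOTALL)


def split_chat_turns(text: str) -> list[tuple[str, str]]:
    """Split a chat-format string into (role, content) pairs.

    Segments that do not start with a known role prefix are skipped.
    """
    turns = []
    for segment in text.split(TURN_END):
        segment = segment.strip()
        if not segment:
            continue
        match = ROLE_PATTERN.match(segment)
        if match:
            turns.append((match.group(1), match.group(2).strip()))
    return turns


# Hypothetical sample written for illustration, not a record from MusicPile.
sample = "Human: Name a note in C major. </s> Assistant: C is one. </s> "
turns = split_chat_turns(sample)
for role, content in turns:
    print(f"{role}: {content}")
```

Skipping unrecognized segments rather than raising keeps the sketch tolerant of article-format records, which the card says are not in conversation form.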
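
The category percentages quoted in the card can be roughly cross-checked against the source table. The sketch below sums the rounded per-dataset token counts (in millions) by category and converts them to shares; the grouping follows the table's Category column, and the dictionary and function names are illustrative. Because the table values are rounded, the results only approximate the quoted figures. With these inputs, the shares closest to the quoted 2.43% and 4.05% come out under math and code respectively, which is the reverse of the order in the sentence; this may be a rounding or labelling difference and is not confirmed by the source.

```python
# Token counts in millions, copied from the rounded values in the source table.
TOKENS_M = {
    "general": [830, 800, 390, 62.44, 600, 9.77, 3.12],
    "music score": [230, 2.76, 0.44, 540],
    "music verbal": [220, 210],
    "math": [1.68, 7.03, 55.50, 27.76, 9.06],
    "code": [130, 36.99],
}


def category_shares(tokens_m: dict[str, list[float]]) -> dict[str, float]:
    """Return each category's share of the total token count, in percent."""
    totals = {category: sum(values) for category, values in tokens_m.items()}
    grand_total = sum(totals.values())
    return {category: 100.0 * total / grand_total for category, total in totals.items()}


shares = category_shares(TOKENS_M)
# Print the categories from largest to smallest share.
for category, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{category:>12}: {pct:5.2f}%")
```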


Provider:

maas

Created:

2024-04-14
