Resource overview:
---
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: src
    dtype: string
  - name: id
    dtype: int64
  splits:
  - name: train
    num_bytes: 1926209131
    num_examples: 1139473
  download_size: 985473122
  dataset_size: 1926209131
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
[**🌐 DemoPage**](https://ezmonyi.github.io/ChatMusician/) | [**🤗 Pretrain Dataset**](https://huggingface.co/datasets/m-a-p/MusicPile) | [**🤗 Benchmark**](https://huggingface.co/datasets/m-a-p/MusicTheoryBench) | [**📖 arXiv**](http://arxiv.org/abs/2402.16153) | [**💻 Code**](https://github.com/hf-lin/ChatMusician) | [**🤖 Chat Model**](https://huggingface.co/m-a-p/ChatMusician) | [**🤖 Base Model**](https://huggingface.co/m-a-p/ChatMusician-Base)
# Dataset Card for MusicPile-sft
*MusicPile-sft* is a subset of [MusicPile](https://huggingface.co/datasets/m-a-p/MusicPile).
It contains **1.14M** samples, with a 2:1 ratio of music verbal data (text) to music score data (in ABC notation).
Here is the overview:
| Datasets | Sourced from | # Samples | Category | Format |
| --- | --- | --- | --- | --- |
| [IrishMAN](https://huggingface.co/datasets/sander-wood/irishman) | public dataset + Human-written Instructions | 340K | music score | chat |
| [KernScores](http://kern.ccarh.org) | public dataset + Human-written Instructions | 10K | music score | chat |
| [JSB Chorales](https://github.com/sander-wood/deepchoir) | public dataset + Human-written Instructions | 33.5K | music score | chat |
| music knowledge | Generated with GPT-4 | 255K | music verbal | chat |
| music summary | Generated with GPT-4 | 500K | music verbal | chat |
Note: The JSB Chorales data is repeated 100 times, because very little data on Bach-style compositions is available.
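As a quick sanity check, the per-source counts in the table can be summed to confirm the headline total and ratio. The sketch below is plain arithmetic over the rounded figures from the table (in thousands); the variable names are illustrative and not part of the dataset.

```python
# Rounded sample counts from the overview table, in thousands.
counts_k = {
    "IrishMAN": ("music score", 340.0),
    "KernScores": ("music score", 10.0),
    "JSB Chorales": ("music score", 33.5),
    "music knowledge": ("music verbal", 255.0),
    "music summary": ("music verbal", 500.0),
}

# Sum the counts per category.
totals_k = {}
for category, n in counts_k.values():
    totals_k[category] = totals_k.get(category, 0.0) + n

total_k = sum(totals_k.values())
ratio = totals_k["music verbal"] / totals_k["music score"]

print(f"music score:  {totals_k['music score']:.1f}K")
print(f"music verbal: {totals_k['music verbal']:.1f}K")
print(f"total:        {total_k:.1f}K")
print(f"verbal:score = {ratio:.2f}:1")
```

The rounded counts sum to about 1,138.5K, which agrees with the stated 1.14M and with the `num_examples` value of 1139473 in the metadata. The ratio comes out at roughly 1.97:1, close to the stated 2:1.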
You can load the dataset with the `datasets` library:
```python
from datasets import load_dataset
ds = load_dataset("m-a-p/MusicPile-sft")
```
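Once records are available, the `src` field can be used to see which sources a batch of samples comes from. The helper below is a minimal sketch: `count_by_source` and the stand-in records are hypothetical, and the actual `src` values in the dataset may look different.

```python
from collections import Counter
from itertools import islice


def count_by_source(records, limit=1000):
    """Count `src` values over the first `limit` records of an iterable."""
    return Counter(rec["src"] for rec in islice(records, limit))


# Hypothetical stand-in records; real `src` values may differ.
sample = [
    {"id": 0, "src": "source_a"},
    {"id": 1, "src": "source_b"},
    {"id": 2, "src": "source_a"},
]
print(count_by_source(sample))
```

With the real data, the same function could be applied to `load_dataset("m-a-p/MusicPile-sft", split="train", streaming=True)`, which yields records lazily instead of downloading the full archive (about 985 MB according to `download_size`) up front.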
## Languages
*MusicPile-sft* is primarily in English.
## Dataset Structure
*MusicPile-sft* has five fields: `id`, `src`, `input`, `instruction`, and `output`.
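The card does not spell out how the fields are meant to be combined. Assuming the common instruction-tuning convention (`instruction` states the task, `input` carries optional context, and `output` is the target response), one record could be turned into a prompt and target pair roughly as follows. The `build_prompt` helper, its template, and the sample record are all hypothetical and are not the format used by the ChatMusician authors.

```python
def build_prompt(record):
    """Join `instruction` and `input` into one prompt string.

    The template is a hypothetical example, not the ChatMusician format.
    """
    instruction = record["instruction"].strip()
    extra = record["input"].strip()
    if extra:
        return f"{instruction}\n\n{extra}"
    return instruction


# Made-up record with the five documented fields.
record = {
    "id": 0,
    "src": "example",
    "instruction": "Describe the key of this tune.",
    "input": "X:1\nK:G\nGABc|",
    "output": "The tune is in G major.",
}
prompt = build_prompt(record)
target = record["output"]
print(prompt)
```

Records with an empty `input` fall back to the instruction alone, so the same helper covers both cases.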
## Citation
If you find our work helpful, please consider citing it:
```
@misc{yuan2024chatmusician,
title={ChatMusician: Understanding and Generating Music Intrinsically with LLM},
author={Ruibin Yuan and Hanfeng Lin and Yi Wang and Zeyue Tian and Shangda Wu and Tianhao Shen and Ge Zhang and Yuhang Wu and Cong Liu and Ziya Zhou and Ziyang Ma and Liumeng Xue and Ziyu Wang and Qin Liu and Tianyu Zheng and Yizhi Li and Yinghao Ma and Yiming Liang and Xiaowei Chi and Ruibo Liu and Zili Wang and Pengfei Li and Jingcheng Wu and Chenghua Lin and Qifeng Liu and Tao Jiang and Wenhao Huang and Wenhu Chen and Emmanouil Benetos and Jie Fu and Gus Xia and Roger Dannenberg and Wei Xue and Shiyin Kang and Yike Guo},
year={2024},
eprint={2402.16153},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```
## Dataset Card Contact
Authors of ChatMusician.