adricl/midi_godzilla_piano_webdataset_1024
Source: Hugging Face. Updated 2026-03-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/adricl/midi_godzilla_piano_webdataset_1024
Resource description:
---
license: apache-2.0
size_categories:
- 1M<n<10M
tags:
- Godzilla
- MIDI
- MIDI dataset
- MIDI music
- giant
- raw
- searchable
- comprehensive
- music
- music ai
- MIR
- webdataset
---
# MIDI Godzilla Piano WebDataset split into 1024-token chunks
A WebDataset-format version of the [Godzilla MIDI Dataset](https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset) from Project Los Angeles.
This dataset card aims to be a base template for new datasets. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md?plain=1).
## Dataset Details
### Dataset Description
This dataset was created by splitting the MIDI files into chunks of 1024 tokens.
We then split the training set into 70% training, 15% validation, and 15% test.
We augment the MIDI with miditok's `augment_dataset`:
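````
# Import path per MidiTok's data augmentation module.
from miditok.data_augmentation import augment_dataset

# subset_chunks_dir is defined elsewhere in the author's script.
augment_dataset(
    subset_chunks_dir,
    pitch_offsets=[-12, 12],
    velocity_offsets=[-4, 4],
    duration_offsets=[-1, 1],
)
````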
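The card does not say how the 70/15/15 split mentioned above was drawn. As a rough illustration only, the sketch below assumes a seeded random shuffle over a list of chunk file names; the function name `split_70_15_15`, the seed, and the file names are hypothetical and not taken from the author's code.

```python
import random

def split_70_15_15(items, seed=0):
    """Shuffle items and return (train, validation, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    # Integer arithmetic avoids floating-point rounding on the boundaries.
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical chunk names, for illustration only.
names = [f"chunk_{i}.mid" for i in range(100)]
train, val, test = split_70_15_15(names)
print(len(train), len(val), len(test))
```

Putting any rounding remainder into the test set keeps the three parts disjoint and exhaustive for any input length.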
The files are then written out and saved in WebDataset format:
````
sample = {
    "__key__": str(midi_path.relative_to(root_dir_path)),
    "id": i,
    "midi_file": midi_data,
}
````
The key name is "mid.midi_file".
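One plausible reading of that key name: if `__key__` is a path ending in `.mid` and the field is `midi_file`, the tar member is named `<path>.mid.midi_file`, and WebDataset's default convention of splitting the basename at its first period gives the field name `mid.midi_file`. The sketch below mirrors that convention with a regular expression. It is an illustration under that assumption, not code from this dataset's pipeline, and the member name in it is made up.

```python
import re

# Everything up to the first period in the basename is treated as the sample
# key; the remainder is treated as the field name (the "extension").
_KEY_EXT = re.compile(r"^((?:.*/)?[^/.]+)\.([^/]*)$")

def split_member_name(name):
    """Split a tar member name into (sample key, field name)."""
    match = _KEY_EXT.match(name)
    if match is None:
        return None, None  # no period in the basename
    return match.group(1), match.group(2)

# Hypothetical member: "__key__" was "chunks/song_0.mid", field "midi_file".
key, field = split_member_name("chunks/song_0.mid.midi_file")
print(key, field)
```

Under the same assumption, the `id` field of a sample would surface as `mid.id`.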
### Dataset Sources
- **Repository:** [Godzilla MIDI Dataset](https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset) from Project Los Angeles
- **Demo:** [Midi Jam Session](https://huggingface.co/spaces/adricl/MidiJamSession)
## Uses
This dataset is intended for training MIDI-based transformers, or for any task that needs MIDI in 1024-token chunks.
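Because WebDataset shards are ordinary tar archives, the MIDI chunks can be pulled out with the Python standard library alone. The following is a minimal sketch, assuming members holding MIDI bytes have names ending in `.midi_file`; the helper name `iter_midi_chunks` and the demo member names are hypothetical. It builds a tiny in-memory shard so it runs without downloading anything.

```python
import io
import tarfile

def iter_midi_chunks(fileobj, suffix=".midi_file"):
    """Yield (member name, raw bytes) for each member whose name ends in suffix."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                yield member.name, tar.extractfile(member).read()

# Build a tiny in-memory shard; names and payloads are made up for illustration.
demo_shard = io.BytesIO()
with tarfile.open(fileobj=demo_shard, mode="w") as tar:
    for name, payload in [("song_0.mid.id", b"0"), ("song_0.mid.midi_file", b"MThd")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
demo_shard.seek(0)

chunks = list(iter_midi_chunks(demo_shard))
print(chunks)
```

For a real shard, pass a file opened in binary mode in place of `demo_shard`; the returned bytes can then be handed to whichever MIDI parser or tokenizer the training code uses.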
## Dataset Structure
[More Information Needed]
## Dataset Creation
### Curation Rationale
<!-- Motivation for the creation of this dataset. -->
[More Information Needed]
### Source Data
[Godzilla MIDI Dataset](https://huggingface.co/datasets/projectlosangeles/Godzilla-MIDI-Dataset)
Provider:
adricl



