asigalov61/Lyrics-MIDI-Dataset
Source: Hugging Face. Updated 2025-11-25; indexed 2025-12-20.
Download link: https://hf-mirror.com/datasets/asigalov61/Lyrics-MIDI-Dataset
Resource description:
---
license: cc-by-nc-sa-4.0
language:
- en
tags:
- lyrics
- MIDI
- song-lyrics
- midi-lyrics
- lyrics-midi
- karaoke
- music
pretty_name: lyricsmidi
size_categories:
- 100K<n<1M
attachments:
- path: Lyrics-MIDI-Dataset-CC-BY-NC-SA.zip
description: "Complete archive containing MIDIs and lyrics"
---
# Lyrics MIDI Dataset
## ~179k original MIDI files with matched lyrics

***
## Overview
The Lyrics MIDI Dataset is a large-scale multimodal collection of symbolic music files paired with corresponding lyrics in plain text. It enables research on lyric-conditioned music generation, alignment between textual and musical representations, and cross-domain tasks that leverage both modalities. All MIDIs are original, sourced from established datasets; lyric files are matched at high confidence to support reliable training and evaluation.

---
### Composition and statistics
- **Total scope:** 179,562 original MIDI files with respective lyric files in `.txt` format.
- **Clean subset (deduped):** 47,537 MIDI/lyrics pairs for benchmarking, ablation studies, and reproducible experiments.
- **Full set (non-deduped):** 179,562 MIDI/lyrics pairs providing diversity and coverage across styles and sources.
- **Match confidence:** Lyrics were matched at confidence scores between 0.9 and 1.0 (90–100%), emphasizing strong text–symbolic alignment and match precision.
- **Sources:** MIDIs originate from established public datasets (e.g., Lakh MIDI, MetaMIDI, Tegridy, Sourdough MIDI, Popular Hook), maintaining authenticity and traceability to original compilers and creators.
---
### Data format and structure
- **MIDI files:** Standard `.mid` files containing symbolic note, timing, and controller information compatible with common DAWs and MIR toolkits.
- **Lyric files:** Paired `.txt` files with song lyrics in plain UTF-8 text for direct ingestion by NLP pipelines.
- **Pairing convention:** Each MIDI has a corresponding lyric file; directory organization and filename conventions are consistent to facilitate deterministic loading.
- **Supplemental Python code:** The dataset ships with supplemental Python code from [tegridy-tools](https://github.com/asigalov61/tegridy-tools) to make it easier to work with.
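
The pairing convention above can be turned into a small loader. The following is a minimal sketch, not the dataset's official loader: it assumes that each MIDI file and its lyric file sit in the same directory and share a file stem (`song.mid` and `song.txt`), which the card does not spell out. The function names `pair_midi_with_lyrics` and `read_lyrics` are hypothetical. Paths are sorted so that repeated runs return the pairs in the same order.

```python
from pathlib import Path


def pair_midi_with_lyrics(root):
    """Return sorted (midi_path, lyrics_path) tuples found under root.

    Assumption (not confirmed by the dataset card): a MIDI file and its
    lyric file live in the same directory and share the same file stem.
    """
    root = Path(root)
    pairs = []
    # Sorting makes the iteration order independent of the filesystem.
    for midi_path in sorted(root.rglob("*.mid")):
        lyrics_path = midi_path.with_suffix(".txt")
        if lyrics_path.is_file():
            pairs.append((midi_path, lyrics_path))
    return pairs


def read_lyrics(lyrics_path):
    """Read a lyric file as UTF-8, replacing any undecodable bytes."""
    return Path(lyrics_path).read_text(encoding="utf-8", errors="replace")
```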
---
### Matching and deduplication
- **High-confidence pairing:** Matches were accepted only in the 0.9–1.0 confidence range to reduce false alignments and support trustworthy multimodal training.
- **Deduped subset:** A curated set of 47,537 unique MIDI/lyrics pairs is provided to minimize repeated content and near-identical variations, suitable for benchmarking and model validation.
- **Non-deduped subset:** A larger set of 179,562 pairs captures stylistic breadth, source diversity, and alternative versions, useful for pretraining and robustness studies.
- **Provenance:** Credit and ownership for the content remain with the original source datasets and creators; pairing preserves traceability and respects dataset boundaries.
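
The card does not describe how the deduped subset was built, so the following is only an illustration of the simplest form of deduplication: dropping byte-identical files by content hash. It is a sketch under that assumption, with hypothetical helper names (`file_digest`, `drop_exact_duplicates`). It will not catch the near-identical variations mentioned above, which need a music-aware comparison.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def drop_exact_duplicates(midi_paths):
    """Keep the first file (in sorted path order) for each distinct content hash.

    This removes byte-identical copies only; it is NOT the method used to
    build the dataset's deduped subset, which the card does not document.
    """
    seen = set()
    unique = []
    for path in sorted(map(Path, midi_paths)):
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```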
---
### Suggested use cases
- **Lyric-conditioned music generation:** Train sequence-to-sequence or diffusion-style models that synthesize MIDI from lyrics, enabling controllable, text-driven composition.
- **Multimodal alignment and retrieval:** Learn joint embeddings to retrieve lyrics from MIDI (and vice versa), or to align textual themes with musical structure.
- **Emotion and theme modeling:** Map lyric sentiment/emotion to musical features (tempo, key, chord progressions) for affect-aware composition and analysis.
- **Genre classification and style transfer:** Use paired text–music signals to improve genre labeling and to guide stylistic transformations conditioned on lyric content.
- **Structure and segmentation:** Investigate correlations between lyrical form (verses, choruses) and musical sections for segmentation, hook detection, and arrangement tasks.
- **Evaluation benchmarks:** Utilize the deduped subset for reproducible benchmarks, ablations, and diagnostic testing of multimodal modeling pipelines.
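
For the reproducible benchmarks mentioned above, one common approach is to derive the train/validation/test split from a hash of a stable key, so that the assignment does not depend on file order or a random seed. The sketch below assumes the key is something like the file stem; the function name `assign_split` and the default fractions are illustrative choices, not part of the dataset.

```python
import hashlib


def assign_split(key, val_fraction=0.05, test_fraction=0.05):
    """Deterministically map a string key to 'train', 'val', or 'test'.

    The key (for example a file stem, an assumption here) is hashed with
    SHA-256 and the first 8 bytes are scaled to a float in [0, 1).
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```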
---
### Notes and considerations
- **Data quality:** High-confidence matching favors precision; users may optionally apply stricter filters or additional heuristics for domain-specific needs.
- **Reproducibility:** The deduped subset is recommended for baselines and published benchmarks; the non-deduped subset is better for broader pretraining.
- **Ethical use:** Respect original creators’ rights and dataset licenses; avoid generating content that misrepresents attribution or implies ownership beyond permitted use.
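
As an example of the additional heuristics mentioned under data quality, a lyric file can be screened by length and vocabulary variety before training. This is a minimal sketch: the function name `passes_lyric_heuristics` and both thresholds are arbitrary illustrative values, not recommendations from the dataset authors.

```python
def passes_lyric_heuristics(text, min_words=20, min_unique_ratio=0.2):
    """Return True if the lyric text looks usable under two simple checks.

    Checks (both thresholds are illustrative assumptions):
      1. the text has at least `min_words` whitespace-separated words;
      2. the share of distinct lowercase words is at least `min_unique_ratio`,
         which screens out files that repeat one or two tokens.
    """
    words = text.lower().split()
    if len(words) < min_words:
        return False
    unique_ratio = len(set(words)) / len(words)
    return unique_ratio >= min_unique_ratio
```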
***
## License
- **License:** CC-BY-NC-SA 4.0.
- **Attribution:** Credit for MIDIs and lyrics belongs to the respective source datasets and the original creators who compiled them.
- **Non-commercial use:** Redistribution and derivative works must be non-commercial, provide attribution, and be shared under the same license.
***
## Attribution
### Source Lyrics Datasets (Hugging Face)
- [smgriffin/modern-pop-lyrics](https://huggingface.co/datasets/smgriffin/modern-pop-lyrics) — ~17k curated modern pop lyrics sourced from Genius, useful for NLP and lyric analysis
- [ernestchu/lyrics-emotion-classification](https://huggingface.co/datasets/ernestchu/lyrics-emotion-classification) — ~20k lyrics labeled with emotional categories for text classification tasks
- [aifeifei798/song_lyrics_min](https://huggingface.co/datasets/aifeifei798/song_lyrics_min) — Massive dataset (~3.3M rows) of multilingual song lyrics for large‑scale training
- [Yegor25/lyrics_genre_dataset_large](https://huggingface.co/datasets/Yegor25/lyrics_genre_dataset_large) — Large dataset of lyrics with genre labels for supervised genre classification
- [mrYou/lyrics-dataset](https://huggingface.co/datasets/mrYou/lyrics-dataset) — ~30k songs with metadata (artist, year, views) and lyrics for general NLP tasks
- [mrYou/Lyrics_eng_dataset](https://huggingface.co/datasets/mrYou/lyrics-dataset) — English subset of mrYou’s lyrics dataset, focused on English‑language songs
- [PJMixers-Dev/bigdata-pw_Lyrics1M-en](https://huggingface.co/datasets/bigdata-pw/Lyrics1M) — 1M+ English lyrics with artist/title metadata, aligned with Spotify tracks
- [SpartanCinder/song-lyrics-artist-classifier](https://huggingface.co/datasets/SpartanCinder/song-lyrics-artist-classifier) — ~13k songs labeled by artist for lyric‑based artist classification
- [tsterbak/lyrics-dataset](https://huggingface.co/datasets/tsterbak/lyrics-dataset) — ~158k songs with artist and lyric text, suitable for large‑scale lyric modeling
- [NEXTLab-ZJU/popular-hook](https://huggingface.co/datasets/NEXTLab-ZJU/popular-hook) — Multimodal dataset of ~38k musical “hooks” with MIDI, lyrics, audio, and emotion annotations
### Source MIDI Datasets (Hugging Face)
- [NEXTLab-ZJU/popular-hook](https://huggingface.co/datasets/NEXTLab-ZJU/popular-hook) — Musical hooks dataset with aligned MIDI, lyrics, audio, and metadata
- [BreadAi/Sourdough-midi-dataset](https://huggingface.co/datasets/BreadAi/Sourdough-midi-dataset) — Largest public MIDI dataset (~5M files), deduplicated for symbolic music modeling
### Source MIDI Datasets (Other)
- [Lakh MIDI Dataset](https://colinraffel.com/projects/lmd/) — 176k MIDI files, with 45k aligned to the Million Song Dataset for MIR research
- [MetaMIDI Dataset](https://github.com/jeffreyjohnens/MetaMIDIDataset) — 436k MIDI files with metadata, matched to Spotify and MusicBrainz tracks
- [Tegridy MIDI Dataset](https://github.com/asigalov61/Tegridy-MIDI-Dataset) — Comprehensive symbolic MIDI dataset curated for training precise music AI models
***
## Citations
```bibtex
@misc{NEXTLabZJU2023PopularHook,
author = {NEXTLab-ZJU},
title = {Popular Hook Dataset},
year = {2023},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/NEXTLab-ZJU/popular-hook}},
note = {Multimodal dataset of musical hooks with MIDI, lyrics, audio, and annotations}
}
```
```bibtex
@misc{BreadAi2024SourdoughMIDI,
author = {BreadAi},
title = {Sourdough MIDI Dataset},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/BreadAi/Sourdough-midi-dataset}},
note = {Large-scale deduplicated MIDI dataset for symbolic music modeling}
}
```
```bibtex
@misc{Raffel2016LakhMIDI,
author = {Colin Raffel},
title = {The Lakh MIDI Dataset},
year = {2016},
howpublished = {\url{https://colinraffel.com/projects/lmd/}},
note = {176k MIDI files with 45k aligned to the Million Song Dataset}
}
```
```bibtex
@misc{Johnens2020MetaMIDI,
author = {Jeffrey Johnens},
title = {MetaMIDI Dataset},
year = {2020},
howpublished = {\url{https://github.com/jeffreyjohnens/MetaMIDIDataset}},
note = {436k MIDI files with metadata matched to Spotify and MusicBrainz}
}
```
```bibtex
@misc{Asigalov2021TegridyMIDI,
author = {Alex Lev},
title = {Tegridy MIDI Dataset},
year = {2021},
howpublished = {\url{https://github.com/asigalov61/Tegridy-MIDI-Dataset}},
note = {Ultimate Multi-Instrumental MIDI Dataset for MIR and Music AI purposes}
}
```
***
### Project Los Angeles
### Tegridy Code 2025
---
Provided by:
asigalov61



