Dr3dre/Genius-song-lyrics-cleaned
Source: Hugging Face. Updated: 2026-01-08. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Dr3dre/Genius-song-lyrics-cleaned
Resource description:
---
dataset_info:
  features:
  - name: title
    dtype: string
  - name: tag
    dtype: string
  - name: artist
    dtype: string
  - name: year
    dtype: int64
  - name: views
    dtype: int64
  - name: features
    dtype: string
  - name: lyrics
    dtype: string
  - name: id
    dtype: int64
  - name: language_cld3
    dtype: string
  - name: language_ft
    dtype: string
  - name: language
    dtype: string
  - name: lyrics_clean
    dtype: string
  - name: char_len
    dtype: int64
  - name: word_len
    dtype: int64
  splits:
  - name: train
    num_bytes: 15892159664
    num_examples: 5134856
  download_size: 9356490174
  dataset_size: 15892159664
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
task_categories:
- text-classification
- sentence-similarity
language:
- en
- it
- ia
pretty_name: Genius Song Lyrics cleaned
size_categories:
- 1M<n<10M
---
# 🎵 Genius Song Lyrics cleaned Dataset
## Dataset Description
This dataset is derived from [Genius Song Lyrics](https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information). It contains **cleaned and normalized song lyrics** for more than **5 million songs** (5,134,856 examples in the `train` split) and is designed for **large-scale topic modeling, clustering, and semantic analysis**.
The dataset was specifically preprocessed to be compatible with **embedding-based models** (e.g. Sentence Transformers, BERTopic) while preserving lyrical meaning and thematic content.
Repetitive structures typical of song lyrics (e.g. choruses) are deduplicated so that repeated lines do not dominate the semantic representation of a song.
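
As an illustration of the embedding use case, the sketch below encodes `lyrics_clean` texts in batches with Sentence Transformers. The model name `all-MiniLM-L6-v2`, the batch size, and the helper names `batched` and `embed_lyrics` are assumptions made for this example; the dataset card does not prescribe a model. Note that embedding models truncate inputs beyond their maximum sequence length, so long lyrics may only be partially encoded.

```python
from typing import Iterable, Iterator, List


def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` texts."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def embed_lyrics(
    texts: Iterable[str],
    model_name: str = "all-MiniLM-L6-v2",  # assumed model, not from the dataset card
    batch_size: int = 64,                   # assumed batch size
):
    """Yield one NumPy array of embeddings per batch of lyrics."""
    # Imported lazily so `batched` can be used without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    for batch in batched(texts, batch_size):
        # encode() returns an array of shape (len(batch), embedding_dim).
        yield model.encode(batch, show_progress_bar=False)
```

Processing the texts batch by batch keeps memory bounded, which matters for a corpus of this size. A multilingual model may be a better fit if non-English lyrics are included.
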
---
## 🧾 Dataset Structure
Each example corresponds to **one song**.
### Fields
| Field | Type | Description |
|------|----|------------|
| `lyrics_clean` | `string` | Cleaned full lyrics |
| `char_len` | `int64` | Number of characters in `lyrics_clean` |
| `word_len` | `int64` | Number of whitespace-separated words |
| `artist` | `string` | Artist name *(optional)* |
| `title` | `string` | Song title *(optional)* |
| `tag` | `string` | Genre label *(if available)* |
| `language` | `string` | Language code (ISO-639-1) |
> Some metadata fields may be missing depending on source availability.
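
A minimal sketch of how these fields might be used, assuming the Hugging Face `datasets` library is installed. It streams the `train` split instead of downloading the full archive (roughly 9.4 GB according to the metadata above) and keeps only songs in one language with a minimum word count. The helper names `select_songs` and `stream_train_split`, and the `min_words=20` threshold, are hypothetical choices for illustration.

```python
from typing import Any, Dict, Iterable, Iterator


def select_songs(
    rows: Iterable[Dict[str, Any]],
    language: str = "en",
    min_words: int = 20,  # hypothetical threshold, not from the dataset card
) -> Iterator[Dict[str, Any]]:
    """Keep rows in the given language with at least `min_words` words."""
    for row in rows:
        # Metadata may be missing, so treat an absent word_len as 0.
        if row.get("language") == language and (row.get("word_len") or 0) >= min_words:
            yield row


def stream_train_split():
    """Return an iterable over the train split without a full download."""
    # Imported lazily so select_songs works without the dependency installed.
    from datasets import load_dataset

    return load_dataset(
        "Dr3dre/Genius-song-lyrics-cleaned", split="train", streaming=True
    )


# Example usage (requires network access):
# for song in select_songs(stream_train_split()):
#     print(song["title"], song["artist"], song["word_len"])
#     break
```

Because `select_songs` accepts any iterable of dictionaries, the same filter works on a streamed split or on a small in-memory sample.
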
---
## 🧼 Preprocessing
The following preprocessing steps were applied:
1. **Removed structural tags**
   - `[Chorus]`, `[Verse 1]`, `[Bridge]`, etc.
2. **Removed parenthetical repetition markers**
   - `(x2)`, `(repeat)`, etc.
3. **Deduplicated repeated lyric lines**
   - Common in choruses and hooks
4. **Lowercased text**
5. **Whitespace normalization**
No stemming, lemmatization, or stopword removal was applied to preserve semantic meaning for embedding-based models.
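
The steps above can be sketched roughly as follows. This is an approximation under stated assumptions, not the code used to build the dataset: the regular expressions, the choice to lowercase before comparing lines for duplicates, and the choice to keep one line per row in the output are all guesses made for illustration. The helpers `clean_lyrics` and `length_stats` are hypothetical names.

```python
import re
from typing import Tuple

# Assumed pattern for structural tags such as [Chorus] or [Verse 1].
SECTION_TAG = re.compile(r"\[[^\]\n]*\]")
# Assumed pattern for repetition markers such as (x2), (2x) or (repeat).
REPEAT_MARKER = re.compile(
    r"\(\s*(?:x\s*\d+|\d+\s*x|repeat[^)\n]*)\s*\)", re.IGNORECASE
)


def clean_lyrics(raw: str) -> str:
    """Apply the five listed steps to one song's raw lyrics."""
    text = SECTION_TAG.sub(" ", raw)      # step 1: drop structural tags
    text = REPEAT_MARKER.sub(" ", text)   # step 2: drop repetition markers
    seen = set()
    kept = []
    for line in text.splitlines():
        # Steps 4 and 5: lowercase and collapse runs of whitespace.
        line = " ".join(line.lower().split())
        # Step 3: keep only the first occurrence of each line.
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def length_stats(cleaned: str) -> Tuple[int, int]:
    """Return (char_len, word_len) as described in the field table."""
    return len(cleaned), len(cleaned.split())
```

For example, `clean_lyrics("[Chorus]\nSing along (x2)\nSing along (x2)")` yields `"sing along"`. Normalizing each line before the duplicate check means that lines differing only in case or spacing count as repeats.
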
Provider:
Dr3dre



