vancenceho/spotify-lyrics-clean
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/vancenceho/spotify-lyrics-clean
Description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify Million Song Lyrics Cleaned
size_categories:
- 10K<n<100K
---
# Spotify Million Song Lyrics Cleaned
CSV of **deduplicated, normalized lyrics** aligned to the Million Song–style raw lyrics dump used in the *viral-content-predictor* project. Each row is one **(artist, song)** identity after normalization; lyrics are cleaned text suitable for TF‑IDF, retrieval, or joining to Spotify metadata via `artist_norm` / `title_norm`.
## File
| File | Role |
|------|------|
| `lyrics_cleaned.csv` | One row per normalized `(artist_norm, title_norm)`; includes raw display columns and lyric text. |
## Source
- **Upstream:** Raw lyrics table (e.g. `spotify_millsongdata.csv` in `data/raw/`), derived from the **Spotify Million Song Dataset**–style corpus.
- **Pipeline:** `notebooks/exploratory/explore_lyrics_clean.ipynb` — normalization, lyric text cleanup, **deduplication** (keeping the longest lyric per normalized artist–title pair where applicable).
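The normalization and deduplication steps listed above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: `normalize_key` and its rules (accent stripping, lowercasing, collapsing punctuation) are assumptions, and the toy rows are invented for illustration. Only the "keep the longest lyric per normalized pair" rule comes from this card.

```python
import re
import unicodedata

import pandas as pd


def normalize_key(text: str) -> str:
    """Hypothetical normalizer; the notebook's real rules may differ."""
    # Decompose accented characters and drop the combining marks
    text = unicodedata.normalize("NFKD", str(text))
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, turn runs of non-alphanumerics into one space, then trim
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


# Toy rows standing in for the raw lyrics table (invented data)
raw = pd.DataFrame(
    {
        "artist": ["ABBA", "Abba ", "Beyoncé"],
        "song": ["Dancing Queen", "dancing queen!", "Halo"],
        "lyrics": [
            "You can dance",
            "You can dance, you can jive",
            "Remember those walls",
        ],
    }
)

raw["artist_norm"] = raw["artist"].map(normalize_key)
raw["title_norm"] = raw["song"].map(normalize_key)
raw["lyrics_char_len"] = raw["lyrics"].str.len()

# Keep the longest lyric for each normalized (artist, title) pair
cleaned = (
    raw.sort_values("lyrics_char_len", ascending=False)
    .drop_duplicates(subset=["artist_norm", "title_norm"], keep="first")
    .sort_values(["artist_norm", "title_norm"])
    .reset_index(drop=True)
)
```

Sorting by length before `drop_duplicates(keep="first")` is one simple way to express "keep the longest"; a `groupby(...).idxmax()` on `lyrics_char_len` would be an equivalent alternative.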
## Schema (typical columns)
| Column | Description |
|--------|-------------|
| `artist` | Original artist string. |
| `song` | Original track title string. |
| `artist_norm` | Normalized artist key for joins. |
| `title_norm` | Normalized title key for joins. |
| `lyrics_char_len` | Character length of the lyric text (or similar size signal). |
| `lyrics` | Full lyric text (cleaned). |
Exact column order and any extra fields follow the exporting notebook revision.
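Because the exact columns depend on the notebook revision, a quick sanity check after loading may be useful. The sketch below is an assumption-laden example: `EXPECTED_COLUMNS` simply mirrors the table above, and `check_lyrics_frame` is a hypothetical helper, not part of the project.

```python
import pandas as pd

# Column names taken from the schema table in this card
EXPECTED_COLUMNS = [
    "artist",
    "song",
    "artist_norm",
    "title_norm",
    "lyrics_char_len",
    "lyrics",
]


def check_lyrics_frame(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Later checks need these columns, so stop here
        problems.append(f"missing columns: {missing}")
        return problems
    # The card promises one row per normalized (artist, title) pair
    if df.duplicated(subset=["artist_norm", "title_norm"]).any():
        problems.append("duplicate (artist_norm, title_norm) keys")
    if df["lyrics"].isna().any():
        problems.append("rows with missing lyrics")
    return problems


# Toy frame (invented data) to show the helper in use
good = pd.DataFrame(
    {
        "artist": ["ABBA"],
        "song": ["Dancing Queen"],
        "artist_norm": ["abba"],
        "title_norm": ["dancing queen"],
        "lyrics_char_len": [13],
        "lyrics": ["You can dance"],
    }
)
duplicated_rows = pd.concat([good, good], ignore_index=True)

good_report = check_lyrics_frame(good)
dup_report = check_lyrics_frame(duplicated_rows)
missing_report = check_lyrics_frame(good.drop(columns=["lyrics"]))
```

The helper does not verify that `lyrics_char_len` equals the lyric length, since the card leaves open that the column may hold a different size signal.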
## Usage
```python
import pandas as pd

# Load the cleaned lyrics table (one row per normalized artist/title pair)
lyr = pd.read_csv("lyrics_cleaned.csv")
# Join to Spotify-side tables on artist_norm / title_norm (and track_id after mapping)
```
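The join mentioned in the comment above might be sketched as follows. The two small frames are invented stand-ins (in practice `lyr` would come from `lyrics_cleaned.csv`), and the shape of the Spotify-side table, a `track_id` column plus the two normalized keys, is an assumption.

```python
import pandas as pd

# Toy stand-in for the cleaned lyrics table (invented data)
lyr = pd.DataFrame(
    {
        "artist_norm": ["abba", "adele"],
        "title_norm": ["dancing queen", "hello"],
        "lyrics": ["You can dance", "Hello, it's me"],
    }
)

# Hypothetical Spotify-side table that already carries the normalized keys
spotify = pd.DataFrame(
    {
        "track_id": ["t1", "t2"],
        "artist_norm": ["abba", "queen"],
        "title_norm": ["dancing queen", "bohemian rhapsody"],
    }
)

merged = lyr.merge(
    spotify,
    on=["artist_norm", "title_norm"],
    how="left",              # keep every lyric row, matched or not
    validate="one_to_many",  # lyric keys must be unique; tracks may repeat
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

matched = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] == "left_only"]
```

`validate="one_to_many"` makes pandas raise if the lyric side ever contains a repeated key, which is a cheap guard on the deduplication guarantee, while still allowing several Spotify tracks (for example, reissues) to share one lyric row.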
## Size note
The **full** cleaned file can be **large** (many rows, with lyric text totaling millions of characters). `size_categories` describes the number of rows, so if your snapshot holds more than 100K rows, update the Hub **`size_categories`** in the YAML (e.g. `100K<n<1M` or `1M<n<10M`) so the dataset card matches reality.
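A small helper can pick the bucket from a row count. This is a sketch: `size_category` is a hypothetical function, and the labels follow the Hub's bucket naming as understood here, so check them against the Hub's dataset card documentation before relying on them. How an exact boundary value (e.g. exactly 100,000 rows) should be labeled is also an assumption.

```python
# (exclusive upper bound, label) pairs, smallest bucket first
SIZE_BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
]


def size_category(n_rows: int) -> str:
    """Map a row count to a size_categories label (sketch, see caveats above)."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    for upper, label in SIZE_BUCKETS:
        if n_rows < upper:
            return label
    # Larger buckets exist on the Hub but are deliberately left out of this sketch
    raise ValueError("row count exceeds the buckets covered by this sketch")
```

With the table loaded as in the Usage section, `size_category(len(lyr))` gives the label to put in the YAML.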
## Limitations
- **Language:** Primarily **English** lyrics in practice; not language-tagged per row in this export.
- **Rights:** Lyrics are third-party content; redistribution must respect **CDLA-Sharing-1.0** (this card), the original dataset license, and applicable copyright rules in your jurisdiction.
## Citation
Cite the original **Million Song Dataset** / lyrics source you used, plus your repository or notebook revision for the cleaning steps.
Provider:
vancenceho



