vancenceho/spotify-lyrics-clean

Name: vancenceho/spotify-lyrics-clean
Creator: vancenceho
Published: 2026-04-20 12:28:27
License: No description available

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/vancenceho/spotify-lyrics-clean

Resource description:

---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify Million Song Lyrics Cleaned
size_categories:
- 10K<n<100K
---

# Spotify Million Song Lyrics Cleaned

CSV of **deduplicated, normalized lyrics** aligned to the Million Song–style raw lyrics dump used in the *viral-content-predictor* project. Each row is one **(artist, song)** identity after normalization; lyrics are cleaned text suitable for TF-IDF, retrieval, or joining to Spotify metadata via `artist_norm` / `title_norm`.

## File

| File | Role |
|------|------|
| `lyrics_cleaned.csv` | One row per normalized `(artist_norm, title_norm)`; includes raw display columns and lyric text. |

## Source

- **Upstream:** Raw lyrics table (e.g. `spotify_millsongdata.csv` in `data/raw/`), derived from the **Spotify Million Song Dataset**–style corpus.
- **Pipeline:** `notebooks/exploratory/explore_lyrics_clean.ipynb`, which performs normalization, lyric text cleanup, and **deduplication** (keeping the longest lyric per normalized artist–title pair where applicable).

## Schema (typical columns)

| Column | Description |
|--------|-------------|
| `artist` | Original artist string. |
| `song` | Original track title string. |
| `artist_norm` | Normalized artist key for joins. |
| `title_norm` | Normalized title key for joins. |
| `lyrics_char_len` | Character length of the lyric text (or similar size signal). |
| `lyrics` | Full lyric text (cleaned). |

Exact column order and any extra fields follow the exporting notebook revision.

## Usage

```python
import pandas as pd

lyr = pd.read_csv("lyrics_cleaned.csv")
# Join to Spotify-side tables on artist_norm / title_norm (and track_id after mapping)
```

## Size note

The **full** cleaned file can be **large** (multi-million characters / many rows). If your snapshot has more rows than the `10K<n<100K` bucket allows, update the Hub **`size_categories`** in the YAML (e.g. `100K<n<1M` or `1M<n<10M`) so the dataset card matches reality.
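As a rough check for the size note, the row count of a snapshot can be mapped to a size bucket before editing the YAML. The following is a minimal sketch: `size_category` is a hypothetical helper, and the bucket labels are assumed to match the values the Hub accepts for `size_categories`, so confirm them against the current dataset card documentation.

```python
# Exclusive upper bounds and labels; assumed to mirror the Hub's size_categories values.
_BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
]


def size_category(n_rows: int) -> str:
    """Return the first bucket label whose upper bound exceeds n_rows."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    for upper, label in _BUCKETS:
        if n_rows < upper:
            return label
    raise ValueError("row count exceeds the buckets listed in this sketch")


# Typical use with the CSV from this card (not executed here):
#   import pandas as pd
#   print(size_category(len(pd.read_csv("lyrics_cleaned.csv"))))
```

Counting rows with `len(...)` on the loaded frame counts lyric records, which is what `n` is taken to mean here; it is not a count of characters or bytes.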
## Limitations

- **Language:** Primarily **English** lyrics in practice; not language-tagged per row in this export.
- **Rights:** Lyrics are third-party content; redistribution must respect **CDLA-Sharing-1.0** (this card), the original dataset license, and applicable copyright rules in your jurisdiction.

## Citation

Cite the original **Million Song Dataset** / lyrics source you used, plus your repository or notebook revision for the cleaning steps.
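The deduplication rule described under Source (keep the longest lyric per normalized artist–title pair) and the join on `artist_norm` / `title_norm` can be illustrated on toy data. This is only a sketch, not the notebook's code: `normalize_key` and `dedupe_longest` are hypothetical helpers, the normalization rules (lowercasing, stripping punctuation, collapsing whitespace) are assumptions, and the `spotify` frame with its `track_id` values is invented for illustration.

```python
import re

import pandas as pd


def normalize_key(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", str(text).lower())
    return re.sub(r"\s+", " ", text).strip()


def dedupe_longest(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the longest lyric for each (artist_norm, title_norm) pair."""
    df = raw.copy()
    df["artist_norm"] = df["artist"].map(normalize_key)
    df["title_norm"] = df["song"].map(normalize_key)
    df["lyrics_char_len"] = df["lyrics"].str.len()
    # Longest lyric first, so drop_duplicates(keep="first") retains it.
    df = df.sort_values("lyrics_char_len", ascending=False, kind="stable")
    df = df.drop_duplicates(subset=["artist_norm", "title_norm"], keep="first")
    return df.sort_index().reset_index(drop=True)


# Toy stand-in for the raw lyrics table.
raw = pd.DataFrame({
    "artist": ["ABBA", "Abba ", "Queen"],
    "song": ["Waterloo", "waterloo!", "Bohemian Rhapsody"],
    "lyrics": ["My my", "My my, at Waterloo", "Is this the real life"],
})
clean = dedupe_longest(raw)

# Hypothetical Spotify-side table carrying the same normalized keys.
spotify = pd.DataFrame({
    "artist_norm": ["abba", "queen"],
    "title_norm": ["waterloo", "bohemian rhapsody"],
    "track_id": ["t1", "t2"],
})
joined = clean.merge(spotify, on=["artist_norm", "title_norm"], how="left")
```

A left join keeps every lyric row even when no Spotify-side match exists, which makes unmatched keys easy to spot as missing `track_id` values.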

Provider:

vancenceho
