
vancenceho/spotify-lyrics-clean

Source: Hugging Face. Updated 2026-04-20. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/vancenceho/spotify-lyrics-clean
Resource description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify Million Song Lyrics Cleaned
size_categories:
- 10K<n<100K
---

# Spotify Million Song Lyrics Cleaned

CSV of **deduplicated, normalized lyrics** aligned to the Million Song-style raw lyrics dump used in the *viral-content-predictor* project. Each row is one **(artist, song)** identity after normalization; lyrics are cleaned text suitable for TF-IDF, retrieval, or joining to Spotify metadata via `artist_norm` / `title_norm`.

## File

| File | Role |
|------|------|
| `lyrics_cleaned.csv` | One row per normalized `(artist_norm, title_norm)`; includes raw display columns and lyric text. |

## Source

- **Upstream:** Raw lyrics table (e.g. `spotify_millsongdata.csv` in `data/raw/`), derived from the **Spotify Million Song Dataset**-style corpus.
- **Pipeline:** `notebooks/exploratory/explore_lyrics_clean.ipynb`: normalization, lyric text cleanup, **deduplication** (keeping the longest lyric per normalized artist-title pair where applicable).

## Schema (typical columns)

| Column | Description |
|--------|-------------|
| `artist` | Original artist string. |
| `song` | Original track title string. |
| `artist_norm` | Normalized artist key for joins. |
| `title_norm` | Normalized title key for joins. |
| `lyrics_char_len` | Character length of the lyric text (or similar size signal). |
| `lyrics` | Full lyric text (cleaned). |

Exact column order and any extra fields follow the exporting notebook revision.

## Usage

```python
import pandas as pd

lyr = pd.read_csv("lyrics_cleaned.csv")
# Join to Spotify-side tables on artist_norm / title_norm (and track_id after mapping)
```

## Size note

The **full** cleaned file can be **large** (multi-million characters / many rows). If your snapshot is bigger than `10K<n<100K`, update the Hub **`size_categories`** in the YAML (e.g. `1M<n<10M` or `n>10M`) so the dataset card matches reality.
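To check whether the declared `size_categories` still matches a given snapshot, the row count can be mapped to a bucket label. The sketch below is illustrative only: `size_category` is a hypothetical helper, the bucket labels are assumed to follow the Hub's usual `size_categories` vocabulary, and counts that fall exactly on a boundary are placed in the upper bucket by assumption.

```python
# Hypothetical helper: map a row count to a Hub-style size_categories label.
# The label strings below are an assumption about the Hub's vocabulary;
# verify them against the Hub documentation before editing the YAML.
_BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
    (1_000_000_000, "100M<n<1B"),
]


def size_category(n_rows: int) -> str:
    """Return the label of the first bucket whose upper bound exceeds n_rows."""
    if n_rows < 0:
        raise ValueError("row count cannot be negative")
    for upper_bound, label in _BUCKETS:
        if n_rows < upper_bound:
            return label
    # Larger buckets exist on the Hub but are out of scope for this sketch.
    raise ValueError("row count exceeds the buckets covered by this sketch")
```

With the CSV loaded as `lyr`, calling `size_category(len(lyr))` gives the label to compare against the value in the YAML header.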
## Limitations

- **Language:** Primarily **English** lyrics in practice; not language-tagged per row in this export.
- **Rights:** Lyrics are third-party content; redistribution must respect **CDLA-Sharing-1.0** (this card), the original dataset license, and applicable copyright rules in your jurisdiction.

## Citation

Cite the original **Million Song Dataset** / lyrics source you used, plus your repository or notebook revision for the cleaning steps.
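The card describes deduplication as keeping the longest lyric per normalized artist-title pair, and suggests joining on `artist_norm` / `title_norm`. The following is a minimal sketch of that idea on toy data, not the notebook's actual code: `normalize_key` is a hypothetical normalizer (lowercase, trim, collapse whitespace), the raw frame is assumed to have a `lyrics` column, and the `tracks` table with its `track_id` values is made up for illustration.

```python
import pandas as pd


def normalize_key(text: str) -> str:
    """Hypothetical normalizer: lowercase, trim, and collapse whitespace."""
    return " ".join(str(text).lower().split())


def dedupe_longest(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the longest lyric for each (artist_norm, title_norm) pair."""
    out = raw.copy()
    out["artist_norm"] = out["artist"].map(normalize_key)
    out["title_norm"] = out["song"].map(normalize_key)
    out["lyrics_char_len"] = out["lyrics"].str.len()
    # Sort longest-first, then keep the first row seen for each key.
    out = out.sort_values("lyrics_char_len", ascending=False, kind="stable")
    out = out.drop_duplicates(subset=["artist_norm", "title_norm"], keep="first")
    # Restore the original row order and renumber.
    return out.sort_index().reset_index(drop=True)


# Toy raw table: the first two rows collapse to the same normalized key.
raw = pd.DataFrame(
    {
        "artist": ["ABBA", "abba ", "Adele"],
        "song": ["Waterloo", "waterloo", "Hello"],
        "lyrics": ["short", "a much longer lyric", "hello lyric"],
    }
)
cleaned = dedupe_longest(raw)

# Made-up Spotify-side table keyed by the same normalized columns.
tracks = pd.DataFrame(
    {
        "artist_norm": ["adele"],
        "title_norm": ["hello"],
        "track_id": ["hypothetical-id-1"],
    }
)
# validate="one_to_one" raises if either side has duplicate keys.
joined = tracks.merge(
    cleaned, on=["artist_norm", "title_norm"], how="left", validate="one_to_one"
)
```

Sorting before `drop_duplicates` keeps the rule explicit and easy to audit; the `validate` argument on the merge is a cheap guard that the deduplication actually produced unique keys.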
Provider:
vancenceho