vancenceho/spotify-tracks-clean
Source: Hugging Face | Updated: 2026-04-20 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/vancenceho/spotify-tracks-clean
Resource description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify Tracks Cleaned
size_categories:
- 10K<n<100K
---
# Spotify Tracks Cleaned
CSV of **Spotify track metadata and audio features**, cleaned for downstream modeling (e.g. virality / popularity prediction, joins with lyrics or YouTube data). Each row is a unique track; the schema follows the common Kaggle-style “Spotify tracks” dump, normalized in `01a_clean_spotify.ipynb`.
## File
| File | Role |
|------|------|
| `spotify_tracks_cleaned.csv` | One row per track; cleaned text fields and consistent dtypes for ML preprocessing. |
## Source
- **Upstream:** Spotify track-level table (e.g. `spotify_tracks.csv` from `data/raw/`, or equivalent on Kaggle).
- **Pipeline:** `notebooks/01a_clean_spotify.ipynb` — loads raw CSV, applies cleaning, writes to `data/cleaned/spotify_tracks_cleaned.csv`.
## Cleaning (summary)
- Strip / normalize whitespace and casing on selected **text** columns (e.g. `artists`, `album_name`, `track_name`, `track_genre`).
- Collapse repeated internal spaces; empty strings → missing where appropriate.
- Additional steps in the notebook (missing-value handling, typing, deduplication if present) are documented in the notebook cells.
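The whitespace steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `clean_text_columns` is hypothetical, the column list is taken from the examples in the bullets, and casing normalization is left out because the card does not say which casing rule the notebook applies.

```python
import pandas as pd

# Assumed column list, copied from the examples in the bullets above.
TEXT_COLUMNS = ["artists", "album_name", "track_name", "track_genre"]


def clean_text_columns(df: pd.DataFrame, columns=TEXT_COLUMNS) -> pd.DataFrame:
    """Hypothetical helper: strip, collapse internal whitespace, empty -> missing."""
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            continue  # tolerate exports that lack one of the text columns
        s = out[col].astype("string")
        s = s.str.strip()                           # leading / trailing whitespace
        s = s.str.replace(r"\s+", " ", regex=True)  # repeated internal whitespace
        is_empty = s.eq("").fillna(False)           # missing values are not "empty"
        out[col] = s.mask(is_empty)                 # empty strings become missing
    return out


if __name__ == "__main__":
    raw = pd.DataFrame({"artists": ["  Daft   Punk ", "", None]})
    print(clean_text_columns(raw))
```

Non-text columns are passed through untouched, so the helper can be applied to the full table before any typing or deduplication step.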
## Schema (typical columns)
Exact dtypes may vary by export; expect roughly:
| Column | Description |
|--------|-------------|
| `track_id` | Spotify track identifier (join key). |
| `artists`, `album_name`, `track_name` | Text metadata (cleaned). |
| `popularity` | Spotify popularity score. |
| `duration_ms` | Track length in milliseconds. |
| `explicit` | Explicit content flag. |
| `danceability` … `time_signature` | Spotify audio features (mostly 0–1 scores, plus loudness in dB, tempo in BPM, and integer-coded fields such as `time_signature`, as in the source API export). |
| `track_genre` | Genre label (cleaned). |
An index column such as `Unnamed: 0` may appear if inherited from the raw CSV.
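A quick way to handle the leftover index column and confirm the table matches the schema above is sketched here. The helper names `drop_inherited_index` and `missing_columns` are hypothetical, and the expected-column list contains only the names spelled out in the table (the audio-feature range is represented by its two endpoints).

```python
import pandas as pd

# Only the column names that the schema table spells out explicitly.
EXPECTED_COLUMNS = [
    "track_id", "artists", "album_name", "track_name", "popularity",
    "duration_ms", "explicit", "danceability", "time_signature", "track_genre",
]


def drop_inherited_index(df: pd.DataFrame) -> pd.DataFrame:
    """Drop 'Unnamed: N' columns left over from a previously saved pandas index."""
    leftover = [c for c in df.columns if str(c).startswith("Unnamed:")]
    return df.drop(columns=leftover)


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected columns that are absent from the frame."""
    return [c for c in expected if c not in df.columns]


if __name__ == "__main__":
    sample = pd.DataFrame({"Unnamed: 0": [0], "track_id": ["abc"], "popularity": [50]})
    sample = drop_inherited_index(sample)
    print(list(sample.columns))
    print(missing_columns(sample))
```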
A full Kaggle-style export is often **~114k** rows; if so, use Hub `size_categories` **`100K<n<1M`** instead of the frontmatter value above (adjust the YAML to match your actual file).
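To pick the right `size_categories` value, the row count can be mapped to a Hub bucket label. The function below is an illustrative sketch: the name `hub_size_category` is hypothetical, only the buckets up to one million rows are covered, and treating an exact boundary count (e.g. exactly 100,000 rows) as belonging to the larger bucket is an assumption.

```python
def hub_size_category(n_rows: int) -> str:
    """Map a row count to a Hub size_categories label (buckets below 1M only)."""
    buckets = [
        (1_000, "n<1K"),
        (10_000, "1K<n<10K"),
        (100_000, "10K<n<100K"),
        (1_000_000, "100K<n<1M"),
    ]
    for upper, label in buckets:
        if n_rows < upper:  # assumption: boundary counts fall into the larger bucket
            return label
    raise ValueError("row count is 1M or more; larger buckets are not covered here")


if __name__ == "__main__":
    # For the real file, pass len(df) after loading the CSV.
    print(hub_size_category(114_000))
```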
## Usage
```python
import pandas as pd
df = pd.read_csv("spotify_tracks_cleaned.csv")
```
Use `track_id` to align with lyrics, YouTube, or audio-feature tables in the same project.
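A join on `track_id` might look like the sketch below. The toy frames stand in for the real tables; the `lyrics` table and its column are hypothetical. A left join keeps every track, and `indicator=True` adds a `_merge` column that shows which tracks found no match.

```python
import pandas as pd

# Toy stand-ins for the cleaned tracks table and a hypothetical lyrics table.
tracks = pd.DataFrame({"track_id": ["t1", "t2", "t3"], "popularity": [70, 55, 12]})
lyrics = pd.DataFrame({"track_id": ["t1", "t3", "t9"], "lyrics": ["la la", "na na", "unused"]})

# Left join keeps every track; _merge records whether a lyrics row was found.
merged = tracks.merge(lyrics, on="track_id", how="left", indicator=True)

# Tracks without a matching lyrics row.
unmatched = merged.loc[merged["_merge"] == "left_only", "track_id"].tolist()
print(unmatched)
```

If the other table can hold several rows per track (for example, multiple YouTube videos), the merged frame will have more rows than `tracks`, so checking `len(merged)` against `len(tracks)` after the join is a cheap sanity test.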
## Limitations
- **Not** real-time Spotify data; snapshot reflects the date of the raw pull.
- **License / ToS:** Redistribution must comply with **CDLA-Sharing-1.0** (this card), the original dataset license, and Spotify’s terms of use for derived datasets.
## Citation
If you use this artifact in research, cite the original Spotify / Kaggle dataset you built from, and reference your cleaning notebook or repository revision.
Provider:
vancenceho



