
vancenceho/spotify-tracks-clean

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/vancenceho/spotify-tracks-clean
Description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify Tracks Cleaned
size_categories:
- 10K<n<100K
---

# Spotify Tracks Cleaned

CSV of **Spotify track metadata and audio features** after cleaning for downstream modeling (e.g. viral / popularity prediction, joins with lyrics or YouTube). Rows are unique tracks; the schema follows the common Kaggle-style “Spotify tracks” dump, normalized in `01a_clean_spotify.ipynb`.

## File

| File | Role |
|------|------|
| `spotify_tracks_cleaned.csv` | One row per track; cleaned text fields and consistent dtypes for ML preprocessing. |

## Source

- **Upstream:** Spotify track-level table (e.g. `spotify_tracks.csv` from `data/raw/`, or equivalent on Kaggle).
- **Pipeline:** `notebooks/01a_clean_spotify.ipynb`: loads raw CSV, applies cleaning, writes to `data/cleaned/spotify_tracks_cleaned.csv`.

## Cleaning (summary)

- Strip / normalize whitespace and casing on selected **text** columns (e.g. `artists`, `album_name`, `track_name`, `track_genre`).
- Collapse repeated internal spaces; empty strings → missing where appropriate.
- Additional steps in the notebook (missing-value handling, typing, deduplication if present) are documented in the notebook cells.

## Schema (typical columns)

Exact dtypes may vary by export; expect roughly:

| Column | Description |
|--------|-------------|
| `track_id` | Spotify track identifier (join key). |
| `artists`, `album_name`, `track_name` | Text metadata (cleaned). |
| `popularity` | Spotify popularity score. |
| `duration_ms` | Track length in milliseconds. |
| `explicit` | Explicit content flag. |
| `danceability` … `time_signature` | Spotify audio features (0–1 or Hz / BPM scales as in the source API export). |
| `track_genre` | Genre label (cleaned). |

An index column such as `Unnamed: 0` may appear if inherited from the raw CSV.
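The whitespace rules listed under Cleaning can be sketched in pandas roughly as follows. This is a minimal illustration and not the contents of `01a_clean_spotify.ipynb`: the helper name `clean_text_columns` and the toy rows are hypothetical, casing normalization is left out because the card does not state which rule the notebook applies, and dropping `Unnamed: 0` is shown only as an optional downstream step.

```python
import pandas as pd

# Text columns named in the card's cleaning summary.
TEXT_COLUMNS = ["artists", "album_name", "track_name", "track_genre"]


def clean_text_columns(df: pd.DataFrame, columns=TEXT_COLUMNS) -> pd.DataFrame:
    """Strip and collapse whitespace; turn empty strings into missing values.

    Hypothetical helper for illustration; the notebook may differ.
    """
    out = df.copy()
    for col in columns:
        if col not in out.columns:
            continue  # tolerate exports that lack a column
        s = out[col].astype("string")
        # Trim the ends, then collapse any run of whitespace to one space.
        s = s.str.strip().str.replace(r"\s+", " ", regex=True)
        # Empty strings become missing values.
        is_empty = s.eq("").fillna(False)
        out[col] = s.mask(is_empty, pd.NA)
    return out


# Toy rows (made up) standing in for the raw CSV.
raw = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "track_id": ["a1", "b2", "c3"],
        "artists": ["  Daft   Punk ", "", None],
        "track_name": ["One  More Time", "  ", "Untitled"],
    }
)

# Optional: drop the inherited index column if it is present.
cleaned = clean_text_columns(raw).drop(columns=["Unnamed: 0"], errors="ignore")
```

Passing `errors="ignore"` keeps the same call safe on exports that never had the index column.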
A full Kaggle-style export is often **~114k** rows; if so, use Hub `size_categories` **`100K<n<1M`** instead of the frontmatter value above (adjust the YAML to match your actual file).

## Usage

```python
import pandas as pd

df = pd.read_csv("spotify_tracks_cleaned.csv")
```

Use `track_id` to align with lyrics, YouTube, or audio-feature tables in the same project.

## Limitations

- **Not** real-time Spotify data; snapshot reflects the date of the raw pull.
- **License / ToS:** Redistribution must comply with **CDLA-Sharing-1.0** (this card), the original dataset license, and Spotify’s terms of use for derived datasets.

## Citation

If you use this artifact in research, cite the original Spotify / Kaggle dataset you built from, and reference your cleaning notebook or repository revision.
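As a sketch of the `track_id` alignment mentioned under Usage, the example below left-joins a tracks table to a lyrics table. Both frames are toy data, and the `lyrics` table with its column names is hypothetical, since the card does not describe the companion tables. In practice `tracks` would come from `pd.read_csv("spotify_tracks_cleaned.csv")`.

```python
import pandas as pd

# Toy stand-in for the cleaned tracks file (one row per track).
tracks = pd.DataFrame(
    {
        "track_id": ["a1", "b2", "c3"],
        "track_name": ["Song A", "Song B", "Song C"],
        "popularity": [73, 55, 12],
    }
)

# Hypothetical lyrics table keyed on the same identifier.
lyrics = pd.DataFrame(
    {
        "track_id": ["a1", "c3", "z9"],
        "lyrics": ["la la la", "na na na", "unmatched"],
    }
)

# Left join keeps every track. validate= raises if either side has
# duplicate keys, and indicator= adds a "_merge" column showing which
# rows found a match.
merged = tracks.merge(
    lyrics,
    on="track_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

matched = int((merged["_merge"] == "both").sum())
print(f"{matched} of {len(merged)} tracks have lyrics")
```

The `validate="one_to_one"` check relies on the card's statement that rows are unique tracks; if the companion table can hold several rows per track, `"one_to_many"` would be the looser choice.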
Provider:
vancenceho