
GildasLeDrogoff/spotify-huge-track-analysis-dataset

Hugging Face | Updated 2026-02-11 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/GildasLeDrogoff/spotify-huge-track-analysis-dataset
Resource description:
---
license: cc-by-nc-4.0
author: Gildas Le Drogoff
language:
  - en
task_categories:
  - tabular-classification
  - tabular-regression
tags:
  - spotify
  - music
  - music-analysis
  - audio-features
  - data-analysis
  - exploratory-data-analysis
  - eda
  - clustering
  - recommendation-systems
  - music-recommendation
  - regression
  - tabular-data
  - machine-learning
dataset_type: tabular
pretty_name: Spotify Huge Track Analysis Dataset
size_categories:
  - 50M-100M
dataset_summary: >
  Large-scale analytical dataset derived from Spotify, conceptually structured
  at the track level (track_id), but physically materialized at the
  track–artist level in accordance with Spotify’s credit model. It integrates
  track metadata, album context, popularity signals at the track / album /
  artist levels, temporal information, Spotify audio features, as well as
  derived comparative indicators. Only tracks with strictly positive
  popularity are included.
---

# Spotify Track Analysis Dataset

## General Description

This dataset provides a large-scale, research-oriented analytical representation of Spotify music data. It is centered on tracks as musical recordings (`track_id`), while preserving explicit artist attribution as defined by Spotify’s native credit model.

Each row corresponds to a track–artist association, identified by:

- a Spotify track identifier (`track_id`)
- a credited artist name (`artist_name`)

A single track may appear on multiple rows when it is associated with multiple artists (collaborations, featurings, compilations, collective works). This duplication is structural and intentional.

All musical characteristics, album metadata, and popularity metrics are strictly invariant for a given `track_id`. No intra-track variation exists between rows sharing the same `track_id`.

For size reduction and analytical relevance purposes, only tracks with strictly positive popularity (`track_popularity > 0`) have been retained. Tracks with no measurable exposure (zero popularity) were excluded upstream.

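
Because every track-level field is repeated on each artist row, many analyses will want one record per `track_id`. The following is a minimal sketch of that collapse on in-memory rows, not part of the dataset's tooling. The function name `collapse_to_tracks`, the `artist_names` output key, and the sample rows are all hypothetical and made up for illustration.

```python
def collapse_to_tracks(rows):
    """Group track-artist rows into one record per track_id.

    Track-level fields are copied from the first row seen for each track,
    which is safe only because the card states they are invariant per track_id.
    """
    tracks = {}
    for row in rows:
        track = tracks.get(row["track_id"])
        if track is None:
            # First row for this track: keep everything except the artist credit.
            track = {k: v for k, v in row.items() if k != "artist_name"}
            track["artist_names"] = []  # hypothetical output key
            tracks[row["track_id"]] = track
        track["artist_names"].append(row["artist_name"])
    return list(tracks.values())


# Made-up rows for illustration only; they are not taken from the dataset.
rows = [
    {"track_id": "t1", "artist_name": "Artist A", "track_popularity": 40},
    {"track_id": "t1", "artist_name": "Artist B", "track_popularity": 40},
    {"track_id": "t2", "artist_name": "Artist C", "track_popularity": 12},
]
tracks = collapse_to_tracks(rows)
```

At the dataset's real scale this grouping would more likely be pushed down into an analytical engine; the pure-Python version only shows the logic.
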
## Granularity and Data Model

The dataset strictly follows the Spotify attribution model and must be interpreted at two distinct levels.

### Conceptual Level (Analytical)

- `track_id` uniquely identifies a musical recording.
- All numerical and descriptive variables are defined and stable at this level.

### Physical Level (Dataset Rows)

- Each row corresponds to a unique `(track_id, artist_name)` pair.
- A track credited to _N_ artists appears on _N_ distinct rows.
- Rows and distinct `(track_id, artist_name)` tuples are in one-to-one correspondence, so the total number of rows equals the number of distinct tuples.

This design choice preserves the full set of artist credits without compromising analytical consistency at the track level.

## Dataset Characteristics

- 📂 File: `spotify-huge-audio-features.parquet`
- 📏 Size: 4096.96 MB
- 🧮 Total number of rows: 56,277,664
- 📊 Columns: 27
- 📦 Row groups: 57
- 🔧 Parquet version: 2.6
- 🏷️ Created by: ClickHouse version 26.1.2

The dataset is designed for batch-oriented analytical engines (DuckDB, Spark, Polars, Arrow, ClickHouse). It is not suitable for transactional or real-time workloads.

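
The two guarantees stated under Granularity and Data Model (each `(track_id, artist_name)` pair appears once, and track-level fields do not vary within a `track_id`) can be spot-checked on any sample of rows. The sketch below is one way to do that in plain Python. The helper name `check_row_model` and the sample rows are hypothetical, and a real check over all rows would normally run inside one of the engines listed above.

```python
from collections import Counter


def check_row_model(rows, track_level_fields):
    """Return (duplicate_pairs, inconsistent_track_ids) for a sample of rows."""
    # Guarantee 1: every (track_id, artist_name) pair appears exactly once.
    pair_counts = Counter((r["track_id"], r["artist_name"]) for r in rows)
    duplicate_pairs = sorted(p for p, n in pair_counts.items() if n > 1)

    # Guarantee 2: track-level fields are identical across rows of one track.
    first_seen = {}
    inconsistent = set()
    for r in rows:
        values = tuple(r[f] for f in track_level_fields)
        if first_seen.setdefault(r["track_id"], values) != values:
            inconsistent.add(r["track_id"])
    return duplicate_pairs, sorted(inconsistent)


# Made-up sample rows for illustration only.
sample = [
    {"track_id": "t1", "artist_name": "Artist A", "tempo": 120.0},
    {"track_id": "t1", "artist_name": "Artist B", "tempo": 120.0},
    {"track_id": "t2", "artist_name": "Artist C", "tempo": 98.5},
]
duplicates, inconsistent = check_row_model(sample, ["tempo"])
```

Both returned lists are empty when the sample respects the model; any entries point at the offending pairs or tracks.
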
## Data Schema

| Column name                | Parquet type | Column name               | Parquet type |
| -------------------------- | ------------ | ------------------------- | ------------ |
| track_id                   | BYTE_ARRAY   | artist_name               | BYTE_ARRAY   |
| track_name                 | BYTE_ARRAY   | album_name                | BYTE_ARRAY   |
| album_release_date         | INT32        | duration_ms               | INT32        |
| explicit                   | INT32        | track_number              | INT32        |
| disc_number                | INT32        | track_popularity          | INT32        |
| album_popularity           | INT32        | track_vs_album_popularity | DOUBLE       |
| artist_popularity          | INT32        | artist_followers          | INT64        |
| album_vs_artist_popularity | DOUBLE       | tempo                     | DOUBLE       |
| key                        | INT32        | mode                      | INT32        |
| danceability               | DOUBLE       | energy                    | DOUBLE       |
| loudness                   | DOUBLE       | speechiness               | DOUBLE       |
| acousticness               | DOUBLE       | instrumentalness          | DOUBLE       |
| liveness                   | DOUBLE       | valence                   | DOUBLE       |
| energy_danceability_score  | DOUBLE       |                           |              |

## Field Details

### Identifiers and Labels

- Spotify track identifier (`track_id`)
- Track name
- Album name
- Credited artist name

### Popularity and Audience

- Track popularity
- Album popularity
- Artist popularity
- Artist follower count

Popularity metrics are defined by Spotify and represent a single snapshot at the time of dataset construction.

### Comparative Popularity Indicators

- Relative popularity of the track compared to the album
- Relative popularity of the album compared to the artist

These indicators enable inter-artist and inter-catalog comparisons, independent of differences in notoriety scale.

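
The schema table in Data Schema can be transcribed into a lookup and compared against whatever a Parquet reader reports for a downloaded copy. The sketch below does only the comparison. `EXPECTED_SCHEMA` is copied from the table, while the helper `schema_mismatches` and its input format (a plain `{column: physical_type}` mapping) are assumptions made for illustration.

```python
# Column -> Parquet physical type, transcribed from the Data Schema table.
EXPECTED_SCHEMA = {
    "track_id": "BYTE_ARRAY", "artist_name": "BYTE_ARRAY",
    "track_name": "BYTE_ARRAY", "album_name": "BYTE_ARRAY",
    "album_release_date": "INT32", "duration_ms": "INT32",
    "explicit": "INT32", "track_number": "INT32",
    "disc_number": "INT32", "track_popularity": "INT32",
    "album_popularity": "INT32", "track_vs_album_popularity": "DOUBLE",
    "artist_popularity": "INT32", "artist_followers": "INT64",
    "album_vs_artist_popularity": "DOUBLE", "tempo": "DOUBLE",
    "key": "INT32", "mode": "INT32",
    "danceability": "DOUBLE", "energy": "DOUBLE",
    "loudness": "DOUBLE", "speechiness": "DOUBLE",
    "acousticness": "DOUBLE", "instrumentalness": "DOUBLE",
    "liveness": "DOUBLE", "valence": "DOUBLE",
    "energy_danceability_score": "DOUBLE",
}


def schema_mismatches(actual_schema):
    """Return {column: (expected_type, actual_type)} for every disagreement.

    A missing column has actual_type None; an unexpected column has
    expected_type None.
    """
    problems = {}
    for name, expected in EXPECTED_SCHEMA.items():
        actual = actual_schema.get(name)
        if actual != expected:
            problems[name] = (expected, actual)
    for name in actual_schema.keys() - EXPECTED_SCHEMA.keys():
        problems[name] = (None, actual_schema[name])
    return problems
```

An empty result means the file's columns and physical types agree with the card; the 27 entries also match the column count listed under Dataset Characteristics.
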
### Temporal Information

- Album release date

### Structural Track Metadata

- Duration (milliseconds)
- Explicit content indicator
- Track number
- Disc number

### Musical Attributes

- Tempo (BPM)
- Musical key
- Mode (major / minor)

### Spotify Audio Features

- Danceability
- Energy
- Loudness
- Speechiness
- Acousticness
- Instrumentalness
- Liveness
- Valence

### Composite Feature

- `energy_danceability_score`

Deterministic score combining energy and danceability, provided for ranking, segmentation, and exploratory analysis.

## Intended Use Cases

This dataset is intended for offline analytical workflows, including:

- Large-scale exploratory analysis of musical characteristics
- Popularity-aware clustering and segmentation
- Recommendation modeling based on audio features
- Statistical analyses linking popularity and acoustic attributes
- Comparative studies across artists, albums, and release periods
- Benchmarking of large-scale data processing pipelines

All numerical variables are stable at the `track_id` level, ensuring analytical consistency and reproducibility.

## Limitations

- Popularity metrics are time-dependent and reflect a single snapshot.
- Tracks with “zero” popularity according to Spotify are intentionally excluded.
- Multiple rows may correspond to the same track due to multi-artist credits.
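
The last limitation matters for statistics: a track credited to several artists would be counted several times if rows were used directly. As a hedged illustration of the "popularity versus acoustic attributes" use case, the sketch below keeps one observation per `track_id` before computing a Pearson correlation. The function names and the sample rows are hypothetical and invented for illustration; they say nothing about real correlations in the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def track_level_correlation(rows, x_field, y_field):
    """Correlate two fields using one observation per track_id."""
    per_track = {}
    for r in rows:
        # Keep the first row per track; later rows carry identical values.
        per_track.setdefault(r["track_id"], (r[x_field], r[y_field]))
    xs, ys = zip(*per_track.values())
    return pearson(xs, ys)


# Made-up rows: track t1 is credited to three artists.
rows = [
    {"track_id": "t1", "artist_name": "Artist A", "energy": 0.2, "track_popularity": 10},
    {"track_id": "t1", "artist_name": "Artist B", "energy": 0.2, "track_popularity": 10},
    {"track_id": "t1", "artist_name": "Artist C", "energy": 0.2, "track_popularity": 10},
    {"track_id": "t2", "artist_name": "Artist D", "energy": 0.5, "track_popularity": 40},
    {"track_id": "t3", "artist_name": "Artist E", "energy": 0.8, "track_popularity": 25},
]
r_track = track_level_correlation(rows, "energy", "track_popularity")
```

Running `pearson` over the raw rows instead would weight `t1` three times and give a different coefficient, which is the distortion the deduplication avoids.
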
Provided by:
GildasLeDrogoff