
Noam12345/my_first_asignment_ds

Hugging Face · Updated 2026-04-11 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Noam12345/my_first_asignment_ds
Resource description:
---
license: mit
task_categories:
- tabular-classification
language:
- en
tags:
- music
- spotify
- audio-features
- eda
- popularity
- hit-prediction
size_categories:
- 10K<n<100K
---

# 🎵 What Makes a Hit Song? Spotify EDA

> **Assignment #1: EDA & Dataset | March 2026**

---

## 📽️ Video Presentation

*[Add your video link here]*

---

## 📌 Overview

This project explores the **30,000 Spotify Songs** dataset to answer one core question:

> **How do audio features (danceability, energy, loudness, speechiness, tempo, acousticness, instrumentalness, liveness, valence, and duration) affect the popularity of a song?**

The goal is to identify the optimal range of audio features that music producers should aim for to increase the chances of their song becoming a hit.

---

## 📂 Dataset

| Property | Details |
|---|---|
| **Source** | [Kaggle: 30000 Spotify Songs](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs) |
| **Raw Size** | 32,833 rows × 23 features |
| **Final Size** | 32,790 rows × 27 features (after cleaning + feature engineering) |
| **Target Variable** | `track_popularity` (0–100) |
| **Hit Definition** | `track_popularity > 75` (top ~8.1% of all tracks) |

### Audio Features Explained

| Feature | What It Means | Example |
|---|---|---|
| `danceability` | How easy it is to dance to (0.0–1.0) | Low = slow ballad, High = club banger |
| `energy` | Perceptual intensity and activity (0.0–1.0) | Death metal = high, Bach prelude = low |
| `loudness` | Average loudness in decibels (dB) | Closer to 0 = louder, −60 = almost silent |
| `speechiness` | Presence of spoken words (0.0–1.0) | >0.66 = mostly speech, <0.33 = music |
| `acousticness` | Confidence that a track is acoustic (0.0–1.0) | Guitar unplugged = high, EDM synths = low |
| `instrumentalness` | Likelihood the track has no vocals (0.0–1.0) | Jazz instrumental = high, pop singer = low |
| `liveness` | Presence of a live audience (0.0–1.0) | >0.8 = likely live recording |
| `valence` | Musical positiveness (0.0–1.0) | Happy/cheerful = high, sad/angry = low |
| `tempo` | Estimated tempo in BPM | 60 BPM = slow, 180 BPM = very fast |
| `duration_min` | Song length in minutes | 2 min = short, 6 min = long |

---

## 🛠️ Setup

```python
# Colab-only helper for mounting Google Drive
from google.colab import drive

# Data handling
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
```

Developed in **Google Colab** with the dataset loaded from Google Drive.

---

## 🧹 Data Cleaning

**Missing Values:** Found 5 rows with missing `track_name`, `track_artist`, and `track_album_name`; all removed.

**Duplicates:** None found, so no action was needed.

**Impossible Values:**
- `tempo = 0 BPM` → dropped (1 track). A song with 0 BPM cannot exist.
- `loudness > 0 dB` → dropped (6 tracks). Industry standard caps loudness at 0 dB.

**Final shape after cleaning: (32,790, 27)**

---

## ⚙️ Feature Engineering

Three new features were created:
- **`release_year`**: extracted from `track_album_release_date` (covers 1957–2020)
- **`is_hit`**: `True` if `track_popularity > 75` (top ~8.1% of tracks by popularity score)
- **`loudness_scaled`**: loudness rescaled from dB (−60 to 0) to a 0–100 range for readability

---

## 📊 Visualizations & Key Findings

### 1. Outlier Detection: All 10 Audio Features

![Outlier Detection](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/OUTLIERS.png)

Box plots across all 10 features reveal the spread and outliers in the dataset. Key observations:
- **Tempo** clusters tightly between 100–130 BPM
- **Instrumentalness** is nearly flat near 0; ~90% of tracks have vocals
- **Speechiness & Liveness** are heavily right-skewed; most tracks are studio recordings with minimal spoken word
- **Valence** is evenly spread, reflecting a healthy mix of happy and sad music

All outliers were **kept**: extreme values (e.g. 10-min songs, instrumental tracks) are valid artistic choices, not data errors.

---

### 2. Hit Rate Threshold Sensitivity

![Hit Rate Threshold](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/HIT_RATE_THRESH.png)

To define "hit" objectively, every threshold from 60 to 85 was tested. Results:

| Threshold | Hit Rate | Decision |
|---|---|---|
| 60 | 27.2% | Too generous: over a quarter of songs can't all be hits |
| 65 | 20.1% | Still too broad |
| 70 | 13.5% | Getting closer |
| **75** | **8.1%** | ✅ Selected: meaningful minority, large enough to analyze |
| 80 | 4.1% | Too strict: loses too much signal |
| 85 | 1.9% | Only mega-hits qualify |

**Threshold = 75** was selected as the sweet spot.

---

### 3. Popularity Distribution & Hit/Non-Hit Split

![Popularity Distribution](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/HIT_THRESH_VALID.png)

The popularity distribution shows a natural drop-off above 70, confirming that 75 sits right at the edge of the top tier.

Result: **2,657 hits (8.1%)** vs **30,133 non-hits (91.9%)**. For every 1 hit, there are ~11 non-hit songs.

---

### 4. Hits vs Non-Hits: Median Feature Comparison

![Median Comparison](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/MEDIAN_COMP.png)

All 10 features normalized to 0–1 and compared between hits and non-hits:
- **↑ Danceability**: hits score consistently higher. Groove matters.
- **↑ Loudness**: hits are louder. Tight mastering is industry standard.
- **↑ Acousticness**: hits have a slightly more organic sound.
- **↑ Valence**: hits trend happier. Feel-good music has a streaming edge.
- **↓ Speechiness**: hits are more melodic, less spoken-word.
- **↓ Instrumentalness**: hits almost always have vocals.
- **↓ Liveness**: hits are studio recordings, not live performances.
- **↓ Tempo**: hits are slightly slower (~118 BPM vs ~122 BPM average).
- **↓ Duration**: hits are shorter. Streaming rewards conciseness.
- **~ Energy**: similar between groups; hits are slightly lower in energy.

---

### 5. Correlation Heatmap: All Audio Features

![Correlation Heatmap](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/HEATMAP.png)

Key relationships revealed by the heatmap:
- **Energy ↔ Loudness (`r = +0.68`)**: the strongest link. Louder songs almost always feel more energetic.
- **Energy ↔ Acousticness (`r = −0.54`)**: acoustic tracks feel calmer and less intense.
- **Loudness ↔ Acousticness (`r = −0.36`)**: louder songs are less acoustic.
- **Danceability ↔ Valence (`r = +0.33`)**: happier songs tend to be more danceable.
- **Instrumentalness ↔ Popularity (`r = −0.15`)**: the clearest predictor: vocals = popularity.
- **Duration ↔ Popularity (`r = −0.14`)**: shorter songs perform better.

All other features vs popularity show correlations below 0.10; no single feature is a magic predictor on its own.

---

### 6. Audio Features vs Track Popularity: Scatter Plots

![Scatter Plots](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/SCATER.png)

Each feature plotted against popularity with a linear trend line (`r` values shown):
- **Instrumentalness (`r = −0.15`)** and **Duration (`r = −0.14`)** show the clearest negative trends.
- **Acousticness (`r = +0.09`)** is the strongest positive predictor; a touch of organic sound helps.
- **Tempo (`r ≈ 0.00`)** is essentially flat; tempo alone does not drive popularity.
- **Energy, Speechiness, Liveness, Valence** all show weak but directionally consistent trends.

---

### 7. Music Trends Over Time

![Music Trends Over Time](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/TIME.png)

Hit rates peaked in the mid-1960s (~17%) and surged again post-2015, reaching ~14% by 2019.
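
A per-year hit rate like the one plotted here could be computed roughly as follows. This is a minimal sketch, not the notebook's actual code: it assumes a DataFrame with the `track_popularity` and `release_year` columns described in this card, and the helper name `hit_rate_by_year` and the toy data are made up for illustration.

```python
import pandas as pd


def hit_rate_by_year(df: pd.DataFrame, threshold: int = 75) -> pd.Series:
    """Share of tracks per release year whose popularity exceeds the threshold."""
    is_hit = df["track_popularity"] > threshold
    # The mean of a boolean Series is the fraction of True values in each group.
    return is_hit.groupby(df["release_year"]).mean()


# Toy data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame(
    {
        "release_year": [1965, 1965, 2019, 2019, 2019, 2019],
        "track_popularity": [80, 50, 90, 10, 20, 30],
    }
)
rates = hit_rate_by_year(toy)
print(rates)
```

Years with very few tracks produce noisy rates, so it may be worth checking the per-year track count before reading much into an early peak.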

**Why recent songs score higher:** Spotify's popularity score is recency-weighted: it favors tracks with recent, high-volume streaming activity over all-time totals. Combined with the viral amplification of social media (TikTok, Instagram Reels), newer songs accumulate streams far faster than songs from previous decades could. This is a measurement artifact, not evidence that modern music is objectively better.

---

### 8. Hit Ratio by Genre

![Hit Ratio by Genre](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/GENERE.png)

| Genre | Hit Ratio | Notes |
|---|---|---|
| **Pop** | **0.157** | Nearly double the overall average; designed for mass appeal |
| **Latin** | **0.138** | Strong global wave (Bad Bunny, J Balvin) shows in the data |
| **R&B** | 0.098 | Above average; loyal streaming audience |
| **Rap** | 0.060 | Below average |
| **EDM** | 0.044 | Lowest; EDM success lives in festivals, not Spotify streams |

---

### 9. Sweet Spot by Genre: KDE Analysis

![Sweet Spot by Genre](https://huggingface.co/datasets/Noam12345/my_first_asignment_ds/resolve/main/PHOTOS_NEW/SWEET_SPOT.png)

KDE plots showing where hit songs cluster for **Pop, Latin, and R&B** across all 10 features:

**Universal across all genres:**
- High energy (0.6–0.8), high loudness, near-zero instrumentalness and speechiness, low liveness, short duration (3–4 min)

**Genre-specific differences:**
- **Tempo:** Pop/R&B peak at 95–125 BPM; Latin shows a second peak at ~175 BPM (reggaeton)
- **Danceability:** Latin requires the highest (~0.78), R&B allows more variation
- **Valence:** Pop and Latin lean happier; R&B spans a wider emotional range
- **Acousticness:** R&B allows more organic elements than Pop

---

## ✅ Summary & Conclusions

| Feature | Direction | Hit Sweet Spot |
|---|---|---|
| Danceability | ↑ positive | 0.70–0.85 |
| Energy | ~ neutral | 0.60–0.80 |
| Loudness | ↑ positive | 85–90 (scaled) |
| Speechiness | ↓ negative | < 0.10 |
| Tempo | ~ irrelevant | Genre-dependent |
| Acousticness | ↑ slight positive | Slightly above average |
| Instrumentalness | ↓ strong negative | ~0.0 (vocals essential) |
| Liveness | ↓ negative | < 0.15 (studio only) |
| Valence | ↑ slight positive | Slightly above average |
| Duration | ↓ negative | 3–4 minutes |

**Genre matters more than any single feature.** Pop and Latin produce hits at more than three times the rate of EDM (0.157 and 0.138 vs 0.044). The real differentiation within a genre comes down to tempo and valence; that's where genre identity lives.

---

## 👤 Author

**Noam Fuchs**, Data Science Assignment #1, March 2026
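
For reference, the two aggregate computations this card leans on most, the hit rate at different thresholds and the hit ratio per genre, can be sketched in a few lines of pandas. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the genre column is assumed to be called `playlist_genre` (the card does not name it), and the toy values are invented.

```python
import pandas as pd


def hit_rate_by_threshold(popularity: pd.Series, thresholds) -> pd.Series:
    """Fraction of tracks strictly above each candidate popularity threshold."""
    return pd.Series(
        {t: float((popularity > t).mean()) for t in thresholds}, name="hit_rate"
    )


def hit_ratio_by_genre(
    df: pd.DataFrame, threshold: int = 75, genre_col: str = "playlist_genre"
) -> pd.Series:
    """Fraction of hits within each genre, highest first.

    genre_col is an assumption: the card does not state the column name.
    """
    is_hit = df["track_popularity"] > threshold
    return is_hit.groupby(df[genre_col]).mean().sort_values(ascending=False)


# Toy data (hypothetical values, not taken from the real dataset).
toy = pd.DataFrame(
    {
        "playlist_genre": ["pop", "pop", "edm", "edm"],
        "track_popularity": [80, 90, 70, 20],
    }
)
sweep = hit_rate_by_threshold(toy["track_popularity"], [60, 75, 85])
by_genre = hit_ratio_by_genre(toy)
print(sweep)
print(by_genre)
```

Using a strict `>` matches the card's hit definition (`track_popularity > 75`); switching to `>=` would shift every rate slightly.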

Provider:
Noam12345
Collected Summary
Dataset Introduction
Construction Method
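In music information retrieval, the quality of a dataset directly affects the reliability of analytical conclusions. This dataset is derived from the original "30,000 Spotify Songs" collection on Kaggle, which initially contains 32,833 songs with 23 features. Construction involved data cleaning and feature engineering. Records missing key metadata were removed first, along with data points that are physically implausible (such as zero tempo or out-of-range loudness values). Four derived features were then added by extracting the release year, defining a hit threshold (popularity > 75), and rescaling loudness and duration. Finally, redundant date and subgenre columns were dropped, yielding a clean tabular dataset of 32,790 songs and 25 features that serves as the basis for subsequent exploratory analysis.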
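The construction steps described above can be sketched as a single pandas function. This is a minimal sketch, not the author's code. Several details are assumptions: the raw duration column is taken to be `duration_ms`, loudness is rescaled linearly from [−60, 0] dB to [0, 100], and the function name and toy rows are invented for illustration.

```python
import pandas as pd


def clean_and_engineer(raw: pd.DataFrame, hit_threshold: int = 75) -> pd.DataFrame:
    """Drop invalid rows, then add the derived columns described in the text."""
    # Remove records missing key metadata.
    df = raw.dropna(subset=["track_name", "track_artist", "track_album_name"])
    # Remove physically implausible values: zero tempo, loudness above 0 dB.
    df = df[(df["tempo"] > 0) & (df["loudness"] <= 0)].copy()
    # Release dates may be full dates or bare years; the first 4 characters are the year.
    df["release_year"] = df["track_album_release_date"].astype(str).str[:4].astype(int)
    df["is_hit"] = df["track_popularity"] > hit_threshold
    # Assumed linear rescale of dB in [-60, 0] to [0, 100].
    df["loudness_scaled"] = (df["loudness"].clip(lower=-60) + 60) / 60 * 100
    # Assumed source column: duration in milliseconds.
    df["duration_min"] = df["duration_ms"] / 60_000
    return df


# Toy rows (hypothetical): two valid tracks and three that should be dropped.
toy = pd.DataFrame(
    {
        "track_name": ["A", None, "C", "D", "E"],
        "track_artist": ["x", "y", "z", "w", "v"],
        "track_album_name": ["a1", "a2", "a3", "a4", "a5"],
        "track_album_release_date": ["2019-06-14", "2018-01-01", "2017-05-05", "2016-02-02", "1999"],
        "track_popularity": [80, 90, 60, 70, 40],
        "tempo": [120.0, 100.0, 0.0, 110.0, 95.0],
        "loudness": [-6.0, -5.0, -7.0, 1.2, -12.0],
        "duration_ms": [210_000, 200_000, 190_000, 180_000, 180_000],
    }
)
cleaned = clean_and_engineer(toy)
print(cleaned[["release_year", "is_hit", "loudness_scaled", "duration_min"]])
```

Dropping the redundant date and subgenre columns mentioned in the text is left out because the exact column list is not given.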
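Usage
This dataset is suited to several research directions, including music popularity prediction, audio feature analysis, and music information retrieval. Users can first load the data, use the provided audio features as independent variables, and take either the continuous `track_popularity` value or the binary `is_hit` label as the prediction target to build regression or classification models. For exploratory analysis, the visualization methods of the original study can be reused: box plots to inspect feature distributions, a heatmap to analyze correlations between features, or kernel density estimation to explore the feature "sweet spots" of hit songs in different genres. Note the recency bias in the data: recent songs have a higher popularity baseline because of the streaming platform's algorithm, so modeling should include the year as a covariate or use stratified sampling to keep conclusions robust.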
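One way to act on the recency-bias note is to hold out a test set that is stratified by release year, so every year is represented in the same proportion in both splits. The sketch below is illustrative only: the helper name `stratified_holdout` and the toy frame are hypothetical, and it assumes a `release_year` column and a unique index.

```python
import pandas as pd


def stratified_holdout(
    df: pd.DataFrame, by: str = "release_year", frac: float = 0.2, seed: int = 0
):
    """Sample the same fraction of rows from every group as the test set."""
    test = df.groupby(by, group_keys=False).sample(frac=frac, random_state=seed)
    # Everything not sampled stays in the training set (requires a unique index).
    train = df.drop(test.index)
    return train, test


# Toy data (hypothetical): 10 tracks from 1990 and 20 tracks from 2019.
toy = pd.DataFrame(
    {
        "release_year": [1990] * 10 + [2019] * 20,
        "track_popularity": list(range(30)),
    }
)
train, test = stratified_holdout(toy)
print(test["release_year"].value_counts())
```

Including `release_year` as a model feature is the other option the text mentions; the two approaches can be combined.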
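Background and Challenges
Background Overview
In music information retrieval and computational musicology, predicting song popularity is a long-standing core research problem. The my_first_asignment_ds dataset originates from a 2026 academic assignment and was built by Noam Fuchs; its underlying data comes from the public "30,000 Spotify Songs" collection on Kaggle. The dataset aims to systematically explore the quantitative relationship between Spotify audio features (such as danceability, energy, loudness, and valence) and song popularity, in order to provide data-driven insights for music production. Through feature engineering and visual analysis, the study deepens the understanding of the audio feature patterns of "hit songs" and offers an empirical reference for data-informed decisions in the music industry.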
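Current Challenges
The dataset addresses the complex problem of music popularity prediction. The core challenge is that popularity is a dynamic concept shaped by culture, era, platform algorithms, and user behavior, so static audio features alone cannot fully capture it. Building the dataset involved challenges in both data quality and definitions. On one hand, missing values and abnormal values (such as zero BPM or out-of-range loudness) in the raw data had to be handled, and a reasonable "hit" threshold had to be defined. On the other hand, the data carries inherent biases, such as the weighting of Spotify's popularity score toward recent songs and the uneven number of songs across decades and genres. These factors can affect model generalization and how widely the conclusions apply.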
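Common Scenarios
Classic Use Cases
In music information retrieval and computational musicology, this dataset is commonly used to explore the quantitative relationship between audio features and song popularity. By analyzing ten audio features such as danceability, energy, loudness, and acousticness, researchers build predictive models to identify latent patterns in hit songs. A classic application is binary classification of whether a song becomes a hit, using supervised learning algorithms such as logistic regression or random forests, which reveals the ranges of acoustic parameters that can be tuned in music production.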
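Academic Problems Addressed
The dataset supports feature-importance analysis in music popularity prediction and provides an empirical basis for understanding which audio attributes correlate with commercial success. It moves beyond the subjective, qualitative limits of traditional music research: a large-scale, data-driven approach surfaces statistical regularities such as the negative correlations of instrumentalness and duration with popularity. These findings are relevant to academic directions including music recommendation systems, automated music production, and the quantitative analysis of cultural trends.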
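Practical Applications
In music industry practice, the dataset gives producers and streaming platforms data-driven decision support. Producers can adjust arrangement and mixing according to the feature sweet-spot ranges, for example by raising danceability and keeping duration at 3 to 4 minutes, to make a song more competitive. Platforms can use the same findings to tune recommendation algorithms, surface content that matches current trends, and improve user engagement and retention.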
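Recent Research on the Dataset
Latest Research Directions
In music information retrieval, datasets based on Spotify audio features are driving deeper research into music popularity prediction. The current focus has shifted from single-feature analysis to multimodal fusion and dynamic modeling, combined with real-time feedback from streaming platforms, to explore how the features of short-term breakouts differ from those of long-term classics. With generative AI now applied to music creation, researchers have begun using such datasets to train models that identify success patterns in emerging genres such as Hyperpop or Drill, while also attending to algorithmic fairness so that recommendation systems do not become biased against particular cultures or styles. This work provides data-driven guidance for the music industry and has also prompted ethical discussion about the balance between artistic value and data metrics.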
The above content was collected and summarized by 遇见数据集.