vancenceho/vcp-combined-features
Source: Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/vancenceho/vcp-combined-features
Description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify-YouTube Combined Ensemble Features
size_categories:
- 10K<n<100K
---
# Spotify–YouTube Combined Ensemble Features
A **single-table, modeling-ready** CSV that **joins Spotify track metadata**, **librosa audio features** (from matched YouTube audio), and **YouTube engagement** fields on a common key (`track_id`). Built for **ensemble / viral prediction** experiments in the *viral-content-predictor* project (e.g. `03_combined_model_training.ipynb`).
## File
| File | Role |
|------|------|
| `combined_features_cleaned.csv` | One row per track (after pipeline joins); mixed numeric, categorical encodings, and labels. |
## Typical contents (schema evolves with the pipeline)
A representative export has between **10⁴ and 10⁵ rows** (consistent with the `10K<n<100K` size category above) and **100 or more columns**, including:
| Group | Examples |
|-------|----------|
| **IDs / text** | `track_id`, `track_name`, `artists` |
| **Spotify** | `popularity`, `loudness`, `valence`, `danceability`, `energy`, `tempo_spotify`, `speechiness`, `liveness`, `acousticness`, `instrumentalness`, one-hot `explicit_*`, `mode_*`, `time_signature_*` |
| **Audio (librosa)** | Spectral, MFCC, chroma, tonnetz, onset, ZCR, `tempo_librosa`, etc. |
| **YouTube engagement** | `view_count`, `like_count`, `comment_count`, derived rates such as `like_rate`, `comment_rate` |
| **Targets** | `viral` (binary), `virality_score` (or project-specific label columns) |
Exact names and counts depend on the notebook version that produced the file—inspect with:
```python
import pandas as pd
df = pd.read_csv("combined_features_cleaned.csv", nrows=5)
print(df.shape[1], "columns")
print(df.columns.tolist())
```
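Since the one-hot groups share a prefix (`explicit_*`, `mode_*`, `time_signature_*`), one quick way to audit a given export is to bucket column names by prefix. This is a minimal sketch only: the helper `group_columns` and the toy column names are illustrative assumptions, not part of the project.

```python
import pandas as pd

# Prefixes taken from the schema table above; extend as your export requires.
ONE_HOT_PREFIXES = ("explicit_", "mode_", "time_signature_")

def group_columns(columns):
    """Bucket column names by one-hot prefix; everything else goes to 'other'."""
    groups = {prefix: [] for prefix in ONE_HOT_PREFIXES}
    groups["other"] = []
    for name in columns:
        for prefix in ONE_HOT_PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            # No prefix matched, so this is not a known one-hot column.
            groups["other"].append(name)
    return groups

# Toy frame standing in for the real CSV header (column names are hypothetical).
toy = pd.DataFrame(
    columns=["track_id", "explicit_True", "mode_1", "time_signature_4", "tempo_librosa"]
)
groups = group_columns(toy.columns)
print({key: len(names) for key, names in groups.items()})
```

With the real file, pass `df.columns` from the snippet above instead of the toy frame.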
## Provenance
- **Assembled** from cleaned Spotify tables, **YouTube**-aligned metadata/features, and **audio feature** extractions already aligned in the project DAG.
- **Consumers:** `notebooks/03_combined_model_training.ipynb`, `notebooks/exploratory/explore_combined_enesmble_voting.ipynb`, etc., reading `data/processed/combined_features_cleaned.csv`.
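
The join on `track_id` described in this card can be illustrated roughly as follows. This is a sketch under assumptions: the three toy tables, their values, and the choice of inner joins are hypothetical, and the actual pipeline may join differently.

```python
import pandas as pd

# Toy stand-ins for the three upstream tables (all values are made up).
spotify = pd.DataFrame({"track_id": ["a", "b", "c"], "popularity": [50, 70, 10]})
audio = pd.DataFrame({"track_id": ["a", "b"], "tempo_librosa": [120.0, 98.5]})
youtube = pd.DataFrame({"track_id": ["a", "b", "d"], "view_count": [1000, 250000, 42]})

# Inner joins keep only tracks present in all three sources;
# validate="one_to_one" raises if track_id is duplicated on either side.
combined = (
    spotify
    .merge(audio, on="track_id", how="inner", validate="one_to_one")
    .merge(youtube, on="track_id", how="inner", validate="one_to_one")
)
print(combined)
```

Using `validate` is a cheap guard for the "one row per track" property stated in the File table.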
## Modeling notes (leakage)
The table may include **YouTube engagement** columns that **directly relate** to how “viral” was defined. For many experiments those columns are **excluded from `X`** when training the viral classifier so labels are not trivially predictable—see the **exclude list** in `03_combined_model_training.ipynb`. Keep or drop columns according to your task.
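
A leakage-aware feature split might look like the sketch below. The `EXCLUDE` list here is an assumption assembled from the column examples in this card. The authoritative exclude list is the one in `03_combined_model_training.ipynb`, and `split_features` is a hypothetical helper.

```python
import pandas as pd

# Assumed exclude list: ID/text columns, YouTube engagement fields, and the
# alternate label. The authoritative list lives in the training notebook.
EXCLUDE = [
    "track_id", "track_name", "artists",
    "view_count", "like_count", "comment_count", "like_rate", "comment_rate",
    "virality_score",
]
TARGET = "viral"

def split_features(df):
    """Return (X, y) with excluded columns and the target removed from X."""
    drop = [col for col in EXCLUDE + [TARGET] if col in df.columns]
    X = df.drop(columns=drop)
    y = df[TARGET]
    return X, y

# Toy rows standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "track_id": ["a", "b"],
    "energy": [0.8, 0.3],
    "view_count": [10_000_000, 1200],
    "like_rate": [0.05, 0.01],
    "viral": [1, 0],
})
X, y = split_features(toy)
print(X.columns.tolist())
```

Filtering the drop list against `df.columns` keeps the helper usable across exports whose schema differs slightly.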
## Usage
```python
import pandas as pd
df = pd.read_csv("combined_features_cleaned.csv")
```
## Limitations
- **Snapshot:** reflects the pipeline run that produced it, not live Spotify/YouTube.
- **License / rights:** comply with **CDLA-Sharing-1.0** (this card), the Spotify and YouTube terms of service, and the licenses of any third-party datasets you merge in.
## Citation
Cite this repository and the specific notebook revision or release tag used to build the CSV.
Provider: vancenceho



