vancenceho/vcp-combined-features

Name: vancenceho/vcp-combined-features
Creator: vancenceho
Published: 2026-04-20 12:39:32
License: (no description provided)

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/vancenceho/vcp-combined-features


Description:

---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: Spotify-YouTube Combined Ensemble Features
size_categories:
- 10K<n<100K
---

# Spotify–YouTube Combined Ensemble Features

A **single-table, modeling-ready** CSV that **joins Spotify track metadata**, **librosa audio features** (from matched YouTube audio), and **YouTube engagement** fields on a common key (`track_id`). Built for **ensemble / viral prediction** experiments in the *viral-content-predictor* project (e.g. `03_combined_model_training.ipynb`).

## File

| File | Role |
|------|------|
| `combined_features_cleaned.csv` | One row per track (after pipeline joins); mixed numeric, categorical encodings, and labels. |

## Typical contents (schema evolves with the pipeline)

A representative export has on the order of **10⁴–10⁵ rows** and **100+ columns**, including:

| Group | Examples |
|-------|----------|
| **IDs / text** | `track_id`, `track_name`, `artists` |
| **Spotify** | `popularity`, `loudness`, `valence`, `danceability`, `energy`, `tempo_spotify`, `speechiness`, `liveness`, `acousticness`, `instrumentalness`, one-hot `explicit_*`, `mode_*`, `time_signature_*` |
| **Audio (librosa)** | Spectral, MFCC, chroma, tonnetz, onset, ZCR, `tempo_librosa`, etc. |
| **YouTube engagement** | `view_count`, `like_count`, `comment_count`, derived rates such as `like_rate`, `comment_rate` |
| **Targets** | `viral` (binary), `virality_score` (or project-specific label columns) |

Exact names and counts depend on the notebook version that produced the file. Inspect with:

```python
import pandas as pd

df = pd.read_csv("combined_features_cleaned.csv", nrows=5)
print(df.shape[1], "columns")
print(df.columns.tolist())
```

## Provenance

- **Assembled** from cleaned Spotify tables, **YouTube**-aligned metadata/features, and **audio feature** extractions already aligned in the project DAG.
- **Consumers:** `notebooks/03_combined_model_training.ipynb`, `notebooks/exploratory/explore_combined_enesmble_voting.ipynb`, etc., reading `data/processed/combined_features_cleaned.csv`.

## Modeling notes (leakage)

The table may include **YouTube engagement** columns that **directly relate** to how “viral” was defined. For many experiments those columns are **excluded from `X`** when training the viral classifier so labels are not trivially predictable; see the **exclude list** in `03_combined_model_training.ipynb`. Keep or drop columns according to your task.

## Usage

```python
import pandas as pd

df = pd.read_csv("combined_features_cleaned.csv")
```

## Limitations

- **Snapshot:** reflects the pipeline run that produced it, not live Spotify/YouTube.
- **License / rights:** comply with **CDLA-Sharing-1.0** (this card), Spotify and YouTube terms, and any third-party dataset licenses you merged.

## Citation

Cite this repository and the specific notebook revision or release tag used to build the CSV.
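
## Example: excluding leakage-prone columns

The leakage note in this card can be illustrated with a minimal sketch. It assumes the column names listed as examples in this card; the actual exclude list lives in `03_combined_model_training.ipynb` and may differ. The helper `split_features_and_target` and the synthetic `demo` frame are hypothetical and only stand in for the real CSV.

```python
import pandas as pd

# Assumed column names, taken from the examples in this card.
# The project's real exclude list is in 03_combined_model_training.ipynb.
ID_COLS = ["track_id", "track_name", "artists"]
ENGAGEMENT_COLS = [
    "view_count", "like_count", "comment_count", "like_rate", "comment_rate",
]
LABEL_COLS = ["viral", "virality_score"]


def split_features_and_target(df, target="viral"):
    """Return (X, y), dropping ID, engagement, and label columns from X."""
    excluded = set(ID_COLS + ENGAGEMENT_COLS + LABEL_COLS)
    feature_cols = [c for c in df.columns if c not in excluded]
    # Keep only numeric and boolean (one-hot) columns as model inputs.
    X = df[feature_cols].select_dtypes(include=["number", "bool"])
    y = df[target]
    return X, y


# Tiny synthetic frame standing in for combined_features_cleaned.csv.
demo = pd.DataFrame({
    "track_id": ["a", "b"],
    "popularity": [70, 12],
    "energy": [0.8, 0.3],
    "explicit_True": [True, False],
    "view_count": [1_000_000, 250],
    "like_rate": [0.05, 0.01],
    "viral": [1, 0],
})

X, y = split_features_and_target(demo)
print(X.columns.tolist())
```

Filtering by an explicit exclude list, rather than by an allow list, keeps the sketch robust to the schema changes this card warns about, but it also means any newly added engagement-derived column has to be appended to `ENGAGEMENT_COLS` by hand.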

