vancenceho/youtube-features-clean
Source: Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/vancenceho/youtube-features-clean
Description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: YouTube Audio Features Cleaned
size_categories:
- 10K<n<100K
---
# YouTube Audio Features Cleaned
Tabular **YouTube-side features** for tracks that have been matched to **Spotify** records in the *viral-content-predictor* pipeline. Each row carries keys for alignment with the Spotify, audio, and lyrics tables; columns combine metadata and numeric signals used for engagement or popularity modeling (the exact schema depends on the exporting notebook revision).
## File
| File | Role |
|------|------|
| `youtube_features_cleaned.csv` | One row per matched track; cleaned dtypes and selected columns for baseline / ensemble notebooks. |
## Source
- **Upstream:** YouTube metadata and/or engagement fields collected after resolving **YouTube ↔ Spotify** matches (see notebooks under `notebooks/` for matching, downloads, and feature assembly).
- **Output location:** `data/processed/`. Downstream notebooks such as `02e_youtube_model_baseline.ipynb` and `explore_youtube_engagement_ensemble.ipynb` read the file as `../data/processed/youtube_features_cleaned.csv`.
## Schema (typical)
Expect on the order of **tens to low hundreds of columns**, including:
- Identifiers and join keys (e.g. `track_id` or aligned IDs) to link to Spotify and other project tables.
- **YouTube**-derived fields (titles, IDs, timestamps, engagement statistics where present).
- Engineered or cleaned numeric/categorical features ready for `scikit-learn` pipelines.
Exact names and counts change with pipeline versions; inspect with:
```python
import pandas as pd
df = pd.read_csv("youtube_features_cleaned.csv", nrows=5)
print(df.columns.tolist())
```
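Once the key column is known, linking to a Spotify table is a plain merge. The following is a minimal sketch rather than the pipeline's actual join: it assumes the key is called `track_id` (the card only gives that name as an example), and the two small frames and their `view_count` and `popularity` columns are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables; `track_id` is an assumed join key.
youtube = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "view_count": [100, 250, 40],   # hypothetical YouTube column
})
spotify = pd.DataFrame({
    "track_id": ["a", "b", "d"],
    "popularity": [55, 71, 12],     # hypothetical Spotify column
})

# validate="one_to_one" raises MergeError if either side repeats a key;
# indicator=True adds a `_merge` column recording where each row came from.
joined = youtube.merge(
    spotify, on="track_id", how="left", validate="one_to_one", indicator=True
)

# Rows present in the YouTube table but missing from the Spotify table.
unmatched = joined[joined["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(joined)} YouTube rows have no Spotify match")
```

A left join keeps every YouTube row, so any gap in the Spotify side shows up as `left_only` instead of silently shrinking the table.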
## Modeling notes
- Baseline notebooks may **drop engagement targets** from `X` to avoid leakage when predicting virality from other features; check which columns the notebook you use removes before reusing its feature set.
- Train/validation splits should respect any **group** or **time** structure your project defines.
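The two notes above can be sketched together. This is an illustration under stated assumptions, not the notebooks' code: every column name (`artist_id`, `duration_s`, `like_count`, `view_count`) is hypothetical, `view_count` is assumed to be the target, and grouping by artist is only one possible group structure.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame; every column name here is hypothetical.
df = pd.DataFrame({
    "track_id": [f"t{i}" for i in range(8)],
    "artist_id": ["a1", "a1", "a2", "a2", "a3", "a3", "a4", "a4"],
    "duration_s": [210, 180, 240, 200, 195, 230, 175, 220],
    "like_count": [10, 40, 5, 80, 22, 31, 9, 60],
    "view_count": [900, 3000, 400, 7000, 1500, 2100, 700, 5200],
})

target = "view_count"             # assumed prediction target
leaky = ["like_count"]            # engagement field observed alongside the target
ids = ["track_id", "artist_id"]   # identifiers, not features

# Remove the target, leaky engagement fields, and identifiers from the features.
X = df.drop(columns=[target, *leaky, *ids])
y = df[target]

# Keep all tracks by one artist on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=df["artist_id"]))

X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
```

For a time-structured project, a cutoff on an upload or crawl timestamp would play the role that `artist_id` plays here.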
## Usage
```python
import pandas as pd
df = pd.read_csv("youtube_features_cleaned.csv")
```
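Because the file is described as one row per matched track, a quick check after loading can confirm that. The sketch below assumes the key column is `track_id`; the helper `check_one_row_per_track` is hypothetical and not part of the project, and the in-memory CSV stands in for the real file.

```python
import io

import pandas as pd


def check_one_row_per_track(df: pd.DataFrame, key: str = "track_id") -> pd.DataFrame:
    """Return rows whose key is missing or repeated (empty if the frame is clean)."""
    bad = df[key].isna() | df.duplicated(subset=key, keep=False)
    return df[bad]


# Stand-in for pd.read_csv("youtube_features_cleaned.csv"); the contents are made up.
sample = io.StringIO("track_id,view_count\na,100\nb,250\nb,260\n")
df = pd.read_csv(sample)

# Any rows returned here violate the one-row-per-track expectation.
problems = check_one_row_per_track(df)
print(problems)
```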
## Limitations
- **Snapshot:** reflects the crawl / export date, not live YouTube counts.
- **Coverage:** only tracks with a successful match and feature row in the pipeline.
- **License:** comply with **CDLA-Sharing-1.0** (this card), YouTube’s terms, and the licenses of any third-party datasets that were combined to build the file.
## Citation
Cite this repository and the YouTube / Spotify data sources you used, plus the notebook hash or release tag that produced the CSV.
Provider:
vancenceho



