vancenceho/youtube-features-clean

Name: vancenceho/youtube-features-clean
Creator: vancenceho
Published: 2026-04-20 12:34:36
License: No description provided

Hugging Face | Updated 2026-04-20 | Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/vancenceho/youtube-features-clean


Resource description:

---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: YouTube Audio Features Cleaned
size_categories:
- 10K<n<100K
---

# YouTube Audio Features Cleaned

Tabular **YouTube-side features** for tracks that have been matched to **Spotify** records in the *viral-content-predictor* pipeline. Rows are keyed for alignment with Spotify / audio / lyrics tables; columns combine metadata and numeric signals used for engagement or popularity modeling (exact schema depends on the exporting notebook revision).

## File

| File | Role |
|------|------|
| `youtube_features_cleaned.csv` | One row per matched track; cleaned dtypes and selected columns for baseline / ensemble notebooks. |

## Source

- **Upstream:** YouTube metadata and/or engagement fields collected after resolving **YouTube ↔ Spotify** matches (see notebooks under `notebooks/` for matching, downloads, and feature assembly).
- **Output location:** `data/processed/`, consumed e.g. by `02e_youtube_model_baseline.ipynb` and `explore_youtube_engagement_ensemble.ipynb` as `../data/processed/youtube_features_cleaned.csv`.

## Schema (typical)

Expect on the order of **tens to low hundreds of columns**, including:

- Identifiers and join keys (e.g. `track_id` or aligned IDs) to link to Spotify and other project tables.
- **YouTube**-derived fields (titles, IDs, timestamps, engagement statistics where present).
- Engineered or cleaned numeric/categorical features ready for `scikit-learn` pipelines.

Exact names and counts change with pipeline versions; inspect with:

```python
import pandas as pd

df = pd.read_csv("youtube_features_cleaned.csv", nrows=5)
print(df.columns.tolist())
```

## Modeling notes

- Baseline notebooks may **drop engagement targets** from `X` to avoid leakage when predicting virality from other features; follow the notebook you use.
- Train/validation splits should respect any **group** or **time** structure your project defines.
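The two modeling notes can be sketched together as follows. This is a minimal illustration, not the pipeline's actual code: the toy frame stands in for `youtube_features_cleaned.csv`, and every column name other than `track_id` (`title_length`, `duration_s`, `view_count`, `like_count`), the choice of target, and the helper `split_without_leakage` are assumptions made for the example. It uses scikit-learn's `GroupShuffleSplit` so that all rows sharing a group key land on the same side of the split.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame standing in for youtube_features_cleaned.csv.
# All column names below are hypothetical; check the real schema first.
df = pd.DataFrame({
    "track_id": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "title_length": [12, 12, 30, 31, 8, 9, 22, 22],
    "duration_s": [180, 181, 240, 240, 95, 96, 300, 301],
    "view_count": [1000, 1100, 50, 55, 900000, 910000, 4000, 4100],
    "like_count": [10, 12, 1, 1, 40000, 41000, 90, 95],
})

TARGET = "view_count"    # assumed prediction target
LEAKY = ["like_count"]   # assumed engagement fields that would leak the target
GROUP = "track_id"       # assumed grouping key


def split_without_leakage(frame, target, leaky, group, test_size=0.25, seed=0):
    """Drop target/leaky/group columns from X, then split by group."""
    X = frame.drop(columns=[target, *leaky, group])
    y = frame[target]
    groups = frame[group]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, val_idx = next(splitter.split(X, y, groups=groups))
    return (
        X.iloc[train_idx], X.iloc[val_idx],
        y.iloc[train_idx], y.iloc[val_idx],
        groups.iloc[train_idx], groups.iloc[val_idx],
    )


X_train, X_val, y_train, y_val, g_train, g_val = split_without_leakage(
    df, TARGET, LEAKY, GROUP
)
print(X_train.columns.tolist())
print(sorted(set(g_train)), sorted(set(g_val)))
```

If your project defines a time structure instead of a group structure, a chronological cutoff on a timestamp column would replace the group splitter; which one applies depends on the notebook you follow.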
## Usage

```python
import pandas as pd

df = pd.read_csv("youtube_features_cleaned.csv")
```

## Limitations

- **Snapshot:** reflects the crawl / export date, not live YouTube counts.
- **Coverage:** only tracks with a successful match and feature row in the pipeline.
- **License:** comply with **CDLA-Sharing-1.0** (this card), YouTube’s terms, and third-party dataset licenses you combined to build the file.

## Citation

Cite this repository and the YouTube / Spotify data sources you used, plus the notebook hash or release tag that produced the CSV.

Provided by:

vancenceho
