
vancenceho/youtube-features-clean

Hugging Face · Updated 2026-04-20 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/vancenceho/youtube-features-clean
Description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
- code
pretty_name: YouTube Audio Features Cleaned
size_categories:
- 10K<n<100K
---

# YouTube Audio Features Cleaned

Tabular **YouTube-side features** for tracks that have been matched to **Spotify** records in the *viral-content-predictor* pipeline. Rows are keyed for alignment with Spotify / audio / lyrics tables; columns combine metadata and numeric signals used for engagement or popularity modeling (exact schema depends on the exporting notebook revision).

## File

| File | Role |
|------|------|
| `youtube_features_cleaned.csv` | One row per matched track; cleaned dtypes and selected columns for baseline / ensemble notebooks. |

## Source

- **Upstream:** YouTube metadata and/or engagement fields collected after resolving **YouTube ↔ Spotify** matches (see notebooks under `notebooks/` for matching, downloads, and feature assembly).
- **Output location:** `data/processed/`, consumed e.g. by `02e_youtube_model_baseline.ipynb` and `explore_youtube_engagement_ensemble.ipynb` as `../data/processed/youtube_features_cleaned.csv`.

## Schema (typical)

Expect on the order of **tens to low hundreds of columns**, including:

- Identifiers and join keys (e.g. `track_id` or aligned IDs) to link to Spotify and other project tables.
- **YouTube**-derived fields (titles, IDs, timestamps, engagement statistics where present).
- Engineered or cleaned numeric/categorical features ready for `scikit-learn` pipelines.

Exact names and counts change with pipeline versions; inspect with:

```python
import pandas as pd

df = pd.read_csv("youtube_features_cleaned.csv", nrows=5)
print(df.columns.tolist())
```

## Modeling notes

- Baseline notebooks may **drop engagement targets** from `X` to avoid leakage when predicting virality from other features; follow the notebook you use.
- Train/validation splits should respect any **group** or **time** structure your project defines.

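
The two modeling notes above can be illustrated with a minimal sketch. It is not taken from the project notebooks: the toy frame and every column name in it (`artist_id`, `duration_s`, `title_len`, `view_count`, `like_count`) are assumptions made for illustration, and grouping by artist is only one possible choice of group structure. It drops the target and a co-measured engagement field from `X`, then uses scikit-learn's `GroupShuffleSplit` so that no group appears on both sides of the split.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for youtube_features_cleaned.csv; all column names are hypothetical.
df = pd.DataFrame({
    "track_id": [f"t{i}" for i in range(12)],
    "artist_id": ["a0", "a0", "a1", "a1", "a2", "a2",
                  "a3", "a3", "a4", "a4", "a5", "a5"],
    "duration_s": [180, 200, 240, 210, 195, 260, 230, 175, 205, 220, 250, 190],
    "title_len": [12, 30, 18, 25, 9, 40, 22, 15, 33, 27, 11, 19],
    "view_count": [1200, 3400, 560, 7800, 910, 15000,
                   430, 2200, 6100, 880, 12500, 3000],
    "like_count": [80, 210, 30, 500, 45, 900, 20, 130, 400, 60, 750, 190],
})

TARGET = "view_count"          # assumed prediction target
LEAKY = ["like_count"]         # assumed engagement field measured alongside the target
ID_COLS = ["track_id", "artist_id"]

# Keep the target and anything measured with it out of the feature matrix.
y = df[TARGET]
X = df.drop(columns=[TARGET, *LEAKY, *ID_COLS])

# Split by group so tracks from one artist never straddle train and validation.
groups = df["artist_id"]
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=groups))

train_groups = set(groups.iloc[train_idx])
valid_groups = set(groups.iloc[valid_idx])
print("features:", list(X.columns))
print("shared groups:", train_groups & valid_groups)
```

If the project defines a time structure instead, a chronological cutoff on a timestamp column would take the place of the group splitter.
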
## Usage

```python
import pandas as pd

df = pd.read_csv("youtube_features_cleaned.csv")
```

## Limitations

- **Snapshot:** reflects the crawl / export date, not live YouTube counts.
- **Coverage:** only tracks with a successful match and feature row in the pipeline.
- **License:** comply with **CDLA-Sharing-1.0** (this card), YouTube’s terms, and third-party dataset licenses you combined to build the file.

## Citation

Cite this repository and the YouTube / Spotify data sources you used, plus the notebook hash or release tag that produced the CSV.
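
Because coverage is limited to matched tracks, it can help to measure how much of a Spotify-side table actually has a YouTube feature row before modeling. The sketch below is one way to do that with `pandas.DataFrame.merge`. The two small frames, the `popularity` column, and the use of `track_id` as the join key are assumptions for illustration; the real key name depends on the pipeline version.

```python
import pandas as pd

# Hypothetical stand-ins; real tables would come from pd.read_csv.
youtube = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "view_count": [100, 250, 75],
})
spotify = pd.DataFrame({
    "track_id": ["t1", "t2", "t3", "t4", "t5"],
    "popularity": [60, 72, 41, 55, 38],
})

# validate="one_to_one" raises if either side has duplicate keys, which checks
# the "one row per matched track" expectation; indicator=True adds a _merge column.
merged = spotify.merge(
    youtube, on="track_id", how="left", validate="one_to_one", indicator=True
)

matched = merged["_merge"] == "both"
coverage = matched.mean()
unmatched_ids = merged.loc[~matched, "track_id"].tolist()

print(f"coverage: {coverage:.0%}")
print("unmatched:", unmatched_ids)
```

A left join from the Spotify side keeps unmatched tracks visible, so the gap can be reported rather than silently dropped.
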
Provider:
vancenceho