vancenceho/youtube-spotify-audio-features
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/vancenceho/youtube-spotify-audio-features
Resource description:
---
license: cdla-sharing-1.0
language:
- en
tags:
- music
pretty_name: Spotify-YouTube Audio Features
size_categories:
- 10K<n<100K
---
# Spotify–YouTube Audio Features
Tabular **librosa** audio features for tracks aligned with the Spotify / YouTube pipeline in the *viral-content-predictor* project. Each row is one Spotify `track_id` matched to a downloaded YouTube audio clip; features are aggregated statistics (mean / std) computed on the decoded waveform.
## Files
| File | Description |
|------|-------------|
| `audio_features.csv` | One row per track: `track_id`, 89 derived feature dimensions (means/stds), `extraction_success`, `error_message`. |
## Data instance
- **Format:** CSV (header row).
- **Rows:** on the order of **tens of thousands** of tracks (exact count may change as the extraction job is updated).
- **Key column:** `track_id` — Spotify track identifier, join key to other project tables.
- **Label / target:** not included; this file is **features only**.
## Feature groups (89 dimensions)
Features are extracted with **librosa** (notebook `02b_youtube_audio_feature_extraction.ipynb`) from the MP3s produced by the project’s YouTube audio download step. Groups include:
- Tempo (1)
- RMS energy (2)
- Zero crossing rate (2)
- Spectral centroid, rolloff, bandwidth (6)
- Spectral contrast — 7 bands (14)
- MFCC — 13 coefficients (26)
- Chroma — 12 pitch classes (24)
- Tonnetz, 6 tonal centroid dimensions (12)
- Onset strength (2)
Every group except tempo contributes a **mean** and a **std** column per dimension; tempo is a single value. See the CSV header for the exact column names.
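As a quick sanity check, the group sizes listed above can be tallied to confirm that they add up to 89. The sketch below assumes two statistics (mean and std) per dimension for every group except tempo, which the counts imply is a single value. The dictionary keys are descriptive labels chosen for illustration, not column names from the CSV.

```python
# Illustrative tally of the feature groups listed above.
# Keys are descriptive labels only; the real column names are in the CSV header.
# Each value is (number of dimensions, statistics per dimension).
FEATURE_GROUPS = {
    "tempo": (1, 1),  # single value, no mean/std pair
    "rms_energy": (1, 2),
    "zero_crossing_rate": (1, 2),
    "spectral_centroid_rolloff_bandwidth": (3, 2),
    "spectral_contrast": (7, 2),
    "mfcc": (13, 2),
    "chroma": (12, 2),
    "tonnetz": (6, 2),
    "onset_strength": (1, 2),
}

# Columns contributed by each group, then the overall total.
columns_per_group = {
    name: dims * stats for name, (dims, stats) in FEATURE_GROUPS.items()
}
total_features = sum(columns_per_group.values())
print(total_features)  # 89
```

A similar count against the real CSV header (all columns minus `track_id`, `extraction_success`, and `error_message`) is a cheap way to notice a changed schema.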
Additional columns:
- **`extraction_success`** — boolean-like flag indicating whether feature extraction completed for that file.
- **`error_message`** — empty or diagnostic text when extraction failed.
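
Because the flag is only described as boolean-like, its exact encoding in the CSV is not stated here. The following is a minimal sketch of one way to separate usable rows from failures. It runs on a tiny made-up sample; `feature_a` is a placeholder column, and the set of accepted truthy spellings is an assumption to adjust after inspecting the real file.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for audio_features.csv.
# `feature_a` is a placeholder, not a real column name.
sample_csv = io.StringIO(
    "track_id,feature_a,extraction_success,error_message\n"
    "t1,0.12,True,\n"
    "t2,,False,could not decode file\n"
    "t3,0.40,1,\n"
)
df = pd.read_csv(sample_csv)

# Normalise the boolean-like flag; the accepted spellings are an assumption.
truthy = {"true", "1", "yes"}
ok = df["extraction_success"].astype(str).str.strip().str.lower().isin(truthy)

# Rows with usable features, without the bookkeeping columns.
features_ok = df[ok].drop(columns=["extraction_success", "error_message"])

# Failed rows, kept with their diagnostic text for inspection.
failures = df.loc[~ok, ["track_id", "error_message"]]
```

Keeping the failed rows in a separate frame, rather than discarding them, makes it easier to see which error messages recur.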
## Provenance
- **Audio source:** user-downloaded YouTube audio (`.mp3`) matched to Spotify metadata elsewhere in the pipeline.
- **Processing:** Python, **librosa**; optional multiprocessing for throughput.
- **This artifact:** not the raw audio; only numeric summaries suitable for modeling and joins.
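
To make "numeric summaries" concrete: a frame-wise feature is a matrix of shape (dimensions, frames), and each stored column is a mean or std of one dimension over time. The sketch below illustrates that aggregation with NumPy only. It is not the notebook's code: the random matrix stands in for a 13-coefficient MFCC matrix that librosa would produce, and the function name `summarize` and the column-naming scheme are hypothetical.

```python
import numpy as np


def summarize(frames: np.ndarray, prefix: str) -> dict:
    """Collapse a (n_dims, n_frames) feature matrix to per-dimension mean/std."""
    means = frames.mean(axis=1)  # average over the time axis
    stds = frames.std(axis=1)    # spread over the time axis
    out = {}
    for i, (m, s) in enumerate(zip(means, stds)):
        # Hypothetical naming scheme; the real names are in the CSV header.
        out[f"{prefix}_{i}_mean"] = float(m)
        out[f"{prefix}_{i}_std"] = float(s)
    return out


rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(13, 500))  # stand-in for a real MFCC matrix
row = summarize(fake_mfcc, "mfcc")
print(len(row))  # 26
```

Thirteen dimensions with two statistics each yield the 26 MFCC columns counted in the feature groups.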
## Usage
```python
import pandas as pd
df = pd.read_csv("audio_features.csv")
# Join on track_id with Spotify / YouTube tables in your project
```
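
The join mentioned in the comment above could look like the following sketch. Both frames are made-up stand-ins: `feature_a` and `popularity` are placeholder columns, and the real Spotify or YouTube tables in your project may be named and shaped differently.

```python
import pandas as pd

# Made-up stand-ins for the feature file and a Spotify metadata table.
features = pd.DataFrame(
    {"track_id": ["t1", "t2", "t3"], "feature_a": [0.1, 0.2, 0.3]}
)
spotify = pd.DataFrame(
    {"track_id": ["t1", "t3", "t4"], "popularity": [55, 70, 12]}
)

# Left join keeps every metadata row; `indicator` adds a `_merge` column
# showing which rows have no matching audio features.
merged = spotify.merge(
    features,
    on="track_id",
    how="left",
    validate="one_to_one",  # raises if track_id is duplicated on either side
    indicator=True,
)
missing_audio = merged.loc[merged["_merge"] == "left_only", "track_id"].tolist()
print(missing_audio)  # ['t4']
```

A left join with `indicator=True` makes the coverage gap visible instead of silently dropping tracks, which an inner join would do.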
## Limitations
- Coverage is limited to tracks with a successful YouTube match and a readable audio file.
- Feature definitions follow librosa defaults; hyperparameters (e.g. hop length, FFT size) are fixed in the extraction notebook.
## License
This dataset card specifies **CDLA-Sharing-1.0**. Ensure your redistribution of the CSV and any derived data complies with that license and with the licenses of the underlying **Spotify** and **YouTube** data you used to build the file.
Provider:
vancenceho



