DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis
Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-05-17.
Download link:
https://data.mendeley.com/datasets/8x3xkkymzx
Resource description:
Overview:
This dataset consists of 6,015 YouTube videos of do-it-yourself (DIY) repair tutorials. Each video is accompanied by structured metadata, natural-language transcripts, community engagement signals, and inferred instructional categories. The dataset is designed to support research in multimodal content understanding, instructional video analysis, and human-computer interaction. Each video record includes:
Metadata: Titles, descriptions, durations (ISO 8601 & seconds), view/like/comment counts, published dates, and thumbnail URLs.
Transcripts: Auto-generated captions (blank if unavailable), each labeled as "instructional," "musical," "ambiguous," or "unknown".
DIY Categories: Zero‑shot labels (e.g., home repair, plumbing) from a BART‑MNLI model.
Comments: Up to 50 top‑level comments per video (JSON), along with a Boolean flag (Has_Comments).
Channel Info: Uploader ID, title, and thumbnail URL.
Engagement Metric: Calculated like-to-view ratio (Like_Count / (View_Count + 1)).
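The two derived fields in the list above (duration in seconds and the like-to-view ratio) can be reproduced with a short sketch. This is a minimal illustration, not the dataset authors' code: the function names are hypothetical, and the duration parser assumes the common `PnDTnHnMnS` subset of ISO 8601 with whole-number components. The ratio follows the formula stated above, where the `+ 1` in the denominator avoids division by zero for videos with no views.

```python
import re

# Assumed subset of ISO 8601 durations: optional days, hours, minutes, seconds
# with integer values, e.g. "PT1H2M3S" or "P1DT5M".
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration string to a total number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported ISO 8601 duration: {duration!r}")
    # Missing components are treated as zero.
    parts = {name: int(value) if value else 0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400
            + parts["hours"] * 3600
            + parts["minutes"] * 60
            + parts["seconds"])


def like_to_view_ratio(like_count: int, view_count: int) -> float:
    """Like_Count / (View_Count + 1), as defined in the dataset description."""
    return like_count / (view_count + 1)
```

For example, `iso8601_duration_to_seconds("PT1H2M3S")` gives 3723, and a video with 50 likes and 999 views has a ratio of 0.05.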
Data Collection Methods:
Video Discovery: Selenium-based scraping that mimics user scrolling to load YouTube search results dynamically across varied queries.
Metadata & Comments: Retrieved in batches (at most 50 videos per request) using the YouTube Data API.
Transcripts: Retrieved using youtube_transcript_api. Videos with no captions are marked, with future work planned for Whisper-based ASR.
Classification:
Transcript type identified using heuristics (entropy, instructional keywords).
DIY domain classification performed using zero-shot learning.
Deduplication: Maintains a persistent set of processed video IDs so that the dataset grows incrementally and contains no duplicate videos.
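Three of the steps above (deduplication against a set of processed IDs, batching at most 50 IDs per request, and an entropy signal for transcript type) can be sketched in a few lines. This is an illustration under assumptions, not the code shipped in the repository: all function names are hypothetical, and the description does not say which entropy measure, thresholds, or keywords are used, so word-level Shannon entropy stands in as one plausible choice.

```python
import math
from collections import Counter
from typing import Iterable, Iterator, List, Set

MAX_IDS_PER_REQUEST = 50  # batch limit stated in the description


def filter_new_ids(candidates: Iterable[str], processed: Set[str]) -> List[str]:
    """Return unseen video IDs in first-seen order and record them in `processed`."""
    new_ids = []
    for video_id in candidates:
        if video_id not in processed:
            processed.add(video_id)  # also removes repeats within `candidates`
            new_ids.append(video_id)
    return new_ids


def chunk_ids(ids: List[str], size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs, one slice per API request."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def word_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the word distribution in `text`.

    Assumption: highly repetitive transcripts (e.g. song lyrics) score low,
    while varied instructional speech scores higher.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A collection run would call `filter_new_ids` on each scraped list of IDs, pass each slice from `chunk_ids` to a single metadata request, and combine `word_entropy` with a keyword check to label the transcript.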
DIY-Repair-Youtube-Dataset/
│
├── data/
│ ├── video_metadata.csv # Main dataset file (6,015 rows × 19 columns)
│ └── data_dictionary.csv # Definitions of each column/field
│
├── code/
│ ├── 00_setup_environment.sh
│ ├── 01_data_collection.py
│ ├── 02_filter_new_videos.py
│ ├── 03_metadata_extraction.py
│ ├── 04_transcript_classification.py
│ ├── 05_diy_category_tagging.py
│ ├── 06_comments_channel_info.py
│ ├── 07_process_video_record.py
│ ├── 08_batch_processing.py
│ └── utils.py
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt
Note: The dataset was annotated manually, and the transcripts were reviewed.
Provider:
Mendeley Data
Created:
2025-04-23



