DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Name: DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis
Creator: Mendeley Data
Published: 2025-05-01 05:18:08
License: No description provided

Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-05-17.

Download link:

https://data.mendeley.com/datasets/8x3xkkymzx

Description:

Overview: This dataset consists of 6,015 YouTube videos focused on do-it-yourself (DIY) repair tutorials. Each video is accompanied by structured metadata, natural-language transcripts, community engagement signals, and inferred instructional categories. The dataset is designed to support research in multimodal content understanding, instructional video analysis, and human-computer interaction.

Each video record includes:

- Metadata: titles, descriptions, durations (ISO 8601 and seconds), view/like/comment counts, publication dates, and thumbnail URLs.
- Transcripts: auto-generated captions (blank if not available), each labeled as "instructional," "musical," "ambiguous," or "unknown."
- DIY Categories: zero-shot labels (e.g., home repair, plumbing) from a BART-MNLI model.
- Comments: up to 50 top-level comments per video (JSON), along with a Boolean flag (Has_Comments).
- Channel Info: uploader ID, title, and thumbnail URL.
- Engagement Metric: like-to-view ratio, calculated as Like_Count / (View_Count + 1).

Data Collection Methods:

- Video Discovery: Selenium-based scraping that mimics real-time user scrolling to dynamically load YouTube search results across varied queries.
- Metadata & Comments: retrieved in batches (max 50 videos per request) using the YouTube Data API.
- Transcripts: retrieved using youtube_transcript_api. Videos with no captions are marked as such; Whisper-based ASR is planned as future work.
- Classification: transcript type is identified using heuristics (entropy, instructional keywords); DIY domain classification is performed using zero-shot learning.
- Deduplication: a running set of processed video IDs is maintained so that the dataset grows incrementally without duplicate entries.
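The duration and engagement fields described above could be derived roughly as follows. This is a minimal sketch, not the dataset authors' code: the function names `iso8601_to_seconds` and `like_to_view_ratio` are hypothetical, and the parser only handles the day/hour/minute/second form of ISO 8601 durations (e.g. "PT1H2M3S"). Only the formula Like_Count / (View_Count + 1) comes from the description.

```python
import re

# Matches ISO 8601 durations limited to days, hours, minutes and seconds,
# e.g. "PT4M13S" or "P1DT2H". Other ISO 8601 forms are not handled here.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_to_seconds(duration: str) -> int:
    """Convert a duration string such as 'PT1H2M3S' to whole seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    # Missing components (e.g. no hours) count as zero.
    parts = {name: int(value) if value else 0
             for name, value in match.groupdict().items()}
    return (parts["days"] * 86400 + parts["hours"] * 3600
            + parts["minutes"] * 60 + parts["seconds"])


def like_to_view_ratio(like_count: int, view_count: int) -> float:
    """Like_Count / (View_Count + 1), as given in the dataset description.

    The +1 in the denominator keeps the ratio defined for videos with
    zero views.
    """
    return like_count / (view_count + 1)
```

For example, `iso8601_to_seconds("PT1H2M3S")` yields 3723, and a video with 50 likes and 999 views has a ratio of 0.05.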
DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv          # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv         # Definitions of each column/field
│
├── code/
│   ├── 00_setup_environment.sh
│   ├── 01_data_collection.py
│   ├── 02_filter_new_videos.py
│   ├── 03_metadata_extraction.py
│   ├── 04_transcript_classification.py
│   ├── 05_diy_category_tagging.py
│   ├── 06_comments_channel_info.py
│   ├── 07_process_video_record.py
│   ├── 08_batch_processing.py
│   └── utils.py
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was annotated manually, and the transcripts were reviewed.
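The deduplication and batching steps named in the data collection methods could be sketched as follows. This is an illustration under stated assumptions, not the contents of 02_filter_new_videos.py or 08_batch_processing.py: the function names `filter_new_videos` and `chunk_ids` are hypothetical, and the actual API request is left out. Only the limit of 50 videos per request and the idea of a running set of processed IDs come from the description.

```python
from typing import Iterable, Iterator, List, Set

# Batch size stated in the dataset description (max 50 videos per request).
MAX_IDS_PER_REQUEST = 50


def filter_new_videos(candidate_ids: Iterable[str],
                      processed_ids: Set[str]) -> List[str]:
    """Return IDs not seen before, preserving order, and record them.

    `processed_ids` is updated in place so that repeated calls only ever
    yield videos that are new to the dataset.
    """
    new_ids: List[str] = []
    for video_id in candidate_ids:
        if video_id not in processed_ids:
            processed_ids.add(video_id)
            new_ids.append(video_id)
    return new_ids


def chunk_ids(video_ids: List[str],
              size: int = MAX_IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs, one per request."""
    for start in range(0, len(video_ids), size):
        yield video_ids[start:start + size]
```

In use, freshly scraped IDs would first pass through `filter_new_videos`, and each list produced by `chunk_ids` would then be sent as one metadata request. Persisting `processed_ids` between runs (for instance as a text file of IDs) is one way to keep collection incremental.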

Provider:

Mendeley Data

Created:

2025-04-23
