
DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Source: DataCite Commons | Updated: 2025-05-01 | Indexed: 2025-05-17
Download link:
https://data.mendeley.com/datasets/8x3xkkymzx
Resource description:
Overview: This dataset consists of 6,015 YouTube videos focused on do-it-yourself (DIY) repair tutorials. Each video is accompanied by structured metadata, natural-language transcripts, community engagement signals, and inferred instructional categories. The dataset is designed to support research in multimodal content understanding, instructional video analysis, and human-computer interaction.

Each video record includes:

Metadata: Titles, descriptions, durations (ISO 8601 and seconds), view/like/comment counts, publication dates, and thumbnail URLs.
Transcripts: Auto-captions (blank if not available), labeled as "instructional," "musical," "ambiguous," or "unknown."
DIY Categories: Zero-shot labels (e.g., home repair, plumbing) from a BART-MNLI model.
Comments: Up to 50 top-level comments per video (JSON), along with a Boolean flag (Has_Comments).
Channel Info: Uploader ID, title, and thumbnail URL.
Engagement Metric: Like-to-view ratio, calculated as Like_Count / (View_Count + 1).

Data Collection Methods:

Video Discovery: Selenium-based scraping that mimics real-time user scrolling to dynamically load YouTube search results across varied queries.
Metadata & Comments: Retrieved in batches (max 50 videos per request) using the YouTube Data API.
Transcripts: Retrieved using youtube_transcript_api. Videos with no captions are marked as such; Whisper-based ASR is planned as future work.
Classification: Transcript type is identified using heuristics (entropy, instructional keywords). DIY domain classification is performed using zero-shot learning.
Deduplication: A running set of processed video IDs is maintained so that the dataset grows incrementally without duplicate entries.
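The engagement ratio, the 50-video batching, and the ID-based deduplication described above can be illustrated with a minimal Python sketch. This is not the dataset authors' code: the function names like_to_view_ratio, batched, and filter_new_ids are hypothetical, and only the formula and the batch limit of 50 come from the description.

```python
from typing import Iterable, Iterator, List, Set

# The description states a maximum of 50 videos per API request.
API_BATCH_SIZE = 50


def like_to_view_ratio(like_count: int, view_count: int) -> float:
    """Like_Count / (View_Count + 1); the +1 avoids division by zero."""
    return like_count / (view_count + 1)


def batched(video_ids: List[str], size: int = API_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs (hypothetical helper)."""
    for start in range(0, len(video_ids), size):
        yield video_ids[start:start + size]


def filter_new_ids(candidates: Iterable[str], seen: Set[str]) -> List[str]:
    """Return IDs not yet in `seen`, in order, and record them in `seen`.

    Illustrates incremental deduplication with a running set of processed
    IDs; how the actual pipeline persists that set is not stated.
    """
    fresh = []
    for video_id in candidates:
        if video_id not in seen:
            seen.add(video_id)
            fresh.append(video_id)
    return fresh
```

Under these assumptions, a collection run would pass newly discovered IDs through filter_new_ids first and then send each slice from batched to the metadata request step, so a video is never fetched twice.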
DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv        # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv       # Definitions of each column/field
│
├── code/
│   ├── 00_setup_environment.sh
│   ├── 01_data_collection.py
│   ├── 02_filter_new_videos.py
│   ├── 03_metadata_extraction.py
│   ├── 04_transcript_classification.py
│   ├── 05_diy_category_tagging.py
│   ├── 06_comments_channel_info.py
│   ├── 07_process_video_record.py
│   ├── 08_batch_processing.py
│   └── utils.py
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was manually annotated, and the transcripts were reviewed.
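Given the layout above, the main CSV can be read with the Python standard library. The sketch below is an assumption-laden illustration: load_video_rows and with_comments are hypothetical helpers, and the text encoding of the Has_Comments flag (accepted here as "true" or "1", case-insensitive) is a guess that should be checked against data_dictionary.csv.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_video_rows(csv_path: str) -> List[Dict[str, str]]:
    """Read the metadata CSV into a list of dicts, one per video row."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def with_comments(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows whose Has_Comments flag reads as true.

    Assumption: the flag is stored as text such as "True" or "1";
    the real encoding is not given in the description.
    """
    truthy = {"true", "1"}
    return [
        row for row in rows
        if str(row.get("Has_Comments", "")).strip().lower() in truthy
    ]
```

For example, calling load_video_rows("data/video_metadata.csv") from the repository root and passing the result to with_comments would return only the videos for which comments were collected.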
Provider:
Mendeley Data
Created:
2025-04-23