DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://data.mendeley.com/datasets/8x3xkkymzx


Description:

Overview:
This dataset contains 6,015 YouTube DIY-repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi-round manual annotation of instructional content.

Key Components:

Metadata & Engagement
  - Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
  - Metric: Engagement_Ratio = Like_Count / (View_Count + 1)

Transcripts
  - Source: YouTube auto-captions (empty if unavailable)
  - Fields: Transcript (raw text)
  - Manual Rounds: TR_A1, TR_A2, TR_A3 are three independent transcript reviews (correcting major errors, marking non-verbal segments)
  - TR_Final: consolidated transcript after consensus

DIY Category Annotation
  - Manual Rounds: DIY_A1, DIY_A2, DIY_A3 are three independent category assignments using the annotation guide
  - DIY_Final: consensus category after adjudication
  - Coverage: 16 DIY sub-domains (e.g., “home repair,” “plumbing,” “woodworking,” “other”)
  - Reliability: Inter-annotator agreement (Fleiss’s κ = 0.76)

Comments
  - Fields: Comments (JSON array of up to 50 top-level comments), Has_Comments (true if ≥ 20 total words)

Channel Context
  - Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL

Annotation Methodology:
1. Stratified Subset Selection: A subset of 180 videos was sampled to represent all DIY categories proportionally.
2. Annotation Guide: A concise manual defined each DIY category and outlined transcription conventions.
3. Independent Annotations: Three team members performed Rounds 1–3 (DIY_A1–3 and TR_A1–3) without access to others’ labels.
4. Consensus Adjudication: For each video, a fourth pass produced DIY_Final and TR_Final, the agreed-upon labels and corrected transcript.
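The derived fields described above (duration in seconds, Engagement_Ratio, and Has_Comments) could be recomputed along the following lines. This is a minimal sketch, not the dataset authors' code: the function names are hypothetical, the duration parser assumes the day/hour/minute/second subset of ISO 8601 that YouTube durations normally use, and Has_Comments assumes that the Comments field is a JSON array of plain strings and that the 20-word threshold is counted across all comments combined.

```python
import json
import re

# Assumed subset of ISO 8601 durations: days, hours, minutes, seconds (e.g. "PT1H2M3S").
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(iso_duration: str) -> int:
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION_RE.match(iso_duration)
    if match is None:
        raise ValueError(f"Unsupported duration format: {iso_duration!r}")
    parts = {name: int(value) if value else 0 for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def engagement_ratio(like_count: int, view_count: int) -> float:
    """Engagement_Ratio = Like_Count / (View_Count + 1), as given in the description.

    The +1 in the denominator avoids division by zero for videos with no views.
    """
    return like_count / (view_count + 1)


def has_comments(comments_json: str, min_words: int = 20) -> bool:
    """Return True if the comments contain at least `min_words` words in total.

    Assumption: `comments_json` is a JSON array of strings; an empty value
    is treated as "no comments".
    """
    comments = json.loads(comments_json) if comments_json else []
    total_words = sum(len(str(comment).split()) for comment in comments)
    return total_words >= min_words
```

For instance, `duration_to_seconds("PT1H2M3S")` yields 3723 and `engagement_ratio(50, 99)` yields 0.5. If the real Comments entries turn out to be objects rather than strings, `has_comments` would need to read the text field from each object instead.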
DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv     # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv    # Definitions of each column/field
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was annotated manually, and the transcripts were manually reviewed.
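To check an agreement figure such as the reported Fleiss's κ on the three independent category rounds, one could compute the statistic directly from the DIY_A1, DIY_A2, and DIY_A3 columns of video_metadata.csv. The sketch below is an illustration under assumptions rather than the authors' procedure: `fleiss_kappa` is a hypothetical helper implementing the standard formula, and the `toy_rows` values are invented examples standing in for rows read with `csv.DictReader`.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa for items that were each labeled by the same number of raters.

    `ratings[i]` holds the labels assigned to item i (one label per rater).
    """
    if not ratings:
        raise ValueError("ratings must contain at least one item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(labels) != n_raters for labels in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    # Agreement expected by chance from the overall category proportions.
    expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")  # Undefined when only one category was ever used.
    return (observed - expected) / (1.0 - expected)


# Invented rows for illustration only; real rows would come from video_metadata.csv.
toy_rows = [
    {"DIY_A1": "plumbing", "DIY_A2": "plumbing", "DIY_A3": "plumbing"},
    {"DIY_A1": "woodworking", "DIY_A2": "woodworking", "DIY_A3": "home repair"},
    {"DIY_A1": "home repair", "DIY_A2": "home repair", "DIY_A3": "home repair"},
]
toy_ratings = [[row["DIY_A1"], row["DIY_A2"], row["DIY_A3"]] for row in toy_rows]
toy_kappa = fleiss_kappa(toy_ratings)
```

When working with the real file, rows whose DIY_A1 to DIY_A3 cells are empty would have to be filtered out first, since the formula requires the same number of ratings for every item.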

Created:

2025-07-10
