DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://data.mendeley.com/datasets/8x3xkkymzx
Resource description:
Overview:
This dataset contains 6,015 YouTube DIY‑repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi‑round manual annotation of instructional content.
Key Components:
Metadata & Engagement
Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
Metric: Engagement_Ratio = Like_Count / (View_Count + 1)
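The two derived values in this block can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function names are hypothetical, the duration parser assumes the day/hour/minute/second form of ISO 8601 that YouTube typically reports (for example "PT4M13S"), and the +1 in the denominator is presumably there to avoid division by zero for videos with no views.

```python
import re

# Matches ISO 8601 durations with optional day, hour, minute and second parts,
# e.g. "PT4M13S", "PT1H2M3S" or "P1DT5M". Weeks, months and years are not handled.
_ISO_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_to_seconds(duration: str) -> int:
    """Convert a duration such as 'PT4M13S' to whole seconds."""
    match = _ISO_DURATION.match(duration)
    if match is None or duration in ("P", "PT"):
        raise ValueError(f"unsupported ISO 8601 duration: {duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def engagement_ratio(like_count: int, view_count: int) -> float:
    """Engagement_Ratio = Like_Count / (View_Count + 1)."""
    return like_count / (view_count + 1)
```

Recomputing both values from the raw columns is a cheap consistency check before relying on the stored ones.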
Transcripts:
Source: YouTube auto‑captions (empty if unavailable)
Fields: Transcript (raw text)
Manual Rounds:
TR_A1, TR_A2, TR_A3 — three independent transcript reviews (correcting major errors, marking non‑verbal segments)
TR_Final — consolidated transcript after consensus
DIY Category Annotation
Manual Rounds:
DIY_A1, DIY_A2, DIY_A3 — three independent category assignments using the annotation guide
DIY_Final — consensus category after adjudication
Coverage: 16 DIY sub‑domains (e.g., “home repair,” “plumbing,” “woodworking,” “other”)
Reliability: Inter‑annotator agreement (Fleiss’s κ = 0.76)
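For readers who want to recompute the agreement statistic from DIY_A1, DIY_A2 and DIY_A3, the following is a minimal sketch of the standard Fleiss's kappa formula in plain Python. The function name and the input layout (one list of labels per video) are assumptions made for illustration, and the source does not state which videos or which implementation produced the reported 0.76.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss's kappa for items that were each rated by the same number of annotators.

    `ratings` holds one inner list per item, e.g. [DIY_A1, DIY_A2, DIY_A3].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    observed_sum = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Fraction of ordered annotator pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = observed_sum / n_items
    total_ratings = n_items * n_raters
    # Chance agreement from the overall share of each category.
    p_expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)
```

A call would look like `fleiss_kappa([[row["DIY_A1"], row["DIY_A2"], row["DIY_A3"]] for row in annotated_rows])`, where `annotated_rows` is a hypothetical list of row dictionaries restricted to the manually annotated videos.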
Comments:
Fields: Comments (JSON array of up to 50 top‑level comments), Has_Comments (true if ≥ 20 total words)
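The Has_Comments rule can be sketched as follows. This is an illustration under assumptions the source does not confirm: the Comments cell is taken to be a JSON array of plain strings, words are counted by splitting on whitespace, and the function name is hypothetical.

```python
import json


def has_comments(comments_json: str, min_words: int = 20) -> bool:
    """Return True when the stored comments contain at least `min_words` words in total."""
    if not comments_json.strip():
        return False  # an empty cell is treated as "no comments"
    comments = json.loads(comments_json)
    # Whitespace tokenization is an assumption; the dataset may count words differently.
    total_words = sum(len(str(comment).split()) for comment in comments)
    return total_words >= min_words
```

If the array turns out to hold objects rather than strings, the word count should be taken over the text field of each object instead.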
Channel Context:
Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL
Annotation Methodology:
1. Stratified Subset Selection
A subset of 180 videos was sampled to represent all DIY categories proportionally.
2. Annotation Guide
A concise manual defined each DIY category and outlined transcription conventions.
3. Independent Annotations
Three team members performed Rounds 1–3 (DIY_A1–3 and TR_A1–3) without access to one another’s labels.
4. Consensus Adjudication
For each video, a fourth pass produced DIY_Final and TR_Final, the agreed‑upon category label and corrected transcript.
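Step 1 above describes proportional stratified sampling. One way to draw such a subset is sketched below, using largest-remainder allocation so that the per-category quotas add up exactly to the requested size. This is not the authors' procedure: `labels` is a hypothetical mapping from Video_ID to a provisional category, and the seed and function name are illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(labels: dict[str, str], sample_size: int, seed: int = 0) -> list[str]:
    """Pick `sample_size` video IDs with per-category quotas proportional to category size."""
    total = len(labels)
    if not 0 < sample_size <= total:
        raise ValueError("sample_size must be between 1 and the number of videos")

    by_category = defaultdict(list)
    for video_id, category in labels.items():
        by_category[category].append(video_id)

    # Largest-remainder allocation in integer arithmetic: floor each proportional
    # quota, then give the leftover slots to the largest fractional remainders.
    quotas = {c: len(ids) * sample_size // total for c, ids in by_category.items()}
    remainders = {c: len(ids) * sample_size % total for c, ids in by_category.items()}
    leftover = sample_size - sum(quotas.values())
    for category in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[category] += 1

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = []
    for category in sorted(by_category):
        sample.extend(rng.sample(by_category[category], quotas[category]))
    return sample
```

With proportional quotas, very small categories can receive zero slots; a minimum of one video per category would be a reasonable variation if every sub-domain must appear.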
DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv    # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv   # Definitions of each column/field
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt
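A minimal sketch for loading the main file with only the Python standard library, assuming the layout shown above, UTF-8 encoding and a header row. The function names are hypothetical. The raised field size limit is a precaution, because long transcripts or comment arrays may exceed the csv module's default limit.

```python
import csv

# Transcript and Comments cells can be long; raise the per-field limit as a precaution.
csv.field_size_limit(10_000_000)


def read_rows(lines) -> list[dict[str, str]]:
    """Parse CSV text (an open file or any iterable of lines) into row dictionaries."""
    return list(csv.DictReader(lines))


def load_metadata(csv_path: str = "data/video_metadata.csv") -> list[dict[str, str]]:
    """Read the main dataset file; the default path follows the tree above."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return read_rows(handle)


def has_expected_shape(rows, expected_rows: int = 6015, expected_columns: int = 19) -> bool:
    """Check the row and column counts against the figures stated in the listing."""
    return len(rows) == expected_rows and all(len(row) == expected_columns for row in rows)
```

Calling `has_expected_shape(load_metadata())` from the repository root is a quick way to confirm that a download is complete.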
Note: The dataset was annotated manually, and the transcripts were manually reviewed.
Created:
2025-07-10



