DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/8x3xkkymzx
Description:
Overview:
This dataset contains 6,015 YouTube DIY-repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi-round manual annotation of instructional content.
Key Components:
Metadata & Engagement
Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
Metric: Engagement_Ratio = Like_Count / (View_Count + 1)
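The two derived values in this group (duration in seconds and Engagement_Ratio) can be recomputed from the raw fields. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the duration parser assumes YouTube-style ISO 8601 strings such as "PT1H2M3S", and the stated purpose of the +1 term is an assumption.

```python
import re

# Assumption: durations use the day/hour/minute/second subset of ISO 8601,
# e.g. "PT15M" or "P1DT2H". Weeks, months and years are not handled.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def iso8601_duration_to_seconds(duration: str) -> int:
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def engagement_ratio(like_count: int, view_count: int) -> float:
    """Engagement_Ratio = Like_Count / (View_Count + 1), as defined above."""
    # Presumably the +1 keeps the ratio defined for videos with zero views.
    return like_count / (view_count + 1)
```

For example, a video with 50 likes and 999 views has an Engagement_Ratio of 50 / 1000 = 0.05.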
Transcripts:
Source: YouTube auto‑captions (empty if unavailable)
Fields: Transcript (raw text)
Manual Rounds:
TR_A1, TR_A2, TR_A3 — three independent transcript reviews (correcting major errors, marking non‑verbal segments)
TR_Final — consolidated transcript after consensus
DIY Category Annotation
Manual Rounds:
DIY_A1, DIY_A2, DIY_A3 — three independent category assignments using the annotation guide
DIY_Final — consensus category after adjudication
Coverage: 16 DIY sub‑domains (e.g., “home repair,” “plumbing,” “woodworking,” “other”)
Reliability: Inter‑annotator agreement (Fleiss’s κ = 0.76)
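The reported agreement can in principle be recomputed from the three annotation columns. The sketch below implements the standard Fleiss's kappa formula in plain Python; it is an illustration under assumptions, not the authors' script. The function name and the example labels are hypothetical, and it assumes every video was labeled by the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of per-item label lists.

    Each inner list holds the labels given to one item, for example
    [DIY_A1, DIY_A2, DIY_A3] for one video.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in count.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for count in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(count[cat] for count in counts) / total) ** 2 for cat in categories
    )

    if p_e == 1:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical labels for four videos, three annotators each.
example = [
    ["plumbing", "plumbing", "plumbing"],
    ["woodworking", "woodworking", "other"],
    ["home repair", "home repair", "home repair"],
    ["other", "other", "other"],
]
kappa = fleiss_kappa(example)
```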
Comments:
Fields: Comments (JSON array of up to 50 top‑level comments), Has_Comments (true if ≥ 20 total words)
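The Has_Comments rule can be illustrated with a short sketch. It assumes that each element of the Comments array is a plain string and that words are counted by splitting on whitespace; neither detail is specified above, and the actual file may store each comment as an object. The function name is hypothetical.

```python
import json


def has_comments(comments_json: str, min_words: int = 20) -> bool:
    """Return True if the comments contain at least `min_words` words in total."""
    if not comments_json:
        return False  # an empty cell is treated as "no comments"
    comments = json.loads(comments_json)
    # Assumption: whitespace-delimited tokens count as words.
    total_words = sum(len(str(comment).split()) for comment in comments)
    return total_words >= min_words
```

Under these assumptions, ten comments of two words each reach the threshold exactly, while a single one-word comment does not.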
Channel Context:
Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL
Annotation Methodology:
1. Stratified Subset Selection
A subset of 180 videos was sampled to represent all DIY categories proportionally.
2. Annotation Guide
A concise manual defined each DIY category and outlined transcription conventions.
3. Independent Annotations
Three team members independently performed Rounds 1–3 (DIY_A1–DIY_A3 and TR_A1–TR_A3), each without access to the others' labels.
4. Consensus Adjudication
For each video, a fourth pass produced DIY_Final and TR_Final: the agreed-upon category label and the corrected transcript.
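Steps 1 and 4 above can be sketched in code. This is an illustration under stated assumptions, not the authors' procedure: the source does not say how the proportional sample was drawn or how consensus was reached, so the largest-remainder allocation, the fixed random seed, and the majority-vote pre-filter are all hypothetical choices, as are the function names.

```python
import random
from collections import Counter, defaultdict


def proportional_allocation(category_sizes, sample_size):
    """Split `sample_size` across categories in proportion to their sizes."""
    total = sum(category_sizes.values())
    exact = {cat: n * sample_size / total for cat, n in category_sizes.items()}
    alloc = {cat: int(quota) for cat, quota in exact.items()}
    # Largest-remainder rule: hand leftover slots to the biggest fractions.
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - alloc[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        alloc[cat] += 1
    return alloc


def stratified_sample(videos, sample_size, seed=0):
    """Draw a proportional stratified sample from (video_id, category) pairs."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for video_id, category in videos:
        by_category[category].append(video_id)
    alloc = proportional_allocation(
        {cat: len(ids) for cat, ids in by_category.items()}, sample_size
    )
    sample = []
    for cat in sorted(by_category):
        sample.extend(rng.sample(by_category[cat], alloc[cat]))
    return sample


def majority_label(labels):
    """Return the most common label, or None when the top labels are tied.

    A None result marks a video that still needs manual adjudication.
    """
    (top, top_count), *rest = Counter(labels).most_common()
    if rest and rest[0][1] == top_count:
        return None
    return top
```

With three annotators, majority_label returns a label whenever at least two agree and returns None only when all three disagree, which is the case a human adjudicator would have to resolve.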
Dataset directory structure:
DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv    # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv   # Definitions of each column/field
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt
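A minimal sketch of reading the main file with the Python standard library follows. It assumes that the CSV header uses the field names listed under Key Components and that the count columns hold numeric text; check data_dictionary.csv for the actual definitions. The function name and the constant are hypothetical.

```python
import csv

# Assumption: these header names match the fields listed under Key Components.
NUMERIC_COLUMNS = ("View_Count", "Like_Count", "Comment_Count")


def parse_video_metadata(lines):
    """Parse CSV lines into dicts, converting count columns to int (or None)."""
    rows = []
    for row in csv.DictReader(lines):
        for column in NUMERIC_COLUMNS:
            value = row.get(column)
            # Empty cells become None; int(float(...)) also accepts "12.0".
            row[column] = int(float(value)) if value else None
        rows.append(row)
    return rows


# Typical use, with the path taken from the directory tree:
#
#     with open("data/video_metadata.csv", newline="", encoding="utf-8") as handle:
#         videos = parse_video_metadata(handle)
```

The parser takes any iterable of lines rather than a path, so it can be tried on a few in-memory rows before the full file is downloaded.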
Note: The dataset was annotated manually, and the transcripts were manually reviewed.
Created:
2025-07-10



