DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Indexed from Mendeley Data on 2026-04-18

Download link:

https://data.mendeley.com/datasets/8x3xkkymzx


Description:

Overview: This dataset contains 6,015 YouTube DIY-repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi-round manual annotation of instructional content.

Key Components:

Metadata & Engagement
  Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
  Metric: Engagement_Ratio = Like_Count / (View_Count + 1)

Transcripts
  Source: YouTube auto-captions (empty if unavailable)
  Fields: Transcript (raw text)
  Manual Rounds:
    TR_A1, TR_A2, TR_A3: three independent transcript reviews (correcting major errors, marking non-verbal segments)
    TR_Final: consolidated transcript after consensus

DIY Category Annotation
  Manual Rounds:
    DIY_A1, DIY_A2, DIY_A3: three independent category assignments using the annotation guide
    DIY_Final: consensus category after adjudication
  Coverage: 16 DIY sub-domains (e.g., "home repair," "plumbing," "woodworking," "other")
  Reliability: Inter-annotator agreement (Fleiss's κ = 0.76)

Comments
  Fields: Comments (JSON array of up to 50 top-level comments), Has_Comments (true if ≥ 20 total words)

Channel Context
  Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL

Annotation Methodology:

1. Stratified Subset Selection: A subset of 180 videos was sampled to represent all DIY categories proportionally.
2. Annotation Guide: A concise manual defined each DIY category and outlined transcription conventions.
3. Independent Annotations: Three team members performed Rounds 1–3 (DIY_A1–3 and TR_A1–3) without access to others' labels.
4. Consensus Adjudication: For each video, a fourth pass produced DIY_Final and TR_Final, the agreed-upon labels and corrected transcript.
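The derived fields described above (duration in seconds, Engagement_Ratio, Has_Comments) could be recomputed along the following lines. This is a minimal sketch, not the authors' code: the function names are hypothetical, the Comments field is assumed to hold a JSON array of plain strings, and "total words" is assumed to mean whitespace-separated tokens summed over all comments.

```python
import json
import re

# YouTube-style ISO 8601 durations look like "PT1H2M3S". A day component is
# also accepted here; other ISO 8601 designators are not handled.
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(iso_duration: str) -> int:
    """Convert an ISO 8601 duration such as 'PT4M13S' to whole seconds."""
    match = _DURATION_RE.match(iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def engagement_ratio(like_count: int, view_count: int) -> float:
    """Engagement_Ratio = Like_Count / (View_Count + 1), as defined above.

    The +1 keeps the ratio defined for videos with zero views.
    """
    return like_count / (view_count + 1)


def has_comments(comments_json: str, min_words: int = 20) -> bool:
    """Return True if the comments contain at least `min_words` words in total.

    Assumption: `comments_json` is a JSON array of strings; an empty cell
    is treated as "no comments".
    """
    comments = json.loads(comments_json) if comments_json else []
    total_words = sum(len(str(comment).split()) for comment in comments)
    return total_words >= min_words
```

For example, duration_to_seconds("PT1H2M3S") gives 3723, and a video with 50 likes and 999 views has an engagement ratio of 50 / 1000.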
Directory structure:

DIY-Repair-Youtube-Dataset/
├── data/
│   ├── video_metadata.csv    # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv   # Definitions of each column/field
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was annotated manually, and the transcripts were manually reviewed.
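To make the reliability figure concrete, the sketch below reads the three independent category columns (DIY_A1, DIY_A2, DIY_A3) from a CSV and computes Fleiss's kappa using the standard formula. It is an illustration under assumptions, not the authors' procedure: the helper names are hypothetical, and rows without all three labels are assumed to have empty cells and are skipped.

```python
import csv
from collections import Counter


def annotated_labels(csv_file, columns=("DIY_A1", "DIY_A2", "DIY_A3")):
    """Collect per-video label triples from an open CSV file object.

    Assumption: videos outside the annotated subset have empty cells in
    these columns, so only fully labelled rows are kept.
    """
    labels = []
    for row in csv.DictReader(csv_file):
        item = [(row.get(column) or "").strip() for column in columns]
        if all(item):
            labels.append(item)
    return labels


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of rater labels.

    Every item must be rated by the same number of raters (at least two).
    Returns NaN when chance agreement is 1 (only one category in use).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of agreeing rater pairs for this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

Usage would look like opening data/video_metadata.csv with open(path, newline="", encoding="utf-8"), passing the file object to annotated_labels, and passing the result to fleiss_kappa. Whether this reproduces the reported κ = 0.76 depends on how the published file stores the annotation rounds.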


Created:

2025-07-10
