
DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Mendeley Data, indexed 2026-04-18
Download link:
https://data.mendeley.com/datasets/8x3xkkymzx
Description:
Overview: This dataset contains 6,015 YouTube DIY-repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi-round manual annotation of instructional content.

Key Components:

Metadata & Engagement
  Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
  Metric: Engagement_Ratio = Like_Count / (View_Count + 1)

Transcripts
  Source: YouTube auto-captions (empty if unavailable)
  Fields: Transcript (raw text)
  Manual Rounds:
    TR_A1, TR_A2, TR_A3: three independent transcript reviews (correcting major errors, marking non-verbal segments)
    TR_Final: consolidated transcript after consensus

DIY Category Annotation
  Manual Rounds:
    DIY_A1, DIY_A2, DIY_A3: three independent category assignments using the annotation guide
    DIY_Final: consensus category after adjudication
  Coverage: 16 DIY sub-domains (e.g., "home repair," "plumbing," "woodworking," "other")
  Reliability: Inter-annotator agreement (Fleiss' κ = 0.76)

Comments
  Fields: Comments (JSON array of up to 50 top-level comments), Has_Comments (true if ≥ 20 total words)

Channel Context
  Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL

Annotation Methodology:
  1. Stratified Subset Selection: A subset of 180 videos was sampled to represent all DIY categories proportionally.
  2. Annotation Guide: A concise manual defined each DIY category and outlined transcription conventions.
  3. Independent Annotations: Three team members performed Rounds 1-3 (DIY_A1-3 and TR_A1-3) without access to others' labels.
  4. Consensus Adjudication: For each video, a fourth pass produced DIY_Final and TR_Final, the agreed-upon labels and corrected transcript.
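The reported reliability figure is a Fleiss' κ over the three independent category assignments (DIY_A1, DIY_A2, DIY_A3). As a rough illustration of how such a statistic can be computed, here is a minimal pure-Python sketch. The function name `fleiss_kappa` and the toy labels are assumptions made for this example and are not taken from the dataset; this is not necessarily the procedure the dataset authors used.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of per-item label lists, e.g. one [DIY_A1, DIY_A2, DIY_A3]
    triple per video (hypothetical input layout).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    # Chance agreement from the overall category proportions.
    expected = sum((c / total_ratings) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used, so agreement is trivial
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
toy_ratings = [
    ["plumbing", "plumbing", "plumbing"],
    ["woodworking", "woodworking", "plumbing"],
    ["home repair", "home repair", "home repair"],
    ["other", "plumbing", "other"],
]
toy_kappa = fleiss_kappa(toy_ratings)
print(round(toy_kappa, 3))
```

Applied to the dataset, each inner list would hold the three annotators' labels for one video; rows where any of the three columns is empty would need to be filtered out first.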
Directory structure:

DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv     # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv    # Definitions of each column/field
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was annotated manually, and the transcripts were manually reviewed.
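To show how the two derived fields described above could be recomputed from the main CSV, here is a small standard-library sketch. It uses the documented formulas (Engagement_Ratio = Like_Count / (View_Count + 1); Has_Comments is true at 20 or more total words). The helper names, the in-memory sample rows, and the assumption that each entry of the Comments JSON array is a plain string are all illustrative guesses; the data dictionary in the repository is the authority on the actual layout.

```python
import csv
import io
import json


def engagement_ratio(like_count, view_count):
    """Documented metric; the +1 avoids division by zero for unviewed videos."""
    return like_count / (view_count + 1)


def has_comments(comments_json, min_words=20):
    """True if the comments hold at least `min_words` words in total.

    Assumes the field is a JSON array of strings (an assumption, not confirmed).
    """
    comments = json.loads(comments_json) if comments_json else []
    total_words = sum(len(str(comment).split()) for comment in comments)
    return total_words >= min_words


def enrich(rows):
    """Yield copies of the rows with the two derived fields recomputed."""
    for row in rows:
        row = dict(row)
        row["Engagement_Ratio"] = engagement_ratio(
            int(row["Like_Count"]), int(row["View_Count"])
        )
        row["Has_Comments"] = has_comments(row.get("Comments", ""))
        yield row


# Invented sample rows standing in for data/video_metadata.csv.
SAMPLE_CSV = '''Video_ID,View_Count,Like_Count,Comments
vid_a,999,50,"[""Great fix, thanks""]"
vid_b,0,0,
'''

sample_rows = list(enrich(csv.DictReader(io.StringIO(SAMPLE_CSV))))
for row in sample_rows:
    print(row["Video_ID"], row["Engagement_Ratio"], row["Has_Comments"])
```

For the real file, the same `enrich` generator could be fed from `csv.DictReader(open("data/video_metadata.csv", newline="", encoding="utf-8"))`; the UTF-8 encoding is an assumption.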

Created:
2025-07-10