
DIY Repair Videos: A Multimodal YouTube Dataset for Instructional Content Analysis

Indexed from NIAID Data Ecosystem on 2026-05-02
Download link:
https://data.mendeley.com/datasets/8x3xkkymzx
Resource description:
Overview: This dataset contains 6,015 YouTube DIY-repair tutorial videos, each enriched with structured metadata, transcripts, viewer comments, channel details, and a rigorous, multi-round manual annotation of instructional content.

Key Components:

Metadata & Engagement
  Fields: Video_ID, Title, Description, Duration (ISO 8601 + seconds), View_Count, Like_Count, Comment_Count, Published_At, Thumbnail_URL
  Metric: Engagement_Ratio = Like_Count / (View_Count + 1)

Transcripts
  Source: YouTube auto-captions (empty if unavailable)
  Fields: Transcript (raw text)
  Manual Rounds: TR_A1, TR_A2, TR_A3: three independent transcript reviews (correcting major errors, marking non-verbal segments)
  TR_Final: consolidated transcript after consensus

DIY Category Annotation
  Manual Rounds: DIY_A1, DIY_A2, DIY_A3: three independent category assignments using the annotation guide
  DIY_Final: consensus category after adjudication
  Coverage: 16 DIY sub-domains (e.g., "home repair," "plumbing," "woodworking," "other")
  Reliability: Inter-annotator agreement (Fleiss's κ = 0.76)

Comments
  Fields: Comments (JSON array of up to 50 top-level comments), Has_Comments (true if ≥ 20 total words)

Channel Context
  Fields: Channel_ID, Channel_Title, Channel_Thumbnail_URL

Annotation Methodology:

1. Stratified Subset Selection: A subset of 180 videos was sampled to represent all DIY categories proportionally.
2. Annotation Guide: A concise manual defined each DIY category and outlined transcription conventions.
3. Independent Annotations: Three team members performed Rounds 1–3 (DIY_A1–3 and TR_A1–3) without access to others' labels.
4. Consensus Adjudication: For each video, a fourth pass produced DIY_Final and TR_Final, the agreed-upon labels and corrected transcript.
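The derived fields and the reliability statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset authors' code. The function names are hypothetical. The word count behind Has_Comments is assumed to be whitespace-separated tokens summed over all comments, and each element of the Comments array is assumed to be a plain string. The Fleiss's κ function follows the standard textbook formula and assumes every item was labeled by the same number of annotators.

```python
import json
from collections import Counter


def engagement_ratio(like_count, view_count):
    """Engagement_Ratio = Like_Count / (View_Count + 1), as stated in the description."""
    return like_count / (view_count + 1)


def has_comments(comments_json, min_words=20):
    """Return True if the comments hold at least `min_words` words in total.

    Assumption: words are whitespace-separated tokens, and the JSON array
    contains plain strings (the real file may use a richer structure).
    """
    comments = json.loads(comments_json) if comments_json else []
    total_words = sum(len(str(comment).split()) for comment in comments)
    return total_words >= min_words


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of category labels.

    Every item must carry the same number of labels (one per annotator),
    e.g. the DIY_A1, DIY_A2, DIY_A3 values of one video.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        if len(labels) != n_raters:
            raise ValueError("every item needs the same number of ratings")
        counts = Counter(labels)
        category_totals.update(counts)
        # Per-item agreement: share of annotator pairs that chose the same label.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed = agreement_sum / n_items
    # Chance agreement from the overall category proportions.
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        # Only one category was ever used; kappa is undefined, so this
        # sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only; these are not taken from the dataset.
toy_ratings = [
    ["plumbing", "plumbing", "plumbing"],
    ["woodworking", "woodworking", "plumbing"],
    ["other", "other", "other"],
]
print(round(fleiss_kappa(toy_ratings), 4))
```

The "+ 1" in the denominator of Engagement_Ratio keeps the ratio defined for videos with zero views.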
DIY-Repair-Youtube-Dataset/
│
├── data/
│   ├── video_metadata.csv     # Main dataset file (6,015 rows × 19 columns)
│   └── data_dictionary.csv    # Definitions of each column/field
│
├── CITATION.cff
├── LICENSE
├── README.md
└── requirements.txt

Note: The dataset was annotated manually, and the transcripts were manually reviewed.
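As a rough sketch of how the main file might be read, the snippet below tallies the consensus categories using only the standard library. It assumes the CSV header uses the field name DIY_Final exactly as written in the description, and that rows without a consensus label leave that cell empty. Neither assumption is confirmed by the listing, and the function name is hypothetical. The demo uses a tiny in-memory sample rather than real data.

```python
import csv
import io
from collections import Counter


def count_final_categories(csv_file):
    """Count non-empty DIY_Final labels in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # Assumption: rows without a consensus label have an empty cell.
        label = (row.get("DIY_Final") or "").strip()
        if label:
            counts[label] += 1
    return counts


# In-memory stand-in for data/video_metadata.csv (illustrative rows only).
sample = io.StringIO(
    "Video_ID,DIY_Final\n"
    "v1,plumbing\n"
    "v2,\n"
    "v3,plumbing\n"
    "v4,woodworking\n"
)
print(count_final_categories(sample).most_common())
```

With the downloaded dataset, the same function could be called on `open("data/video_metadata.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is also an assumption.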
Created: 2025-07-10