Refer-YouTube-VOS

Name: Refer-YouTube-VOS
Creator: OpenDataLab
Published: 2026-05-17 09:30:22
License: No description available

OpenDataLab | Updated: 2026-05-17 | Indexed: 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/Refer-YouTube-VOS

Resource description:

Prior works [6, 10] have constructed referring segmentation datasets for videos. Gavrilyuk et al. [6] extended the A2D [33] and J-HMDB [9] datasets with natural sentences; these datasets focus on describing the "actors" and "actions" appearing in videos, so the instance annotations are limited to a few object categories corresponding to the dominant "actors" performing a salient "action". Khoreva et al. [10] built a dataset based on DAVIS [25], but its scale is barely sufficient to train an end-to-end model from scratch.

YouTube-VOS has 4,519 high-resolution videos covering 94 common object categories. Each video is about 3 to 6 seconds long and has pixel-level instance segmentation annotations at every 5th frame of the 30-fps footage. We used Amazon Mechanical Turk to annotate referring expressions. To ensure annotation quality, we selected approximately 50 Turkers after a validation test. Each Turker was shown a pair of videos, the original video and a mask-overlaid video in which the target object was highlighted, and was asked to write a discriminative sentence of at most 20 words that accurately describes the target object. We collected two types of annotations, which describe the highlighted object (1) based on the entire video (full-video expressions) and (2) using only the first frame of the video (first-frame expressions). After the initial annotation, we validated and cleaned all annotations, dropping an object if it could not be localized using the language expressions alone. The statistics of the two annotation types after validation are given below.

Full-video expressions: YouTube-VOS has 6,459 and 1,063 unique objects in its training and validation splits, respectively. In the training split, we cover 6,388 unique objects in 3,471 videos (6,388/6,459 = 98.9%) with 12,913 expressions; in the validation split, we cover all 1,063 unique objects in 507 videos (1,063/1,063 = 100%) with 2,096 expressions. On average, each video has 3.8 language expressions and each expression has 10.0 words.

First-frame expressions: In the training split, we cover 6,006 unique objects in 3,412 videos (6,006/6,459 = 93.0%) with 10,897 expressions; in the validation split, we cover 1,030 unique objects in 507 videos (1,030/1,063 = 96.9%) with 1,993 expressions. Fewer objects are annotated than with full-video expressions because using only the first frame makes the annotations more ambiguous and inconsistent, so more of them were discarded during validation. On average, each video has 3.2 language expressions and each expression has 7.5 words.
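The coverage percentages and the annotation density quoted above follow from simple arithmetic, which the minimal Python sketch below reproduces. The object counts are copied from the description; the function and variable names (`coverage_percent`, `annotated_frames`, `COVERAGE`) are illustrative assumptions and are not part of any official dataset toolkit. The frame count is only an approximation, since the exact sampling offsets are not stated in the description.

```python
# Counts quoted in the description above: (covered objects, total objects).
# The dictionary layout is a hypothetical convenience, not a dataset file format.
COVERAGE = {
    ("full-video", "train"): (6388, 6459),
    ("full-video", "val"): (1063, 1063),
    ("first-frame", "train"): (6006, 6459),
    ("first-frame", "val"): (1030, 1063),
}


def coverage_percent(covered: int, total: int) -> float:
    """Share of YouTube-VOS objects that kept an expression, in percent."""
    return round(100.0 * covered / total, 1)


def annotated_frames(duration_s: float, fps: int = 30, stride: int = 5) -> int:
    """Approximate number of annotated frames when every `stride`-th frame is labeled."""
    return int(duration_s * fps) // stride


if __name__ == "__main__":
    for (kind, split), (covered, total) in COVERAGE.items():
        print(f"{kind:11s} {split:5s} {coverage_percent(covered, total):5.1f}%")
    # A 3-second and a 6-second clip at 30 fps, annotated every 5th frame.
    print(annotated_frames(3), annotated_frames(6))
```

Under these assumptions, a 3 to 6 second clip carries roughly 18 to 36 annotated frames, that is, about 6 annotated frames per second of video.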

Provided by:

OpenDataLab

Created:

2022-08-16

Collected summary

Dataset introduction

Background and challenges

Background overview

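Refer-YouTube-VOS is a large-scale video object segmentation dataset focused on text-based referring expression segmentation, that is, localizing and segmenting a specific object in a video from a natural-language description. The dataset is built on YouTube-VOS and contains two annotation types: full-video expressions and first-frame expressions. It covers 6,388 unique objects with 12,913 expressions in the training split and 1,063 unique objects with 2,096 expressions in the validation split, with an average of 3.2 to 3.8 language expressions per video. The dataset was released in 2020 by Adobe Research and Seoul National University under the CC BY 4.0 license, and is suited to research on video understanding and language-vision interaction.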

The content above was collected and summarized by 遇见数据集.
