HowTo100M
OpenDataLab | Updated 2026-05-17 | Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/HowTo100M
Resource description:
The purpose of this paper is to develop a temporal alignment network that takes long-form video sequences and their associated textual sentences as inputs, aiming to achieve two objectives: (1) determine whether a given sentence aligns with the corresponding video; (2) if the sentence is alignable, identify its precise temporal alignment. The core challenge lies in training such a network on large-scale datasets such as HowTo100M, where the associated textual sentences contain notable noise and are only weakly aligned even when they are relevant to the video.
Besides proposing the alignment network, we make four contributions: (i) We describe a novel joint training method that enables denoising and training on raw instructional videos without relying on manual annotations, despite the presence of substantial noise; (ii) To establish a benchmark for alignment performance, we manually curated a 10-hour subset of HowTo100M, which includes 80 videos paired with sparse temporal descriptions. Our proposed model, trained on HowTo100M, outperforms strong baselines including CLIP and MIL-NCE by a significant margin on this alignment dataset; (iii) We apply the trained model to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2 and weakly-supervised video action segmentation on the Breakfast Actions dataset; (iv) We conduct end-to-end fine-tuning of the backbone model using automatically aligned annotations from HowTo100M, and obtain improved performance on downstream action recognition tasks.
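The two objectives above (deciding whether a sentence is alignable and, if so, where it aligns) can be illustrated with a minimal sketch. This is not the network described in the abstract: it assumes sentences and video clips have already been embedded into a shared vector space, and it swaps in a plain cosine-similarity argmax with a fixed threshold. The names `cosine`, `align_sentences`, and `threshold`, and the toy embeddings, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(sentence_embs, clip_embs, threshold=0.5):
    """For each sentence, return (is_alignable, best_clip_index or None).

    Assumption: a sentence counts as alignable when its best clip
    similarity reaches `threshold`; the value 0.5 is arbitrary.
    """
    results = []
    for s in sentence_embs:
        sims = [cosine(s, c) for c in clip_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            results.append((True, best))
        else:
            results.append((False, None))
    return results


# Toy data: three clip embeddings and two sentence embeddings (made up).
clips = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentences = [[0.9, 0.1], [-1.0, -1.0]]
print(align_sentences(sentences, clips))  # [(True, 0), (False, None)]
```

The first toy sentence is closest to clip 0 and clears the threshold, while the second matches no clip well and is reported as not alignable, which mirrors the noisy, weakly aligned narration the abstract describes.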
Provider:
OpenDataLab
Created:
2023-02-13
Collected summary
Dataset introduction

Background and challenges
Background overview
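HowTo100M is a large-scale multimodal dataset of instructional videos paired with narration text that is noisy and only weakly aligned to the video. In this entry it is used mainly to train a temporal alignment network that processes long video sequences and their associated sentences. The alignment work described here was released in 2022 by Shanghai Jiao Tong University and the Visual Geometry Group at the University of Oxford. It supports video understanding tasks such as text-video retrieval and weakly-supervised video action segmentation, and reports state-of-the-art performance on downstream tasks.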
The content above was collected and summarized by 遇见数据集.



