HowTo100M

Name: HowTo100M
Creator: maas
Published: 2025-12-05 16:15:52
License: No description available

ModelScope community. Updated 2025-12-05. Indexed 2024-08-31.

Download link:

https://modelscope.cn/datasets/OmniData/HowTo100M

Resource overview:

---
displayName: HowTo100M
labelTypes: []
license:
- MIT
mediaTypes: []
paperUrl: https://openaccess.thecvf.com/content/CVPR2022/papers/Han_Temporal_Alignment_Networks_for_Long-Term_Video_CVPR_2022_paper.pdf
publishDate: "2022"
publishUrl: https://www.di.ens.fr/willow/research/howto100m/
publisher:
- Shanghai Jiao Tong University
- Visual Geometry Group, University of Oxford
tags:
- Food
- Human
taskTypes: []
---

# Dataset Introduction

## Overview

The purpose of this work is to develop a temporal alignment network that takes long-term video sequences and their associated text sentences as input, with two core objectives: (1) to determine whether a given text sentence aligns with a corresponding video; (2) if an alignment is feasible, to identify the specific alignment between the sentence and the video. The primary challenge lies in training such networks on large-scale datasets such as HowTo100M, where the accompanying text sentences suffer from substantial noise and are only weakly aligned when relevant to the video content.

In addition to proposing the temporal alignment network, we make four key contributions:

- (i) We present a novel joint training approach that enables denoising and training on raw instructional videos without relying on manual annotations, even with significant noise present in the dataset.
- (ii) To establish a benchmark for alignment performance evaluation, we manually curated a 10-hour subset of HowTo100M, consisting of 80 videos in total, each paired with sparse temporal descriptions. Our proposed model, trained on the full HowTo100M dataset, outperforms strong baseline models including CLIP and MIL-NCE by a significant margin on this curated alignment dataset.
- (iii) We apply the trained model to multiple downstream video understanding tasks and achieve state-of-the-art results, including text-video retrieval on YouCook2 and weakly-supervised video action segmentation on the Breakfast Actions dataset.
- (iv) We use the automatically aligned annotations from HowTo100M to perform end-to-end fine-tuning of the backbone model, leading to improved performance on downstream action recognition tasks.

## Download Dataset

:modelscope-code[]{type="git"}
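The download directive above renders as a git snippet on the original page, but the snippet itself is not included in this listing. As a rough sketch only, the helper below derives a `git clone` command from the dataset page URL given earlier. It assumes that the repository's git URL is the page URL with a `.git` suffix, which this listing does not confirm; `build_clone_command` and `clone` are hypothetical names introduced for illustration.

```python
import subprocess
from urllib.parse import urlparse

# Dataset page URL as given in this listing.
DATASET_PAGE = "https://modelscope.cn/datasets/OmniData/HowTo100M"


def build_clone_command(page_url, target_dir=None):
    """Build a `git clone` argument list for a dataset page URL.

    Assumption (not confirmed by the listing): the git remote is the
    page URL with ".git" appended.
    """
    parsed = urlparse(page_url)
    parts = [p for p in parsed.path.split("/") if p]
    # Expect a path of the form /datasets/<namespace>/<name>.
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"unexpected dataset URL: {page_url}")
    namespace, name = parts[1], parts[2]
    repo_url = f"{parsed.scheme}://{parsed.netloc}/datasets/{namespace}/{name}.git"
    return ["git", "clone", repo_url, target_dir or name]


def clone(command):
    """Run the clone; needs git installed and network access."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the command; call clone(cmd) to actually download.
    cmd = build_clone_command(DATASET_PAGE)
    print(" ".join(cmd))
```

Printing the command first makes it easy to check the derived URL against the dataset page before starting a potentially large download.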

Provider:

maas

Created:

2024-07-02

Collected summary

Dataset introduction