
HuggingFaceM4/howto100m

Hugging Face · Updated 2022-05-18 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/HuggingFaceM4/howto100m
Resource description:
# Dataset Card for HowTo100M

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [HowTo100M homepage](https://www.di.ens.fr/willow/research/howto100m/)
- **Repository:** [Github repo](https://github.com/antoine77340/howto100m)
- **Paper:** [HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips](https://github.com/antoine77340/howto100m)
- **Point of Contact:** Antoine Miech

### Dataset Summary

HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. HowTo100M features a total of:

- 136M video clips with captions sourced from 1.2M YouTube videos (15 years of video)
- 23k activities from domains such as cooking, hand crafting, personal care, gardening or fitness

Each video is associated with a narration available as subtitles automatically downloaded from YouTube.

### Dataset Preprocessing

This dataset does not contain the videos by default. You need to follow the instructions [here](https://www.di.ens.fr/willow/research/howto100m/) from the dataset creators and fill out a form to get a user ID and a password for downloading the videos from their server.

Once you have these two, you can fetch the videos by mapping the following function to the `path` column:

```python
import requests

USER_ID = "THE_USER_ID"
PASSWORD = "THE_PASSWORD"

def fetch_video(url):
    # Download one video with HTTP basic authentication and return its raw bytes.
    response = requests.get(url, auth=requests.auth.HTTPBasicAuth(USER_ID, PASSWORD))
    # Fail loudly on bad credentials instead of returning an error page as video data.
    response.raise_for_status()
    return response.content
```

### Supported Tasks and Leaderboards

- `video-to-text`: This dataset can be used to train a model for Video Captioning, where the goal is to predict a caption given the video.

### Languages

All captions are in English. They come either from available YouTube subtitles (manually written) or from the output of an Automatic Speech Recognition system.

## Dataset Structure

### Data Instances

Each instance in HowTo100M represents a single video, with two lists holding the start and end times of its segments and one caption per segment.

```
{
  'video_id': 'AEytW9ScgCw',
  'path': 'http://howto100m.inria.fr/dataset/AEytW9ScgCw.mp4',
  'category_1': 'Cars & Other Vehicles',
  'category_2': 'Motorcycles',
  'rank': 108,
  'task_description': 'Paint a Motorcycle Tank',
  'starts': [6.019999980926514, 9.449999809265137, 12.539999961853027, 15.449999809265137, 19.5, 23.510000228881836, 24.860000610351562, 27.420000076293945, 29.510000228881836, 33.119998931884766, 34.77000045776367, 40.68000030517578, 42.779998779296875, 45.97999954223633, 48.22999954223633, 51.93000030517578, 101.27999877929688, 112.80999755859375, 120.93000030517578, 123.79000091552734, 127.38999938964844, 134.86000061035156, 142.25999450683594, 145.47999572753906, 148.22000122070312, 150.0399932861328, 152.9499969482422, 154.97000122070312, 158.6300048828125, 159.75999450683594, 164.97999572753906, 166.7899932861328, 170.38999938964844, 174.91000366210938, 181.89999389648438, 184.33999633789062, 188.9499969482422, 194.38999938964844, 197.0, 201.11000061035156, 202.07000732421875, 247.32000732421875, 254.0399932861328, 256.8500061035156, 260.20001220703125, 271.4599914550781, 272.0, 276.55999755859375, 277.3399963378906, 281.6600036621094, 284.05999755859375, 287.5299987792969, 289.5799865722656, 291.5299987792969, 293.8699951171875, 296.0899963378906, 302.80999755859375, 309.0799865722656, 313.5199890136719, 317.17999267578125, 319.7200012207031, 323.0299987792969, 327.0799865722656, 329.1199951171875, 331.7799987792969, 335.3800048828125, 337.489990234375, 340.42999267578125, 345.1300048828125, 348.5899963378906, 351.1600036621094, 354.75, 357.0, 358.739990234375, 360.239990234375, 364.739990234375, 365.9100036621094, 367.5, 369.8399963378906, 371.2799987792969, 373.260009765625, 395.7699890136719, 401.9800109863281, 404.7799987792969, 406.9100036621094, 410.1499938964844, 415.05999755859375, 419.05999755859375, 427.5199890136719, 431.69000244140625, 433.42999267578125],
  'ends': [12.539999961853027, 15.449999809265137, 19.5, 23.510000228881836, 24.860000610351562, 27.420000076293945, 29.510000228881836, 33.119998931884766, 34.77000045776367, 36.93000030517578, 40.68000030517578, 45.97999954223633, 48.22999954223633, 51.93000030517578, 56.529998779296875, 56.529998779296875, 105.38999938964844, 119.25, 127.38999938964844, 134.86000061035156, 141.33999633789062, 141.33999633789062, 148.22000122070312, 150.0399932861328, 152.9499969482422, 154.97000122070312, 158.6300048828125, 159.75999450683594, 164.97999572753906, 166.7899932861328, 170.38999938964844, 174.91000366210938, 181.17999267578125, 181.17999267578125, 188.9499969482422, 194.38999938964844, 197.0, 201.11000061035156, 202.07000732421875, 204.0800018310547, 218.30999755859375, 256.8500061035156, 260.20001220703125, 264.2799987792969, 271.4599914550781, 276.55999755859375, 277.3399963378906, 281.6600036621094, 284.05999755859375, 287.5299987792969, 289.5799865722656, 291.5299987792969, 293.8699951171875, 296.0899963378906, 302.80999755859375, 309.0799865722656, 313.5199890136719, 317.17999267578125, 319.7200012207031, 323.0299987792969, 327.0799865722656, 329.1199951171875, 331.7799987792969, 335.3800048828125, 337.489990234375, 340.42999267578125, 345.1300048828125, 348.5899963378906, 351.1600036621094, 354.75, 357.0, 358.739990234375, 360.239990234375, 364.739990234375, 365.9100036621094, 367.5, 369.8399963378906, 371.2799987792969, 373.260009765625, 378.2099914550781, 379.4200134277344, 404.7799987792969, 406.9100036621094, 410.1499938964844, 415.05999755859375, 419.05999755859375, 427.5199890136719, 431.69000244140625, 433.42999267578125, 436.1300048828125, 438.8299865722656],
  'captions': ['melt alright', 'watching', 'dad stripping paint', 'gas bike frame 1979', 'yamaha xs 1100 got', 'engine rebuilt', 'stripping paint', 'priming bike', 'frame lot time ops', 'stuff bunch information', 'questions', 'stuff stuff bought', 'description use links', 'questions comment', 'brush stuff', 'literally bubbles middle', 'bring into', "here's got stripper", 'wash using', 'stripper removes chemical things', 'rust primer', 'stripping bike use', 'showed', 'mason jar', 'painted melted', 'brush pain', 'get hands burn', 'bad gloves', 'burn gloves', 'burn', 'careful using stuff', 'nasty stuff instead', 'making mess paint brush', 'use spray version', 'leo watches lot stuff', 'nasty paint', 'cbg said rust lot', 'hard rush mean', 'able get time ups', 'time', 'applause', 'use', 'says 30 minutes', 'soak get', 'corners type brush get', 'works', 'coat', 'stuff', 'rust borrow sodium', 'stuff awesome', 'spent think 6', 'rust used used little ah', "use he's little brush", 'brush', 'doing 15 20', 'minutes mean ate rest away', 'majority', 'rust alright', "primed pretty didn't", 'way hang set', 'board use', 'self etching primer', 'sides pretty step', "haven't leaned", 'get', 'touch areas', '400 grit sandpaper', 'rust oleum says use sand', 'little', 'looking good', 'little holes taped little', 'threads took screw', 'went into hole', 'screwed into lot paint', 'wet bed damp', 'screwed', 'clump screwed', 'way little', 'paint come threads', 'way flip threads clean', "here's hyperlapse spray pit", "alright here's frame primed", 'currently flash', 'little imperfection definitely', 'big mistake', 'think', "didn't go direction bar", 'primed 24', 'hours ready sanded alright', 'watching forget', 'subscribe videos']
}
```

### Data Fields

- `video_id`: YouTube video ID
- `path`: URL for downloading the video from the authors' server once access has been granted
- `category_1`: Highest-level task category from WikiHow
- `category_2`: Second-highest-level task category from WikiHow
- `rank`: YouTube search result rank of the video when querying the task
- `starts`: List of the start timestamps of each segment
- `ends`: List of the end timestamps of each segment
- `captions`: List of all the captions (one per segment)

### Data Splits

All the data is contained in the training split, which has 1M instances.
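
Because `starts`, `ends` and `captions` are parallel lists, a common first step is to zip them into per-segment records. The following is a minimal sketch of that idea; the `Segment` type and `to_segments` helper are hypothetical names introduced here, and the sample values are the first three segments of the instance above, rounded for readability.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One captioned segment of a video (hypothetical helper type)."""
    start: float
    end: float
    caption: str


def to_segments(instance: dict) -> list[Segment]:
    """Zip the parallel `starts`, `ends` and `captions` lists into segments."""
    starts = instance["starts"]
    ends = instance["ends"]
    captions = instance["captions"]
    if not (len(starts) == len(ends) == len(captions)):
        raise ValueError("starts, ends and captions must have the same length")
    return [Segment(s, e, c) for s, e, c in zip(starts, ends, captions)]


# First three segments of the example instance, with rounded timestamps.
example = {
    "video_id": "AEytW9ScgCw",
    "starts": [6.02, 9.45, 12.54],
    "ends": [12.54, 15.45, 19.5],
    "captions": ["melt alright", "watching", "dad stripping paint"],
}
segments = to_segments(example)
```

Note that consecutive segments can overlap: in the instance above the first segment ends at about 12.54 s while the second already starts at about 9.45 s, so downstream code should not assume disjoint intervals.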

## Dataset Creation

### Curation Rationale

From the paper:

> we first start by acquiring a large list of activities using WikiHow – an online resource that contains 120,000 articles on How to ... for a variety of domains ranging from cooking to human relationships structured in a hierarchy. We are primarily interested in “visual tasks” that involve some interaction with the physical world (e.g. Making peanut butter, Pruning a tree) as compared to others that are more abstract (e.g. Ending a toxic relationship, Choosing a gift). To obtain predominantly visual tasks, we limit them to one of 12 categories (listed in Table 2). We exclude categories such as Relationships and Finance and Business, that may be more abstract. We further refine the set of tasks, by filtering them in a semi-automatic way. In particular, we restrict the primary verb to physical actions, such as make, build and change, and discard non-physical verbs, such as be, accept and feel. This procedure yields 23,611 visual tasks in total.

> We search for YouTube videos related to the task by forming a query with how to preceding the task name (e.g. how to paint furniture). We choose videos that have English subtitles either uploaded manually, generated automatically by YouTube ASR, or generated automatically after translation from a different language by YouTube API. We improve the quality and consistency of the dataset, by adopting the following criteria. We restrict to the top 200 search results, as the latter ones may not be related to the query task. Videos with less than 100 views are removed as they are often of poor quality or are amateurish. We also ignore videos that have less than 100 words as that may be insufficient text to learn a good video-language embedding. Finally, we remove videos longer than 2,000 seconds. As some videos may appear in several tasks, we deduplicate videos based on YouTube IDs. However, note that the dataset may still contain duplicates if a video was uploaded several times or edited and re-uploaded. Nevertheless, this is not a concern at our scale.

### Source Data

The source videos come from YouTube.

#### Initial Data Collection and Normalization

#### Who are the source language producers?

YouTube uploaders.

### Annotations

#### Annotation process

Subtitles are either generated automatically or written manually. Note that the narrated captions have been processed: the authors removed a significant number of stop words that are not relevant to learning the text-video joint embedding. The list of stop words can be found here: https://github.com/antoine77340/howto100m/blob/master/stop_words.py.

You can find the unprocessed caption file (i.e. with stop words) [here](https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/raw_caption.zip).

#### Who are the annotators?

YouTube uploaders or machine-generated outputs.

### Personal and Sensitive Information

## Considerations for Using the Data

### Social Impact of Dataset

### Discussion of Biases

### Other Known Limitations

## Additional Information

### Dataset Curators

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic

### Licensing Information

Not specified.

### Citation Information

```bibtex
@inproceedings{miech19howto100m,
  title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips},
  author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef},
  booktitle={ICCV},
  year={2019},
}
```

### Contributions

Thanks to [@VictorSanh](https://github.com/VictorSanh) for adding this dataset.

Provider:
HuggingFaceM4
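Summary of Original Information

Dataset overview

Dataset name: HowTo100M

Dataset description: HowTo100M is a large-scale video dataset focused on instructional videos, in which content creators explain the visual content shown on screen. It contains 136M video clips from 1.2M YouTube videos, covering 23k activities in domains such as cooking, hand crafting, personal care, gardening or fitness.

Dataset structure:

  • Data instances: Each instance represents one video and contains the video ID, download path, categories, rank, task description, and the start and end times of the video segments with their corresponding captions.
  • Data fields:
    • video_id: YouTube video ID
    • path: Video download path
    • category_1: Highest-level task category
    • category_2: Second-highest-level task category
    • rank: Rank of the video in the YouTube search results
    • starts: List of segment start times
    • ends: List of segment end times
    • captions: List of segment captions
  • Data splits: All data is contained in the training set, which has 1M instances.

Dataset creation:

  • Source data: The data comes from YouTube videos.
  • Annotation process: Subtitles are either written manually by YouTube uploaders or generated by YouTube's automatic speech recognition system.

Considerations for use:

  • Social impact: Not specified.
  • Discussion of biases: Not specified.
  • Other known limitations: Not specified.

Additional information:

  • Dataset curators: Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
  • Licensing information: Not specified.
  • Citation information: @inproceedings{miech19howto100m, title={How{T}o100{M}: {L}earning a {T}ext-{V}ideo {E}mbedding by {W}atching {H}undred {M}illion {N}arrated {V}ideo {C}lips}, author={Miech, Antoine and Zhukov, Dimitri and Alayrac, Jean-Baptiste and Tapaswi, Makarand and Laptev, Ivan and Sivic, Josef}, booktitle={ICCV}, year={2019}, }

Dataset characteristics

  • A large-scale instructional video dataset, suitable for tasks such as video captioning.
  • Includes video metadata such as video ID, categories and rank.
  • Captions are written manually or generated automatically, and have been processed to remove irrelevant stop words.
  • No license is specified for the dataset, so copyright should be considered before use.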
Collected Summary
Dataset introduction
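Construction method
The construction of HowTo100M starts from a large list of activities collected from WikiHow, covering domains that range from cooking to human relationships. Restricting this list to visual tasks that involve physical interaction, within a fixed set of categories, yields 23,611 visual tasks. YouTube is then searched for videos related to each task, and only videos with English subtitles are kept as source data; the subtitles may be uploaded manually, generated by YouTube's automatic speech recognition, or produced through the translation API. To maintain quality, only the top 200 search results are considered, and videos with fewer than 100 views or longer than 2,000 seconds are removed. The result is a training set of 1M instances.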
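
The selection criteria can be written down as a simple predicate. The sketch below is an illustration only, not the authors' pipeline: the `Candidate` record and its field names are hypothetical, the rank is assumed to be 1-based, and the 100-word threshold and deduplication by YouTube ID are taken from the paper passage quoted in the dataset card.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one YouTube search result."""
    video_id: str
    rank: int          # position in the search results (assumed 1-based)
    views: int
    n_words: int       # number of words in the subtitles
    duration_s: float  # video length in seconds


def passes_criteria(c: Candidate) -> bool:
    """Thresholds as stated in the quoted paper passage."""
    return (
        c.rank <= 200             # only the top 200 search results
        and c.views >= 100        # drop videos with fewer than 100 views
        and c.n_words >= 100      # drop videos with fewer than 100 words
        and c.duration_s <= 2000  # drop videos longer than 2,000 seconds
    )


def filter_and_dedupe(candidates: list[Candidate]) -> list[Candidate]:
    """Keep the first passing occurrence of each YouTube ID."""
    seen: set[str] = set()
    kept: list[Candidate] = []
    for c in candidates:
        if passes_criteria(c) and c.video_id not in seen:
            seen.add(c.video_id)
            kept.append(c)
    return kept
```

Deduplicating on `video_id` only catches exact ID matches; as the card notes, re-uploads of the same video under a different ID would still slip through.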
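Usage
To use HowTo100M, users first need to obtain permission to download the videos from the dataset provider's server. Once they have a user ID and password, they can download each video by applying a fetch function to its video path. The dataset supports tasks such as video captioning and text-video embedding learning. All data ships in a single training split, so any evaluation subset has to be held out by the user.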
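
As a rough illustration of the download step, the sketch below streams one video to disk with HTTP basic authentication. It deliberately swaps in the standard library's `urllib` for the `requests`-based function shown in the dataset card, so that it has no third-party dependencies; the helper names, the `videos` output directory and the 60-second timeout are assumptions made for this example.

```python
import base64
import shutil
import urllib.request
from pathlib import Path


def basic_auth_header(user_id: str, password: str) -> str:
    """Build the value of an HTTP Basic `Authorization` header."""
    token = base64.b64encode(f"{user_id}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def download_video(url: str, user_id: str, password: str,
                   out_dir: str = "videos") -> Path:
    """Stream the file at `url` into `out_dir` and return the local path."""
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(user_id, password)}
    )
    # Name the local file after the last URL component, e.g. "<video_id>.mp4".
    target = Path(out_dir) / url.rsplit("/", 1)[-1]
    target.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx (e.g. bad credentials).
    with urllib.request.urlopen(request, timeout=60) as response, \
            open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target
```

Streaming to disk avoids holding a whole video in memory, which matters when the function is applied to many `path` values in a row.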
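Background and Challenges
Background overview
HowTo100M, created in 2019 by Antoine Miech and colleagues, is a large-scale dataset of narrated videos. It focuses on instructional videos made with an explicit teaching intent and covers 23k activities in domains such as cooking, hand crafting, personal care, gardening or fitness. Every video has associated captions, which are either manually written YouTube subtitles or the output of an automatic speech recognition system. The dataset was built to learn a joint text-video embedding by watching a hundred million narrated video clips, and it supports tasks such as video captioning.
Current challenges
The research team faced several challenges during construction. The first was retrieving task-relevant videos from YouTube and filtering them for quality. The second was preprocessing the captions, including removing stop words, so that the learned joint text-video embedding remains effective. In addition, possible duplicate videos, the handling of personal and sensitive information, and potential biases are all issues to consider when using the dataset.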
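
To make the stop-word preprocessing concrete, here is a minimal sketch of how such filtering can work. The stop-word set below is a tiny illustrative stand-in, not the authors' list (which lives in `stop_words.py` in their repository), and whitespace tokenization with lowercasing is an assumption of this example rather than a description of their pipeline.

```python
# Tiny illustrative stop-word set; the real list is in the authors' stop_words.py.
ILLUSTRATIVE_STOP_WORDS = frozenset({"a", "an", "the", "in", "of", "to", "is", "it"})


def strip_stop_words(caption: str,
                     stop_words: frozenset = ILLUSTRATIVE_STOP_WORDS) -> str:
    """Lowercase the caption, split on whitespace and drop stop words."""
    kept = [word for word in caption.lower().split() if word not in stop_words]
    return " ".join(kept)


# A made-up raw caption, chosen to resemble the processed captions in the card.
processed = strip_stop_words("Use the links in the description")
print(processed)  # use links description
```

This kind of filtering explains why the captions in the dataset read as keyword fragments rather than full sentences.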
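Common Scenarios
Classic use cases
The classic use of HowTo100M is building and training joint video-text embedding models. By pairing a large number of video clips with their captions, the dataset provides rich associations between vision and language, enabling research on tasks such as video captioning and video content understanding.
Academic problems addressed
The dataset addresses the research problem of how to effectively combine video content with text descriptions, and provides data for video understanding, generating text descriptions from video, and joint video-text learning. Its scale supplies ample training samples, which helps models generalize.
Practical applications
In practice, HowTo100M can be used to develop educational software that helps users learn new skills from videos. It can also improve video retrieval in search engines, so that users find the instructional videos they need more quickly and accurately.
Recent Research on the Dataset
Latest research directions
Research on HowTo100M centers on learning joint text-video embeddings by watching a hundred million narrated video clips. Recent work focuses on using the association between the visual content of task videos and their narration to improve multimodal understanding and generation. Beyond the text-video correspondence itself, recent studies also explore the dataset's potential for visual question answering, video captioning and video content understanding.
The content above was collected and summarized by 遇见数据集.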