
OpenNLPLab/FAVDBench

Hugging Face · Updated 2023-12-06 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/OpenNLPLab/FAVDBench
Resource description:
---
license: apache-2.0
language:
- en
- zh
tags:
- FAVD
- FAVDBench
- Video Description
- Audio Description
- Audible Video Description
- Fine-grained Description
size_categories:
- 10K<n<100K
---

<div align="center">
<h1>
FAVDBench: Fine-grained Audible Video Description
</h1>
</div>

<p align="center">
🤗 <a href="https://huggingface.co/datasets/OpenNLPLab/FAVDBench" target="_blank">Hugging Face</a> •
🏠 <a href="https://github.com/OpenNLPLab/FAVDBench" target="_blank">GitHub</a> •
🤖 <a href="https://openxlab.org.cn/datasets/OpenNLPLab/FAVDBench" target="_blank">OpenDataLab</a> •
💬 <a href="https://forms.gle/5S3DWpBaV1UVczkf8" target="_blank">Apply Dataset</a>
</p>

[[`CVPR2023`]](https://openaccess.thecvf.com/content/CVPR2023/html/Shen_Fine-Grained_Audible_Video_Description_CVPR_2023_paper.html) [[`Project Page`]](http://www.avlbench.opennlplab.cn/papers/favd) [[`arXiv`]](https://arxiv.org/abs/2303.15616) [[`Demo`]](https://www.youtube.com/watch?v=iWJvTB-bTWk&ab_channel=OpenNLPLab) [[`BibTex`]](#citation) [[`Chinese Introduction`]](https://mp.weixin.qq.com/s/_M57ZuOHH0UdwB6i9osqOA)

- [Introduction](#introduction)
- [Files](#files)
- [MD5 checksum](#md5-checksum)
- [Updates](#updates)
- [License](#license)
- [Citation](#citation)

## Introduction

At CVPR2023, we introduced the task of Fine-grained Audible Video Description (FAVD). The task aims to produce detailed textual descriptions of audible videos, covering the appearance and spatial position of each object, the actions of moving objects, and the sounds in the video. We also contributed FAVDBench, the first fine-grained audible video description dataset, to the community. For each video segment, we provide a one-sentence video summary, 4-6 sentences describing the visual details, and 1-2 audio-related descriptions. All annotations are available in both Chinese and English.

## Files

* `meta`: metadata for raw videos
  * `train`, `val`, `test`: train, val, test split
  * `ytid`: YouTube ID
  * `start`: start time of the video segment, in seconds
  * `end`: end time of the video segment, in seconds
* `videos`, `audios`: raw video and audio segments
  * `train`: train split
  * `val`: validation split
  * `test`: test split
  * **📢📢📢 Please refer to [Apply Dataset](https://forms.gle/5S3DWpBaV1UVczkf8) to get raw video/audio data**
* `annotations_en.json`: annotated descriptions in English
  * `id`: unique data (video segment) id
  * `description`: audio-visual descriptions
* `annotations_zh.json`: annotated descriptions in Chinese
  * `id`: unique data (video segment) id
  * `cap`, `des`: audio-visual descriptions
  * `dcount`: count of descriptions
* `experiments`: experimental files to replicate the results outlined in the paper
  * **📢📢📢 Please refer to [GitHub Repo](https://github.com/OpenNLPLab/FAVDBench) to get related data**

## MD5 checksum

| file               | md5sum                           |
| :----------------: | :------------------------------: |
| `videos/train.zip` | 41ddad46ffac339cb0b65dffc02eda65 |
| `videos/val.zip`   | 35291ad23944d67212c6e47b4cc6d619 |
| `videos/test.zip`  | 07046d205837d2e3b1f65549fc1bc4d7 |
| `audios/train.zip` | 50cc83eebd84f85e9b86bbd2a7517f3f |
| `audios/val.zip`   | 73995c5d1fcef269cc90be8a8ef6d917 |
| `audios/test.zip`  | f72085feab6ca36060a0a073b31e8acc |

## Updates

**Latest Version: Jan 9, 2023. Public V0.1**

1. v0.1 <Jan 9, 2023>: initial publication

## License

Community use of the FAVDBench model and code must comply with the [Apache 2.0](https://github.com/OpenNLPLab/FAVDBench/blob/main/LICENSE) license. The FAVDBench model and code may be used commercially.

## Citation

If you use FAVD or FAVDBench in your research, please use the following BibTeX entry.

```
@InProceedings{Shen_2023_CVPR,
    author    = {Shen, Xuyang and Li, Dong and Zhou, Jinxing and Qin, Zhen and He, Bowen and Han, Xiaodong and Li, Aixuan and Dai, Yuchao and Kong, Lingpeng and Wang, Meng and Qiao, Yu and Zhong, Yiran},
    title     = {Fine-Grained Audible Video Description},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10585-10596}
}
```
Provided by:
OpenNLPLab
Summary of original information

FAVDBench: Fine-grained Audible Video Description

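Dataset overview

FAVDBench is a fine-grained audible video description dataset. It provides detailed textual descriptions of audible videos, covering the appearance and spatial position of each object, the actions of moving objects, and the sounds in the video. OpenNLPLab introduced the dataset at CVPR2023, with annotations in both Chinese and English.

Dataset contents

  • Metadata: metadata for the raw videos, split into training, validation, and test sets.

    • meta
      • train, val, test: data splits
      • ytid: YouTube video ID
      • start: start time of the video segment (seconds)
      • end: end time of the video segment (seconds)
  • Video and audio segments: raw video and audio segments, split into training, validation, and test sets.

    • videos, audios
      • train, val, test: data splits
  • Annotation files

    • annotations_en.json: annotated descriptions in English
      • id: unique data ID
      • description: audio-visual descriptions
    • annotations_zh.json: annotated descriptions in Chinese
      • id: unique data ID
      • cap, des: audio-visual descriptions
      • dcount: number of descriptions
  • Experiment files: files for reproducing the results reported in the paper.

    • experiments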
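
The field names listed above (`ytid`, `start`, `end`, `id`) suggest how a metadata record could be mapped back to its source video and joined with its annotation. The following is a minimal sketch under stated assumptions: the source does not specify the on-disk layout, so the record shape (a dict holding `ytid`, `start`, `end`) and the idea that an annotation file decodes to a list of dicts with an `id` key are guesses, and the function names `segment_url`, `segment_duration`, and `index_by_id` are hypothetical.

```python
from typing import Any, Dict, Iterable


def segment_url(record: Dict[str, Any]) -> str:
    """Build a YouTube watch URL that starts playback at the segment start.

    Assumes `record` holds `ytid` and `start` (seconds); this layout is a guess.
    """
    start_seconds = int(float(record["start"]))
    return f"https://www.youtube.com/watch?v={record['ytid']}&t={start_seconds}s"


def segment_duration(record: Dict[str, Any]) -> float:
    """Return the segment length in seconds, computed as `end - start`."""
    start, end = float(record["start"]), float(record["end"])
    if end < start:
        raise ValueError(f"segment ends before it starts: {start} > {end}")
    return end - start


def index_by_id(annotations: Iterable[Dict[str, Any]]) -> Dict[Any, Dict[str, Any]]:
    """Index annotation records by their `id` field for direct lookup.

    Assumes each annotation is a dict with an `id` key (hypothetical layout).
    """
    return {item["id"]: item for item in annotations}
```

If the annotation files do decode to such a list, `index_by_id(json.load(f))` would give constant-time lookup of a segment's descriptions. The real files should be inspected first, since their structure may differ.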

MD5 checksums

File MD5 checksum
videos/train.zip 41ddad46ffac339cb0b65dffc02eda65
videos/val.zip 35291ad23944d67212c6e47b4cc6d619
videos/test.zip 07046d205837d2e3b1f65549fc1bc4d7
audios/train.zip 50cc83eebd84f85e9b86bbd2a7517f3f
audios/val.zip 73995c5d1fcef269cc90be8a8ef6d917
audios/test.zip f72085feab6ca36060a0a073b31e8acc
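
The checksums above can be compared against downloaded archives using Python's standard `hashlib` module. This is a sketch, not an official verification script: the helper names `md5sum` and `verify_archives` are hypothetical, and it assumes the archives sit under a local root directory with the same relative paths as in the table.

```python
import hashlib
from pathlib import Path
from typing import Dict, Union

# Expected digests, copied from the checksum table above.
EXPECTED_MD5 = {
    "videos/train.zip": "41ddad46ffac339cb0b65dffc02eda65",
    "videos/val.zip": "35291ad23944d67212c6e47b4cc6d619",
    "videos/test.zip": "07046d205837d2e3b1f65549fc1bc4d7",
    "audios/train.zip": "50cc83eebd84f85e9b86bbd2a7517f3f",
    "audios/val.zip": "73995c5d1fcef269cc90be8a8ef6d917",
    "audios/test.zip": "f72085feab6ca36060a0a073b31e8acc",
}


def md5sum(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(root: Union[str, Path]) -> Dict[str, bool]:
    """Map each expected archive to True if it exists under `root` and matches."""
    results = {}
    for relative_path, expected in EXPECTED_MD5.items():
        archive = Path(root) / relative_path
        results[relative_path] = archive.is_file() and md5sum(archive) == expected
    return results
```

Calling `verify_archives("path/to/FAVDBench")` (a placeholder path) returns a dict of per-file results. A missing or corrupted archive shows up as `False`.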

Update history

  • Latest version: January 9, 2023, Public V0.1
    • v0.1 <January 9, 2023>: initial release

License

Use of the FAVDBench model and code must comply with the Apache 2.0 license, which permits commercial use.

Citation

If you use FAVD or FAVDBench in your research, please use the following BibTeX entry:

@InProceedings{Shen_2023_CVPR,
    author    = {Shen, Xuyang and Li, Dong and Zhou, Jinxing and Qin, Zhen and He, Bowen and Han, Xiaodong and Li, Aixuan and Dai, Yuchao and Kong, Lingpeng and Wang, Meng and Qiao, Yu and Zhong, Yiran},
    title     = {Fine-Grained Audible Video Description},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10585-10596}
}

Collected summary
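Background overview
FAVDBench is the benchmark dataset for the Fine-grained Audible Video Description (FAVD) task introduced at CVPR2023. It focuses on detailed textual descriptions of audible videos, covering object appearance, spatial position, actions, and sounds. The dataset contains more than 10,000 video segments (the dataset card lists the size category as 10K<n<100K). Each segment has bilingual Chinese and English annotations: a one-sentence summary, 4-6 sentences of visual detail, and 1-2 audio descriptions. It is suited to multimodal language processing research.
The content above was collected and summarized by 遇见数据集.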