HuggingFaceM4/TGIF

Name: HuggingFaceM4/TGIF
Creator: HuggingFaceM4
Published: 2022-10-25 10:25:38
License: No description available

Hugging Face | Updated: 2022-10-25 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/HuggingFaceM4/TGIF


Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
language:
- en
license:
- other
multilinguality:
- monolingual
pretty_name: TGIF
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- question-answering
- visual-question-answering
task_ids:
- closed-domain-qa
---

# Dataset Card for TGIF

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://raingo.github.io/TGIF-Release/
- **Repository:** https://github.com/raingo/TGIF-Release
- **Paper:** https://arxiv.org/abs/1604.02748
- **Point of Contact:** mailto: yli@cs.rochester.edu

### Dataset Summary

The Tumblr GIF (TGIF) dataset contains 100K animated GIFs and 120K sentences describing the visual content of the animated GIFs. The animated GIFs were collected from Tumblr, from randomly selected posts published between May and June of 2015. We provide the URLs of animated GIFs in this release. The sentences were collected via crowdsourcing, with a carefully designed annotation interface that ensures a high-quality dataset. We provide one sentence per animated GIF for the training and validation splits, and three sentences per GIF for the test split.
The dataset shall be used to evaluate animated GIF/video description techniques.

### Languages

The captions in the dataset are in English.

## Dataset Structure

### Data Fields

- `video_path`: `str` "https://31.media.tumblr.com/001a8b092b9752d260ffec73c0bc29cd/tumblr_ndotjhRiX51t8n92fo1_500.gif"
- `video_bytes`: `large_bytes` video file in bytes format
- `en_global_captions`: `list_str` List of English captions describing the entire video

### Data Splits

|             | train  | validation | test   | Overall |
|-------------|-------:|-----------:|-------:|--------:|
| # of GIFs   | 80,000 | 10,708     | 11,360 | 102,068 |

## Dataset Creation

### Annotations

Quoting [TGIF paper](https://arxiv.org/abs/1604.02748): \
"We annotated animated GIFs with natural language descriptions using the crowdsourcing service CrowdFlower. We carefully designed our annotation task with various quality control mechanisms to ensure the sentences are both syntactically and semantically of high quality. A total of 931 workers participated in our annotation task. We allowed workers only from Australia, Canada, New Zealand, UK and USA in an effort to collect fluent descriptions from native English speakers. Figure 2 shows the instructions given to the workers. Each task showed 5 animated GIFs and asked the worker to describe each with one sentence. To promote language style diversity, each worker could rate no more than 800 images (0.7% of our corpus). We paid 0.02 USD per sentence; the entire crowdsourcing cost less than 4K USD. We provide details of our annotation task in the supplementary material."

### Personal and Sensitive Information

Nothing specifically mentioned in the paper.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Licensing Information

This dataset is provided to be used for approved non-commercial research purposes.
No personally identifying information is available in this dataset.

### Citation Information

```bibtex
@InProceedings{tgif-cvpr2016,
  author = {Li, Yuncheng and Song, Yale and Cao, Liangliang and Tetreault, Joel and Goldberg, Larry and Jaimes, Alejandro and Luo, Jiebo},
  title = "{TGIF: A New Dataset and Benchmark on Animated GIF Description}",
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2016}
}
```

### Contributions

Thanks to [@leot13](https://github.com/leot13) for adding this dataset.


Provider:

HuggingFaceM4

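Summary of Original Information

Dataset Overview

Dataset Name

TGIF

Dataset Summary

The TGIF dataset contains 100,000 animated GIFs and 120,000 sentences describing their visual content. The GIFs were collected from randomly selected Tumblr posts published between May and June 2015. The dataset provides the URLs of the animated GIFs. The sentences were collected through crowdsourcing, using a carefully designed annotation interface to ensure high quality. The training and validation splits provide one sentence per GIF, and the test split provides three sentences per GIF.

Languages

The captions in the dataset are in English.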

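Dataset Structure

Data Fields
- video_path: string, the URL of the animated GIF.
- video_bytes: large bytes, the animated GIF file in bytes format.
- en_global_captions: list of strings, English captions describing the entire video.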
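
To make the three fields concrete, a record can be modeled as a typed dictionary. This is only a sketch: the names `TGIFRecord` and `is_valid_record` are hypothetical, the caption and bytes in `example` are placeholders, and the GIF-signature check is an assumption about the byte payload rather than something the dataset card states.

```python
from typing import List, TypedDict


class TGIFRecord(TypedDict):
    """Hypothetical container mirroring the three documented fields."""
    video_path: str                # URL of the animated GIF
    video_bytes: bytes             # raw GIF file contents
    en_global_captions: List[str]  # English captions for the whole clip


def is_valid_record(record: TGIFRecord) -> bool:
    """Basic sanity checks; the GIF header test is an assumption."""
    has_url = record["video_path"].startswith(("http://", "https://"))
    # GIF files begin with the ASCII signature "GIF87a" or "GIF89a".
    looks_like_gif = record["video_bytes"][:6] in (b"GIF87a", b"GIF89a")
    has_caption = len(record["en_global_captions"]) >= 1
    return has_url and looks_like_gif and has_caption


example: TGIFRecord = {
    "video_path": "https://31.media.tumblr.com/001a8b092b9752d260ffec73c0bc29cd/tumblr_ndotjhRiX51t8n92fo1_500.gif",
    "video_bytes": b"GIF89a" + b"\x00" * 16,  # placeholder bytes, not a real GIF
    "en_global_captions": ["a placeholder caption"],  # placeholder text
}
print(is_valid_record(example))
```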
Data Splits
- Training set: 80,000 GIFs
- Validation set: 10,708 GIFs
- Test set: 11,360 GIFs
- Total: 102,068 GIFs
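
The split sizes can be cross-checked with a few lines of arithmetic. The sketch below sums the splits and, assuming one caption per training and validation GIF and three per test GIF as the summary states, derives the implied caption count. That count comes out at 124,788, in line with the rounded "120,000 sentences" figure.

```python
# Split sizes as listed above.
splits = {"train": 80_000, "validation": 10_708, "test": 11_360}
# Captions per GIF as stated in the dataset summary.
captions_per_gif = {"train": 1, "validation": 1, "test": 3}

total_gifs = sum(splits.values())
total_captions = sum(splits[name] * captions_per_gif[name] for name in splits)

for name, count in splits.items():
    print(f"{name}: {count} GIFs ({count / total_gifs:.1%} of all GIFs)")
print("total GIFs:", total_gifs)
print("implied captions:", total_captions)
```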

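Dataset Creation

Annotation was carried out through the CrowdFlower crowdsourcing service by 931 workers from Australia, Canada, New Zealand, the UK, and the USA. Each task showed 5 animated GIFs and asked the worker to describe each GIF in one sentence. Workers were paid 0.02 USD per sentence, and the entire crowdsourcing effort cost less than 4,000 USD.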
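
As a rough consistency check on the cost figures, the per-sentence rate can be multiplied by the number of released sentences. The sentence count here is an assumption derived from the split sizes (one caption per training and validation GIF, three per test GIF). Rejected submissions and platform fees are not modeled, so the result is a lower bound of roughly 2,500 USD, not the actual spend.

```python
PRICE_PER_SENTENCE_USD = 0.02  # rate stated in the annotation description
REPORTED_BUDGET_USD = 4_000    # "less than 4,000 USD"

# Assumption: one caption per train/validation GIF, three per test GIF.
released_sentences = 80_000 + 10_708 + 3 * 11_360

estimated_cost = released_sentences * PRICE_PER_SENTENCE_USD
print(f"released sentences: {released_sentences}")
print(f"estimated payout: {estimated_cost:.2f} USD")
print("within reported budget:", estimated_cost < REPORTED_BUDGET_USD)
```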

Licensing Information

The dataset is provided for approved non-commercial research purposes only.

Citation Information

@InProceedings{tgif-cvpr2016,
  author = {Li, Yuncheng and Song, Yale and Cao, Liangliang and Tetreault, Joel and Goldberg, Larry and Jaimes, Alejandro and Luo, Jiebo},
  title = "{TGIF: A New Dataset and Benchmark on Animated GIF Description}",
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2016}
}

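Collected Summary

Dataset Introduction

Construction

The TGIF dataset was built through a carefully designed crowdsourcing task on the CrowdFlower platform. Workers wrote one natural-language sentence for each animated GIF, and several quality control mechanisms were used to keep the sentences syntactically and semantically sound. A total of 931 workers took part, restricted to Australia, Canada, New Zealand, the UK, and the USA in order to collect fluent descriptions. To promote diversity of language style, each worker could annotate at most 800 images. The crowdsourcing cost less than 4,000 USD.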

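Features

The TGIF dataset contains 100K animated GIFs and 120K sentences describing their visual content. The GIFs were collected from randomly selected Tumblr posts published between May and June 2015. The sentences were collected through crowdsourcing with quality controls. The dataset is divided into training, validation, and test splits: the training and validation splits provide one description per GIF, and the test split provides three. The dataset is intended for evaluating animated GIF/video description techniques.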

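Usage

To use the TGIF dataset, researchers can access it through its URL and read three fields for each record: the GIF URL (video_path), the GIF file as bytes (video_bytes), and the list of English captions (en_global_captions). The dataset has three splits: the training set contains 80,000 GIFs, the validation set 10,708, and the test set 11,360. The dataset may be used for non-commercial research purposes. Users must follow the stated license terms and cite the dataset in publications.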
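
The access pattern described above might look like the following sketch. It assumes the Hugging Face `datasets` package is installed, that the repository can be streamed under the ID `HuggingFaceM4/TGIF` with the split name `train`, and that no extra configuration name or loader argument is needed. None of this is confirmed by the dataset card. The helpers `save_gifs` and `stream_tgif` are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterable, Mapping


def save_gifs(records: Iterable[Mapping], out_dir: str, limit: int = 5) -> list:
    """Write up to `limit` GIFs to `out_dir`; return (path, captions) pairs."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, record in enumerate(records):
        if index >= limit:
            break
        gif_path = target / f"tgif_{index:05d}.gif"
        gif_path.write_bytes(record["video_bytes"])  # raw GIF bytes field
        saved.append((gif_path, list(record["en_global_captions"])))
    return saved


def stream_tgif(split: str = "train"):
    """Assumed loading call; needs the `datasets` package and network access."""
    # Imported lazily so that save_gifs can be used without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("HuggingFaceM4/TGIF", split=split, streaming=True)


# Example (needs network access; the split name is an assumption):
#   for path, captions in save_gifs(stream_tgif("train"), "tgif_samples", limit=3):
#       print(path, captions)
```

Streaming is used in the sketch so that a few samples can be inspected without downloading the full set of GIF bytes first. Whether the repository supports it should be checked against the dataset page.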

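Background and Challenges

Background Overview

The TGIF dataset, short for Tumblr GIF, was introduced in 2016 by Yuncheng Li, Yale Song, and colleagues to provide a benchmark for animated GIF description. It contains 100K animated GIFs collected from randomly selected Tumblr posts, together with 120K descriptive sentences. The sentences were collected through crowdsourcing with quality controls. The dataset card also tags TGIF for visual question answering and closed-domain question answering, and the dataset has been used in both computer vision and natural language processing research.

Current Challenges

Building the TGIF dataset involved several challenges. First, the quality and diversity of the crowdsourced sentences had to be ensured. Second, the visual content of animated GIFs is complex and varied, which makes accurate description difficult. Third, the scale and diversity of the dataset add data processing and storage demands. When using the dataset, possible social impact, biases, and other limitations should also be considered, since they can affect how general and reliable research results are.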

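Common Scenarios

Typical Use Case

TGIF pairs animated GIFs with English descriptions, which makes it a standard resource for training and evaluating animated GIF description methods. Because the crowdsourced annotation task was carefully designed, each GIF comes with a quality-controlled natural-language description. This makes the dataset suitable for training deep learning models, especially video description models.

Academic Problems Addressed

TGIF addresses the shortage of annotated data for describing short animated clips in video description research, and it provides a standardized benchmark for understanding and describing animated GIFs. Its annotation process and description data help researchers study and improve model performance in visual content understanding and natural language generation.

Derived Work

Follow-up work based on TGIF includes improved animated GIF description algorithms, studies of quality control in crowdsourced annotation, and research that uses TGIF as a base dataset for training and evaluating cross-modal understanding and generation models.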

The content above was collected and summarized by 遇见数据集.
