High-Level Dataset

Name: High-Level Dataset
Creator: Institute of Linguistics and Language Technology, University of Malta
Published: 2023-09-25 15:37:20
License: no description provided

Source: arXiv. Updated: 2023-09-25. Indexed: 2024-07-24.

Download link:

https://github.com/michelecafagna26/HL-dataset

Overview:

The High-Level Dataset (HL Dataset) was developed by the Institute of Linguistics and Language Technology at the University of Malta. It extends 14,997 images from the COCO dataset with 134,973 human-annotated high-level descriptions. The descriptions fall along three axes (scene, action, and rationale) and aim to capture how people interpret an image and what they expect of its content. The dataset also includes confidence scores collected from independent readers, as well as synthetically generated narrative descriptions. It is suited to testing and fine-tuning vision-and-language models, particularly where high-level image descriptions are required.

Provider:

Institute of Linguistics and Language Technology, University of Malta

Created:

2023-02-24

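Summary of original information

High-Level Dataset (HL Dataset)

Dataset overview

Data source: 14,997 images selected from the COCO dataset.
Description types: high-level descriptions along three axes: scene, action, and rationale.
Number of descriptions: 134,973 crowdsourced descriptions in total (3 descriptions per axis for each image), plus about 749,984 object-centric descriptions.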

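Data collection

Scene: annotators were asked where the picture was taken.
Action: annotators were asked what the subject is doing.
Rationale: annotators were asked why the subject is doing it.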

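Description characteristics

Abstract concepts: high-level descriptions capture human interpretations of an image and contain abstract concepts that are not directly tied to physical objects.
Confidence scores: each high-level description comes with a confidence score from 1 to 5 that indicates how close the description is to common sense.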

Dataset extension

HL-Narratives: a synthetic dataset of image narratives generated by combining the three axes.

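Data structure

Data instances: each instance contains a file name, descriptions, confidence scores, purity scores, and diversity scores.
Data fields:
- file_name: the original COCO file name.
- captions: a dictionary containing all descriptions; each axis is accessed by its axis name.
- confidence: a dictionary containing the confidence scores of the descriptions.
- purity score: a dictionary containing the purity scores of the descriptions, a semantic-similarity measure based on BLEURT.
- diversity score: a dictionary containing the diversity scores of the descriptions, a lexical-diversity measure based on Self-BLEU.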
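As a rough illustration of the fields listed above, the sketch below builds a made-up record and flattens it into one row per caption. The record contents, the assumption that confidence scores are stored as one list per axis aligned with the captions, and the `flatten_record` helper are all hypothetical; the purity and diversity fields are left out because their exact layout is not specified here. Check the released files for the actual schema.

```python
from typing import Dict, List

# The three axes named in the dataset description.
AXES = ("scene", "action", "rationale")


def flatten_record(record: Dict) -> List[Dict]:
    """Turn one image-level record into one row per (axis, caption) pair.

    Assumes (hypothetically) that record["captions"][axis] and
    record["confidence"][axis] are lists of equal length.
    """
    rows = []
    for axis in AXES:
        captions = record["captions"][axis]
        scores = record["confidence"][axis]
        for caption, score in zip(captions, scores):
            rows.append(
                {
                    "file_name": record["file_name"],
                    "axis": axis,
                    "caption": caption,
                    "confidence": score,
                }
            )
    return rows


# A made-up record for illustration only; not taken from the dataset.
example_record = {
    "file_name": "COCO_example_000000000001.jpg",
    "captions": {
        "scene": ["in a kitchen", "at home", "in a restaurant kitchen"],
        "action": ["cooking", "preparing a meal", "making dinner"],
        "rationale": ["they are hungry", "to feed guests", "it is their job"],
    },
    "confidence": {
        "scene": [5, 4, 3],
        "action": [5, 5, 4],
        "rationale": [3, 2, 3],
    },
}

rows = flatten_record(example_record)
```

A flat, per-caption layout like this tends to be convenient for filtering by axis or by confidence before fine-tuning.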

Data splits

Train-validation set: 13,498 images, 121,482 high-level descriptions.
Test set: 1,499 images, 13,491 high-level descriptions.
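The split sizes can be checked against the stated annotation scheme of 3 axes with 3 descriptions per axis, that is, 9 descriptions per image. The short sketch below is only an arithmetic check on the figures quoted on this page; the constant and function names are chosen for illustration.

```python
NUM_AXES = 3            # scene, action, rationale
CAPTIONS_PER_AXIS = 3   # descriptions collected per axis for each image


def expected_captions(num_images: int) -> int:
    """Number of high-level descriptions implied by the annotation scheme."""
    return num_images * NUM_AXES * CAPTIONS_PER_AXIS


# (images, descriptions) as reported for each split.
splits = {
    "train-val": (13498, 121482),
    "test": (1499, 13491),
}

total_images = sum(images for images, _ in splits.values())
total_captions = sum(captions for _, captions in splits.values())

for name, (images, captions) in splits.items():
    # Each reported count should equal images * 9.
    print(name, captions == expected_captions(images))

print(total_images, total_captions)
```

Both splits are consistent with 9 descriptions per image, and together they add up to the 14,997 images and 134,973 descriptions reported for the full dataset.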

Dataset curator

Michele Cafagna

Licensing information

Images and object-centric descriptions: subject to the COCO terms of use.
Remaining annotations: licensed under Apache-2.0.

Citation information

BibTeX:
@inproceedings{cafagna2023hl,
  title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG23)},
  address = {Prague, Czech Republic},
  year={2023}
}

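Collected summary

Dataset introduction

Construction

The High-Level Dataset (HL Dataset) was built by selecting 14,997 images from the COCO dataset and collecting 134,973 human-annotated high-level descriptions for them. The descriptions cover three axes: scene, action, and rationale. Each image has three descriptions on each of the three axes (nine per image), which gives 134,973 descriptions in total. In addition, independent readers rated each high-level description with a confidence score, and a set of narrative descriptions was generated by combining the three axes. The dataset is intended to provide high-level descriptions for vision-and-language models and to support a deeper understanding of scenes.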

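Usage

The High-Level Dataset can support several vision-and-language modeling tasks, including image captioning, visual question answering, and visual storytelling. To use it, download the data locally and preprocess it as described in the dataset documentation. Existing vision-and-language models can then be fine-tuned to generate high-level or narrative descriptions. The confidence scores can be used to assess how reliable and how widely shared a description is, and to identify difficult samples in the dataset. Finally, the dataset can be combined with other datasets to build more complex vision-and-language models.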
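One possible way to use the confidence scores for spotting difficult samples is to rank images by their mean confidence and inspect the lowest-scoring ones. The sketch below is a minimal illustration under stated assumptions: the record layout (a `confidence` dictionary with one list of 1-5 scores per axis), the toy records, and the helpers `mean_confidence` and `hardest_samples` are hypothetical, and treating low mean confidence as "difficult" is a heuristic rather than a method described by the dataset authors.

```python
from statistics import mean
from typing import Dict, List


def mean_confidence(confidence: Dict[str, List[int]]) -> float:
    """Average all confidence scores of one image across every axis."""
    scores = [s for axis_scores in confidence.values() for s in axis_scores]
    return mean(scores)


def hardest_samples(records: List[Dict], k: int) -> List[Dict]:
    """Return the k records with the lowest mean confidence (heuristic)."""
    return sorted(records, key=lambda r: mean_confidence(r["confidence"]))[:k]


# Toy records for illustration only; not taken from the dataset.
records = [
    {
        "file_name": "a.jpg",
        "confidence": {"scene": [5, 5, 4], "action": [5, 4, 4], "rationale": [4, 4, 3]},
    },
    {
        "file_name": "b.jpg",
        "confidence": {"scene": [3, 2, 3], "action": [2, 3, 2], "rationale": [1, 2, 2]},
    },
]

hard = hardest_samples(records, k=1)
print([r["file_name"] for r in hard])
```

The same ranking could be computed per axis instead of per image, for example to look only at rationales, which ask annotators for the most subjective kind of judgment.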

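Background and challenges

Background overview

The High-Level Dataset (HL Dataset) was created in 2023 by researchers at the University of Malta's Institute of Linguistics and Language Technology and Utrecht University's Department of Information and Computing Sciences. It is intended to extend the training and testing of vision-and-language (V&L) models: it takes 14,997 images from the COCO dataset and pairs them with 134,973 human-annotated high-level descriptions covering three dimensions: scene, action, and rationale. The HL Dataset fills a gap in current image captioning datasets, which lack high-level descriptions, and gives V&L models a new resource for both understanding and generation. Its release is relevant to progress on high-level description understanding, scene reasoning, and visual commonsense reasoning.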

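Current challenges

Creating and using the HL Dataset raises several challenges. First, collecting high-level descriptions has to account for the influence of personal experience and commonsense assumptions, which makes the descriptions diverse and subjective. Second, combining high-level descriptions with low-level ones (for example, object descriptions) to produce more natural and richer image descriptions is technically difficult. Third, it remains an open question how to evaluate the performance and reliability of models that generate high-level descriptions. Addressing these challenges calls for further research, for example using confidence scores to assess the reliability of descriptions, and using data augmentation and generation experiments to improve models' ability to produce high-level descriptions.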

Common scenarios

Classic use cases

The HL Dataset is a resource for Vision and Language (V&L) modeling that adds a new dimension to image captioning by focusing on high-level descriptions of scenes, actions, and rationales. It extends 14,997 COCO images with 134,973 human-annotated captions, nine per image. The dataset can be used to train models that generate captions which not only describe the visible objects but also interpret the context and intentions behind the depicted scenes. It is particularly useful for tasks that require understanding the relationship between visual content and human perception, such as visual storytelling and commonsense reasoning.

Academic problems addressed

The HL Dataset addresses a limitation of existing captioning datasets, which predominantly focus on object-centric descriptions. By introducing high-level captions, it enables research into how models can interpret scenes and generate descriptions that are closer to human understanding of them. This shift from object-centric to high-level descriptions matters for developing models that generate more nuanced and contextually rich captions. The dataset also includes a confidence score for each caption, which indicates how widely an interpretation is shared among people and so helps in developing models that better capture shared human experiences and assumptions.

Practical applications

The HL Dataset has several practical applications. It can be used to improve image captioning systems in domains such as social media platforms, where more descriptive and contextually rich captions can improve accessibility and user engagement. It can also support the development of assistive technologies for visually impaired people by providing descriptions that go beyond simple object recognition. The narrative-like captions generated with the HL Dataset can further be applied in educational settings to help learners understand and interpret visual content more deeply.

Recent research on the dataset