embedding-data/flickr30k_captions_quintets
Hugging Face · Updated 2022-08-02 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/embedding-data/flickr30k_captions_quintets
Resource description:
---
license: mit
language:
- en
paperswithcode_id: embedding-data/flickr30k-captions
pretty_name: flickr30k-captions
---
# Dataset Card for "flickr30k-captions"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
  - [Usage Example](#usage-example)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://shannon.cs.illinois.edu/DenotationGraph/](https://shannon.cs.illinois.edu/DenotationGraph/)
- **Repository:** [More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
- **Paper:** [https://transacl.org/ojs/index.php/tacl/article/view/229/33](https://transacl.org/ojs/index.php/tacl/article/view/229/33)
- **Point of Contact:** [Peter Young](mailto:pyoung2@illinois.edu), [Alice Lai](mailto:aylai2@illinois.edu), [Micah Hodosh](mailto:mhodosh2@illinois.edu), [Julia Hockenmaier](mailto:juliahmr@illinois.edu)
### Dataset Summary
We propose to use the visual denotations of linguistic expressions (i.e. the set of images they describe) to define novel denotational similarity metrics, which we show to be at least as beneficial as distributional similarities for two tasks that require semantic inference. To compute these denotational similarities, we construct a denotation graph, i.e. a subsumption hierarchy over constituents and their denotations, based on a large corpus of 30K images and 150K descriptive captions.
Disclaimer: The team releasing Flickr30k did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
### Supported Tasks
- [Sentence Transformers](https://huggingface.co/sentence-transformers) training; useful for semantic search and sentence similarity.
### Languages
- English.
## Dataset Structure
Each example in the dataset is a quintet of similar sentences, formatted as a dictionary whose key is "set" and whose value is a list of the five sentences:
```
{"set": [sentence_1, sentence_2, sentence_3, sentence_4, sentence_5]}
{"set": [sentence_1, sentence_2, sentence_3, sentence_4, sentence_5]}
...
{"set": [sentence_1, sentence_2, sentence_3, sentence_4, sentence_5]}
```
This dataset is useful for training Sentence Transformers models. Refer to the following post on how to train models using similar pairs of sentences.
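One way to obtain similar sentence pairs from the quintet format is to enumerate every unordered pair within a set. The sketch below is a minimal illustration under that assumption, not the dataset authors' method; `quintet_to_pairs` and the placeholder record are hypothetical names introduced here.

```python
from itertools import combinations


def quintet_to_pairs(example):
    """Return every unordered pair of sentences from one {"set": [...]} record."""
    return list(combinations(example["set"], 2))


# Hypothetical record in the format shown above (placeholder strings, not real data).
record = {"set": ["s1", "s2", "s3", "s4", "s5"]}
pairs = quintet_to_pairs(record)
print(len(pairs))  # 5 choose 2 = 10 pairs
```

Each pair can then be treated as a positive (similar) example; whether to use all ten pairs per set or only a sample of them is a design choice left to the training setup.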
### Usage Example
Install the 🤗 Datasets library with `pip install datasets` and load the dataset from the Hub with:
```python
from datasets import load_dataset
dataset = load_dataset("embedding-data/flickr30k_captions_quintets")
```
The dataset is loaded as a `DatasetDict` and has the format:
```python
DatasetDict({
    train: Dataset({
        features: ['set'],
        num_rows: 31783
    })
})
```
Review an example `i` with:
```python
dataset["train"][i]["set"]
```
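Before training, it can be worth confirming that every record really holds five sentences. The following is a small sketch of such a check; `set_size_histogram` is a hypothetical helper, and the stand-in rows use placeholder strings so the snippet runs without downloading anything. With the real data, `dataset["train"]` could be passed in place of `rows`, since iterating over it yields dictionaries with a "set" key.

```python
from collections import Counter


def set_size_histogram(rows):
    """Count how many records have each number of sentences in their "set" field."""
    return Counter(len(row["set"]) for row in rows)


# Stand-in rows with placeholder strings (not real captions).
rows = [
    {"set": ["a1", "a2", "a3", "a4", "a5"]},
    {"set": ["b1", "b2", "b3", "b4", "b5"]},
]
histogram = set_size_histogram(rows)
print(histogram)
```

If any key other than 5 appears in the histogram, those records do not follow the quintet format described above and may need to be filtered out.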
## Dataset Creation
### Curation Rationale
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
#### Who are the source language producers?
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Annotations
#### Annotation process
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
#### Who are the annotators?
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Personal and Sensitive Information
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Discussion of Biases
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Other Known Limitations
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
## Additional Information
### Dataset Curators
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Licensing Information
The dataset card metadata lists the license as MIT.
### Citation Information
[More Information Needed](https://shannon.cs.illinois.edu/DenotationGraph/)
### Contributions
Thanks to [Peter Young](mailto:pyoung2@illinois.edu), [Alice Lai](mailto:aylai2@illinois.edu), [Micah Hodosh](mailto:mhodosh2@illinois.edu), and [Julia Hockenmaier](mailto:juliahmr@illinois.edu) for adding this dataset.
Provider: embedding-data
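## Summary of Original Information
- **Dataset overview**
  - Dataset name (`pretty_name`): flickr30k-captions
- **Dataset description**
  - Summary: The dataset covers 30,000 images and 150,000 descriptive captions, used to build a denotation graph over images and captions for computing the visual denotational similarity of linguistic expressions.
  - Supported tasks: Training Sentence Transformers models; suited to semantic search and sentence similarity.
  - Language: English
- **Dataset structure**
  - Data instances: Each instance contains five similar sentences, formatted as a dictionary whose key is "set" and whose value is the list of sentences.
  - Data fields: The main field is "set", which holds a group of sentences.
  - Data splits: The dataset has a single training split with 31,783 instances.
  - Usage example: Load the dataset with the 🤗 Datasets library; it can then be used to train Sentence Transformers models.
- **Dataset creation**
  - Data collection and normalization: More information needed
  - Source language producers: More information needed
  - Annotation process: More information needed
  - Annotators: More information needed
  - Personal and sensitive information: More information needed
- **Considerations for using the data**
  - Social impact of the dataset: More information needed
  - Discussion of biases: More information needed
  - Other known limitations: More information needed
- **Additional information**
  - Dataset curators: More information needed
  - Licensing information: MIT
  - Citation information: More information needed
  - Contributors: Thanks to Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier for adding this dataset.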



