flickr30k-captions
Source: ModelScope community. Updated: 2025-11-12. Indexed: 2025-01-11.
Download link:
https://modelscope.cn/datasets/sentence-transformers/flickr30k-captions
Description:
# Dataset Card for Flickr30k Captions
This dataset is a collection of pairs of captions written for the same image, drawn from Flickr30k. See [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/) for additional information.
This dataset can be used directly with Sentence Transformers to train embedding models.
Note that two captions for the same image do not necessarily have the same meaning; they may describe different aspects of the image.
## Dataset Subsets
### `pair` subset
* Columns: `caption1`, `caption2`
* Column types: `str`, `str`
* Examples:
```python
{
'caption1': 'A large structure has broken and is laying in a roadway.',
'caption2': 'A man stands on wooden supports and surveys damage.',
}
```
* Collection strategy: Reading the Flickr30k Captions dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate captions (captions describing the same image). I've treated each pair of adjacent captions in a list as a positive pair, plus the last caption paired with the first. So, for example, a list of 5 duplicate captions results in 5 pairs.
* Deduplicated: No
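The pairing rule described under the collection strategy can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function name `circular_pairs`, the placeholder captions, and the choice to return no pairs for a list with fewer than two captions are all assumptions.

```python
def circular_pairs(captions):
    """Pair each caption with the next one, wrapping the last back to the first."""
    n = len(captions)
    if n < 2:
        # Assumption: a single caption has nothing to be paired with.
        return []
    return [
        {"caption1": captions[i], "caption2": captions[(i + 1) % n]}
        for i in range(n)
    ]


# Placeholder captions standing in for the 5 captions of one image.
captions = ["c1", "c2", "c3", "c4", "c5"]
pairs = circular_pairs(captions)

for pair in pairs:
    print(pair)
```

With 5 captions the loop produces 4 adjacent pairs plus the wrap-around pair of the last and first caption, matching the count of 5 given in the collection strategy. Because nothing is deduplicated, identical captions for one image would produce a pair whose two sides are the same string.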
Provider:
maas
Created:
2025-01-06



