coco-captions
Source: ModelScope community (魔搭社区) · Updated 2025-12-26 · Indexed 2025-01-11
Download link:
https://modelscope.cn/datasets/sentence-transformers/coco-captions
Resource description:
# Dataset Card for Coco Captions
This dataset is a collection of caption pairs given to the same image, collected from the Coco dataset. See [Coco](https://cocodataset.org/) for additional information.
This dataset can be used directly with Sentence Transformers to train embedding models.
Note that two captions for the same image do not strictly have the same semantic meaning.
## Dataset Subsets
### `pair` subset
* Columns: "caption1", "caption2"
* Column types: `str`, `str`
* Examples:
```python
{
'caption1': 'A clock that blends in with the wall hangs in a bathroom. ',
'caption2': 'A very clean and well decorated empty bathroom',
}
```
* Collection strategy: Reading the Coco Captions dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate captions. I've considered each pair of adjacent captions in a list as a positive pair, plus the last and first caption. So, e.g., a list of 5 duplicate captions results in 5 pairs.
* Deduplicated: No
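
The collection strategy described above can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function name `circular_adjacent_pairs` and the placeholder captions are hypothetical, and returning no pairs for lists with fewer than two captions is an assumption.

```python
from typing import List, Tuple


def circular_adjacent_pairs(captions: List[str]) -> List[Tuple[str, str]]:
    """Pair each caption with the next one, then pair the last with the first."""
    n = len(captions)
    if n < 2:
        # Assumption: a single caption cannot form a pair.
        return []
    # Index (i + 1) % n wraps around, so the final pair is (last, first).
    return [(captions[i], captions[(i + 1) % n]) for i in range(n)]


# Hypothetical list of 5 captions written for the same image.
captions = ["caption A", "caption B", "caption C", "caption D", "caption E"]

# Shape the pairs like rows of the `pair` subset.
rows = [
    {"caption1": first, "caption2": second}
    for first, second in circular_adjacent_pairs(captions)
]
print(len(rows))  # 5
```

Because the pairing wraps around, a list of n captions (n ≥ 3) yields n distinct ordered pairs. Under this literal reading, a list of exactly two captions would yield the same pair in both orders, which is consistent with the dataset not being deduplicated.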
Provider:
maas
Created:
2025-01-06
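Collected summary
Dataset introduction

Background and challenges
Background overview
This dataset consists of pairs of different captions for the same image, collected from the Coco dataset, and is intended for training embedding models. It contains a `pair` subset with two text columns, is not deduplicated, and is suited to semantic similarity tasks.
The above content was collected and summarized by 遇见数据集.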



