coco-captions

Name: coco-captions
Creator: maas
Published: 2025-12-26 16:19:53
License: No description available

ModelScope community; updated 2025-12-26; indexed 2025-01-11

Download link:

https://modelscope.cn/datasets/sentence-transformers/coco-captions

Resource description:

# Dataset Card for Coco Captions

This dataset is a collection of caption pairs given to the same image, collected from the Coco dataset. See [Coco](https://cocodataset.org/) for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

Note that two captions for the same image do not strictly have the same semantic meaning.

## Dataset Subsets

### `pair` subset

* Columns: "caption1", "caption2"
* Column types: `str`, `str`
* Examples:
    ```python
    {
      'caption1': 'A clock that blends in with the wall hangs in a bathroom. ',
      'caption2': 'A very clean and well decorated empty bathroom',
    }
    ```
* Collection strategy: Reading the Coco Captions dataset from [embedding-training-data](https://huggingface.co/datasets/sentence-transformers/embedding-training-data), which has lists of duplicate captions. I've considered all adjacent captions as a positive pair, plus the last and first caption. So, e.g. 5 duplicate captions result in 5 pairs.
* Deduplicated: No
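
The collection strategy above can be sketched in a few lines of Python. This is only an illustration of the pairing rule as described, not the script that produced the dataset: the function name `ring_pairs` and the placeholder captions are hypothetical, and skipping lists with fewer than two captions is an assumption, since the card does not say how such lists were handled.

```python
from typing import List, Tuple


def ring_pairs(captions: List[str]) -> List[Tuple[str, str]]:
    """Pair each caption with the next one, then the last with the first."""
    n = len(captions)
    if n < 2:
        # Assumption: a lone caption has no partner, so no pair is emitted.
        return []
    # Index (i + 1) % n wraps around, so the final pair is (last, first).
    return [(captions[i], captions[(i + 1) % n]) for i in range(n)]


# Placeholder captions standing in for one image's list of duplicates.
captions = ["c1", "c2", "c3", "c4", "c5"]

# Shape the pairs like rows of the `pair` subset.
rows = [{"caption1": a, "caption2": b} for a, b in ring_pairs(captions)]
print(len(rows))  # 5
```

Because every caption appears once as `caption1` and once as `caption2`, a list of n captions (n ≥ 2) yields exactly n pairs, which matches the "5 captions, 5 pairs" example. Under this literal reading a list of two captions would give both (a, b) and (b, a).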

Provider:

maas

Created:

2025-01-06
