conceptual_captions

Name: conceptual_captions
Creator: maas
Published: 2026-01-06 16:38:02
License: No description provided

ModelScope Community; updated 2026-01-06; indexed 2025-07-12

Download link:

https://modelscope.cn/datasets/google-research-datasets/conceptual_captions

Resource overview:

# Dataset Card for Conceptual Captions

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** [Conceptual Captions homepage](https://ai.google.com/research/ConceptualCaptions/)
- **Repository:** [Conceptual Captions repository](https://github.com/google-research-datasets/conceptual-captions)
- **Paper:** [Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning](https://www.aclweb.org/anthology/P18-1238/)
- **Leaderboard:** [Conceptual Captions leaderboard](https://ai.google.com/research/ConceptualCaptions/competition?active_tab=leaderboard)
- **Point of Contact:** [Conceptual Captions e-mail](mailto:conceptual-captions@google.com)

### Dataset Summary

Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions.
In contrast with the curated style of other image caption annotations, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

### Dataset Preprocessing

This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images. To fetch the images, use the following code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


USER_AGENT = get_datasets_user_agent()


def fetch_single_image(image_url, timeout=None, retries=0):
    # Try the download up to `retries + 1` times; return None if every attempt fails.
    image = None
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download every image in the batch concurrently and store the results in an "image" column.
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("google-research-datasets/conceptual_captions")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train models for the Image Captioning task. The leaderboard for this task is available [here](https://ai.google.com/research/ConceptualCaptions/competition?active_tab=leaderboard). Official submission output captions are scored against the reference captions from the hidden test set using [this](https://github.com/tylin/coco-caption) implementation of the CIDEr (primary), ROUGE-L and SPICE metrics.

### Languages

All captions are in English.

## Dataset Structure

### Data Instances

#### `unlabeled`

Each instance in this configuration represents a single image with a caption:

```
{
  'image_url': 'http://lh6.ggpht.com/-IvRtNLNcG8o/TpFyrudaT6I/AAAAAAAAM6o/_11MuAAKalQ/IMG_3422.JPG?imgmax=800',
  'caption': 'a very typical bus station'
}
```

#### `labeled`

Each instance in this configuration represents a single image with a caption, plus additional machine-generated image labels and confidence scores:

```
{
  'image_url': 'https://thumb1.shutterstock.com/display_pic_with_logo/261388/223876810/stock-vector-christmas-tree-on-a-black-background-vector-223876810.jpg',
  'caption': 'christmas tree on a black background .',
  'labels': ['christmas tree', 'christmas decoration', 'font', 'text', 'graphic design', 'illustration', 'interior design', 'tree', 'christmas eve', 'ornament', 'fir', 'plant', 'pine', 'pine family', 'graphics'],
  'MIDs': ['/m/025nd', '/m/05fc9mj', '/m/03gq5hm', '/m/07s6nbt', '/m/03c31', '/m/01kr8f', '/m/0h8nzzj', '/m/07j7r', '/m/014r1s', '/m/05ykl4', '/m/016x4z', '/m/05s2s', '/m/09t57', '/m/01tfm0', '/m/021sdg'],
  'confidence_scores': [0.9818305373191833, 0.952756941318512, 0.9227379560470581, 0.8524878621101379, 0.7597672343254089, 0.7493422031402588, 0.7332468628883362, 0.6869218349456787, 0.6552258133888245, 0.6357356309890747, 0.5992692708969116, 0.585474967956543, 0.5222904086112976, 0.5113164782524109, 0.5036579966545105]
}
```

### Data Fields

#### `unlabeled`

- `image_url`: Static URL for downloading the image associated with the post.
- `caption`: Textual description of the image.
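
Because the `unlabeled` records carry only a URL and a caption, it can be useful to screen out structurally unusable records before attempting any downloads. The following is a minimal sketch under stated assumptions, not part of the dataset tooling: the helper name `is_usable_record` and its acceptance criteria (an `http`/`https` URL with a host, and a non-blank caption) are hypothetical choices made for illustration.

```python
from urllib.parse import urlparse


def is_usable_record(record):
    """Return True if the record has an http(s) image URL and a non-blank caption.

    Hypothetical helper: the criteria here are illustrative, not a dataset rule.
    """
    url = record.get("image_url") or ""
    caption = record.get("caption") or ""
    parsed = urlparse(url)
    has_valid_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    return has_valid_url and bool(caption.strip())


# The example instance shown in the Data Instances section.
example = {
    "image_url": "http://lh6.ggpht.com/-IvRtNLNcG8o/TpFyrudaT6I/AAAAAAAAM6o/_11MuAAKalQ/IMG_3422.JPG?imgmax=800",
    "caption": "a very typical bus station",
}
print(is_usable_record(example))
```

A check like this only inspects the record's shape; it says nothing about whether the remote image still exists, which the fetching code has to handle separately.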

#### `labeled`

- `image_url`: Static URL for downloading the image associated with the post.
- `caption`: Textual description of the image.
- `labels`: A sequence of machine-generated labels obtained using the [Google Cloud Vision API](https://cloud.google.com/vision).
- `MIDs`: A sequence of machine-generated identifiers (MID) corresponding to the label's Google Knowledge Graph entry.
- `confidence_scores`: A sequence of confidence scores denoting how likely the corresponding labels are present on the image.

### Data Splits

#### `unlabeled`

The basic version of the dataset is split into Training and Validation splits. The Training split consists of 3,318,333 image-URL/caption pairs and the Validation split consists of 15,840 image-URL/caption pairs.

#### `labeled`

The labeled version of the dataset has a single split. All of the data is contained in the Training split, which is a subset of 2,007,090 image-URL/caption pairs from the Training set of the `unlabeled` config.

## Dataset Creation

### Curation Rationale

From the paper:

> In this paper, we make contributions to both the data and modeling categories. First, we present a new dataset of caption annotations Conceptual Captions (Fig. 1), which has an order of magnitude more images than the COCO dataset. Conceptual Captions consists of about 3.3M ⟨image, description⟩ pairs. In contrast with the curated style of the COCO images, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles.

### Source Data

#### Initial Data Collection and Normalization

From the homepage:

> For Conceptual Captions, we developed a fully automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions. Because no human annotators are involved, the Conceptual Captions dataset generation process is highly scalable.
>
> To generate this dataset, we started with a Flume pipeline that processes billions of Internet webpages, extracting, filtering, and processing candidate image and caption pairs, and keeping those that pass through several filters.
>
> We first screen for certain properties like size, aspect ratio, adult content scores. These filters discard more than 65% of the candidates. Next, we use Alt-Texts for text-based filtering, removing captions with non-descriptive text (such as SEO tags or hashtags); we also discard texts with high sentiment polarity or adult content scores, resulting in just 3% of the incoming candidates passing through.
>
> In the next step, we filter out candidates for which none of the text tokens can be mapped to the visual content of the image. We use image classifiers (e.g., Google Cloud Vision APIs) to assign class labels to images and match these labels against the candidate text (allowing morphological transformations), discarding around 60% of the candidates that reach this stage.
>
> The candidates passing the above filters tend to be good Alt-text image descriptions. However, a large majority of these use proper names (for people, venues, locations, etc.), brands, dates, quotes, etc. This creates two distinct problems. First, some of these cannot be inferred based on the image pixels alone. This is problematic because unless the image has the necessary visual information it is not useful for training. Second, even if the proper names could be inferred from the image it is extremely difficult for a model to learn to perform both fine-grained classification and natural-language descriptions simultaneously. We posit that if automatic determination of names, locations, brands, etc. is needed, it should be done as a separate task that may leverage image meta-information (e.g. GPS info), or complementary techniques such as OCR.
>
> We address the above problems with the insight that proper names should be replaced by words that represent the same general notion, i.e., by their concept. For example, we remove locations (“Crowd at a concert in Los Angeles” becomes “Crowd at a concert”), names (e.g., “Former Miss World Priyanka Chopra on the red carpet” becomes “actor on the red carpet”), proper noun modifiers (e.g., “Italian cuisine” becomes just “cuisine”) and noun phrases (e.g., “actor and actor” becomes “actors”). Around 20% of the samples are discarded during this transformation because it can leave sentences too short, or otherwise inconsistent.
>
> Finally, we perform another round of filtering to identify concepts with low-count. We cluster all resolved entities (e.g., “actor”, “dog”, “neighborhood”, etc.) and keep only the candidate types which have a count of over 100 mentions. This retains around 16K entity concepts such as: “person”, “actor”, “artist”, “player” and “illustration”. The less frequent ones that we dropped include “baguette”, “bridle”, “deadline”, “ministry” and “funnel”.

#### Who are the source language producers?

Not specified.

### Annotations

#### Annotation process

Annotations are extracted jointly with the images using the automatic pipeline.

#### Who are the annotators?

Not specified.

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Piyush Sharma, Nan Ding, Sebastian Goodman and Radu Soricut.

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied.
Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```bibtex
@inproceedings{sharma2018conceptual,
  title = {Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning},
  author = {Sharma, Piyush and Ding, Nan and Goodman, Sebastian and Soricut, Radu},
  booktitle = {Proceedings of ACL},
  year = {2018},
}
```

### Contributions

Thanks to [@abhishekkrthakur](https://github.com/abhishekkrthakur) and [@mariosasko](https://github.com/mariosasko) for adding this dataset.
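
As a closing illustration of the `labeled` record layout described under Data Fields, the three parallel sequences (`labels`, `MIDs`, `confidence_scores`) can be zipped together and filtered by score. This is only a sketch under stated assumptions: the function name `confident_labels` and the threshold values are hypothetical, and the sample record is a shortened copy of the `labeled` example from Data Instances.

```python
def confident_labels(record, threshold=0.7):
    """Return (label, MID, score) triples whose score is at least `threshold`.

    Hypothetical helper; assumes the three sequences are aligned by position.
    """
    triples = zip(record["labels"], record["MIDs"], record["confidence_scores"])
    return [(label, mid, score) for label, mid, score in triples if score >= threshold]


# Shortened copy of the `labeled` example instance (first four labels only).
record = {
    "caption": "christmas tree on a black background .",
    "labels": ["christmas tree", "christmas decoration", "font", "text"],
    "MIDs": ["/m/025nd", "/m/05fc9mj", "/m/03gq5hm", "/m/07s6nbt"],
    "confidence_scores": [
        0.9818305373191833,
        0.952756941318512,
        0.9227379560470581,
        0.8524878621101379,
    ],
}

for label, mid, score in confident_labels(record, threshold=0.93):
    print(f"{label}\t{mid}\t{score:.3f}")
```

With a threshold of 0.93, only the first two labels of the sample survive; a suitable cutoff for real use depends on the downstream task and is not specified by the dataset card.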


Provider:

maas

Created:

2025-07-07
