
conceptual_12m

ModelScope community · Updated 2025-07-11 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/google-research-datasets/conceptual_12m
Resource description:
# Dataset Card for Conceptual 12M

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Dataset Preprocessing](#dataset-preprocessing)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Repository:** [Conceptual 12M repository](https://github.com/google-research-datasets/conceptual-12m)
- **Paper:** [Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts](https://arxiv.org/abs/2102.08981)
- **Point of Contact:** [Conceptual Captions e-mail](mailto:conceptual-captions@google.com)

### Dataset Summary

Conceptual 12M (CC12M) is a dataset of 12 million image-text pairs intended for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used for Conceptual Captions 3M (CC3M).

### Dataset Preprocessing

This dataset doesn't download the images locally by default. Instead, it exposes URLs to the images.
To fetch the images, use the following code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import io
import urllib.request  # import the submodule explicitly; a bare `import urllib` does not guarantee it is loaded

import PIL.Image

from datasets import load_dataset
from datasets.utils.file_utils import get_datasets_user_agent


USER_AGENT = get_datasets_user_agent()


def fetch_single_image(image_url, timeout=None, retries=0):
    # Try the download up to `retries + 1` times; return None if every attempt fails.
    image = None
    for _ in range(retries + 1):
        try:
            request = urllib.request.Request(
                image_url,
                data=None,
                headers={"user-agent": USER_AGENT},
            )
            with urllib.request.urlopen(request, timeout=timeout) as req:
                image = PIL.Image.open(io.BytesIO(req.read()))
            break
        except Exception:
            image = None
    return image


def fetch_images(batch, num_threads, timeout=None, retries=0):
    # Download all images of one batch in parallel and store them in a new "image" column.
    fetch_single_image_with_args = partial(fetch_single_image, timeout=timeout, retries=retries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        batch["image"] = list(executor.map(fetch_single_image_with_args, batch["image_url"]))
    return batch


num_threads = 20
dset = load_dataset("conceptual_12m")
dset = dset.map(fetch_images, batched=True, batch_size=100, fn_kwargs={"num_threads": num_threads})
```

### Supported Tasks and Leaderboards

- `image-captioning`: This dataset can be used to train a model for the image captioning task.

### Languages

All captions are in English.

## Dataset Structure

### Data Instances

Each instance represents a single image with a caption:

```
{
  'image_url': 'http://lh6.ggpht.com/-IvRtNLNcG8o/TpFyrudaT6I/AAAAAAAAM6o/_11MuAAKalQ/IMG_3422.JPG?imgmax=800',
  'caption': 'a very typical bus station'
}
```

### Data Fields

- `image_url`: Static URL for downloading the image associated with the post.
- `caption`: Textual description of the image.

### Data Splits

The dataset contains only a training split, with a total of 12,423,374 rows.

## Dataset Creation

### Curation Rationale

Conceptual 12M shares the same pipeline as Conceptual Captions (CC3M), but relaxes some processing steps.

### Source Data

#### Initial Data Collection and Normalization

From the paper:

> To arrive at CC12M, we keep the image-text filtering intact, and relax the unimodal filters only. First, for image-based filtering, we set the maximum ratio of larger to smaller dimension to 2.5 instead of 2. We still keep only JPEG images with size greater than 400 pixels, and still exclude images that trigger pornography detectors. Second, in text-based filtering, we allow text between 3 and 256 words in the alt-text. We still discard candidates with no noun or no determiner, but permit ones without prepositions. We discard the heuristics regarding high unique-word ratio covering various POS tags and word capitalization. We set the maximum fraction of word repetition allowed to 0.2. Given a larger pool of text due to the above relaxations, the threshold for counting a word type as rare is increased from 5 to 20.
>
> The main motivation for CC3M to perform text transformation is that a majority of candidate captions contain ultrafine-grained entities such as proper names (people, venues, locations, etc.), making it extremely difficult to learn as part of the image captioning task. In contrast, we are not restricted by the end task of image caption generation. Our intuition is that relatively more difficult pre-training data would lead to better transferability. We thus do not perform hypernimization or digit substitution. [...] The only exception to the “keep alt-texts as raw as possible” rule is performing person-name substitutions, which we identify as necessary to protect the privacy of the individuals in these images. For this step, we use the Google Cloud Natural Language APIs to detect all named entities of type Person, and substitute them by a special token `<PERSON>`. Around 25% of all the alt-texts in CC12M are transformed in this fashion.

#### Who are the source language producers?

Not specified.
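
The numeric thresholds quoted under "Initial Data Collection and Normalization" can be illustrated with a small sketch. This is not the authors' pipeline: the function and constant names are hypothetical, "size greater than 400 pixels" is assumed to mean that both dimensions exceed 400, and "fraction of word repetition" is assumed to mean the share of words that repeat an earlier word. The noun/determiner check, the rare-word threshold, and the pornography detector are left out because they need external models.

```python
from collections import Counter

# Thresholds taken from the quoted passage; the names are illustrative only.
MAX_ASPECT_RATIO = 2.5           # larger dimension / smaller dimension
MIN_DIMENSION = 400              # assumption: both sides must exceed this
MIN_WORDS, MAX_WORDS = 3, 256    # allowed alt-text length in words
MAX_REPETITION_FRACTION = 0.2    # assumption: repeated words / total words


def passes_image_filter(width, height, image_format):
    """Return True if an image satisfies the relaxed image-based filters."""
    if image_format.upper() not in {"JPEG", "JPG"}:
        return False
    if min(width, height) <= MIN_DIMENSION:
        return False
    return max(width, height) / min(width, height) <= MAX_ASPECT_RATIO


def repetition_fraction(words):
    """Share of words that repeat an earlier word (case-insensitive)."""
    if not words:
        return 0.0
    counts = Counter(word.lower() for word in words)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(words)


def passes_text_filter(alt_text):
    """Return True if an alt-text satisfies the length and repetition filters."""
    words = alt_text.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return False
    return repetition_fraction(words) <= MAX_REPETITION_FRACTION
```

For instance, `passes_text_filter("a very typical bus station")` accepts the caption from the data instance shown earlier, while a two-word alt-text is rejected for being too short.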

### Annotations

#### Annotation process

Annotations are extracted jointly with the images using the automatic pipeline.

#### Who are the annotators?

Not specified.

### Personal and Sensitive Information

From the paper:

> The only exception to the “keep alt-texts as raw as possible” rule is performing person-name substitutions, which we identify as necessary to protect the privacy of the individuals in these images. For this step, we use the Google Cloud Natural Language APIs to detect all named entities of type Person, and substitute them by a special token `<PERSON>`. Around 25% of all the alt-texts in CC12M are transformed in this fashion.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Soravit Changpinyo, Piyush Sharma, Nan Ding and Radu Soricut.

### Licensing Information

The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

### Citation Information

```bibtex
@inproceedings{changpinyo2021cc12m,
  title = {{Conceptual 12M}: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts},
  author = {Changpinyo, Soravit and Sharma, Piyush and Ding, Nan and Soricut, Radu},
  booktitle = {CVPR},
  year = {2021},
}
```

### Contributions

Thanks to [@thomasw21](https://github.com/thomasw21) for adding this dataset.

Provider: maas
Created: 2025-07-07