
DataCompDR-12M

ModelScope Community. Updated 2025-12-25. Indexed 2025-07-05.
Download link:
https://modelscope.cn/datasets/apple/DataCompDR-12M
Resource description:
# Dataset Card for DataCompDR-12M

This dataset contains synthetic captions, embeddings, and metadata for DataCompDR-12M. The metadata was generated using pretrained image-text models on a 12M subset of [DataComp-1B](https://huggingface.co/datasets/mlfoundations/datacomp_1b). For details on how to use the metadata, please visit our [GitHub repository](https://github.com/apple/ml-mobileclip).

The dataset with the original captions is now available at [mlfoundations/DataComp-12M](https://huggingface.co/datasets/mlfoundations/DataComp-12M). The UIDs in each shard match between [mlfoundations/DataComp-12M](https://huggingface.co/datasets/mlfoundations/DataComp-12M) and [apple/DataCompDR-12M](https://huggingface.co/datasets/apple/DataCompDR-12M).

## Dataset Details

### Dataset Description

DataCompDR is an image-text dataset and an enhancement of the DataComp dataset. We reinforce the DataComp dataset using our multi-modal dataset reinforcement strategy. In particular, we create DataCompDR-1B and DataCompDR-12M by reinforcing DataComp-1B (BestPool filtering) and a uniform subset of 12.8M samples, DataComp-12M, respectively.

Generation is a one-time process whose cost is amortized over multiple architectures and extensive ablations. We generate 5 synthetic captions per image using the `coca_ViT-L-14` model in OpenCLIP, and strong random image augmentations (10 for DataCompDR-1B and 30 for DataCompDR-12M). We compute embeddings from an ensemble of two strong teachers (`ViT-L-14` with pretrained weights `datacomp_xl_s13b_b90k` and `openai` in OpenCLIP) on the augmented images as well as on the real and synthetic captions. Embeddings are 1536-D concatenations of 2x768-D vectors.

One seen sample for DataCompDR is a triplet of one randomly augmented image, one ground-truth caption, and one randomly picked synthetic caption.

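The embedding layout and the triplet sampling described above can be illustrated with a minimal sketch. It is not the authors' loader: the function names `split_teachers` and `sample_triplet` are hypothetical, the toy values are made up, and the order of the two teacher halves inside the 1536-D vector is an assumption the card does not state.

```python
import random

EMB_DIM = 1536      # stated in the card: concatenation of two 768-D vectors
TEACHER_DIM = 768   # size of each teacher's embedding


def split_teachers(emb):
    """Split one 1536-D embedding into two 768-D halves.

    Which teacher occupies which half is an assumption, not stated in the card.
    """
    if len(emb) != EMB_DIM:
        raise ValueError(f"expected {EMB_DIM} values, got {len(emb)}")
    return emb[:TEACHER_DIM], emb[TEACHER_DIM:]


def sample_triplet(num_augs, gt_caption, syn_captions, rng=random):
    """Pick one augmentation index, the ground-truth caption, one synthetic caption."""
    aug_idx = rng.randrange(num_augs)           # one of the stored augmentations
    syn_idx = rng.randrange(len(syn_captions))  # one of the synthetic captions
    return aug_idx, gt_caption, syn_captions[syn_idx]


# Toy stand-ins (hypothetical values, not real dataset content).
rng = random.Random(0)
toy_emb = [0.0] * TEACHER_DIM + [1.0] * TEACHER_DIM
first_half, second_half = split_teachers(toy_emb)

syn_captions = [f"synthetic caption {i}" for i in range(5)]  # 5 per image
triplet = sample_triplet(30, "a photo of a dog", syn_captions, rng)  # 30 augs for DataCompDR-12M
print(len(first_half), len(second_half), triplet)
```

Sampling indices rather than tensors keeps the sketch independent of any storage format; a real pipeline would use the sampled indices to select the matching augmentation parameters and stored embeddings.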
- **Curated by:** Original data by [DataComp](https://www.datacomp.ai/) and metadata by Apple.
- **License:** We distribute our metadata under our [license](https://github.com/apple/ml-mobileclip/blob/main/LICENSE). The original image url-text samples and metadata were released by [DataComp](https://www.datacomp.ai/) under the Creative Commons CC-BY-4.0 license. The individual images are under their own copyrights.
- **Repository:** [ml-mobileclip GitHub](https://github.com/apple/ml-mobileclip)
- **Paper:** [MobileCLIP paper](https://arxiv.org/abs/2311.17049)
- **Demo:** Coming Soon

## Uses

Training with DataCompDR shows a significant improvement in learning efficiency compared to standard CLIP training. For example, with a single node of 8×A100 GPUs, we achieve 61.7% zero-shot classification accuracy on ImageNet-val in approximately one day when training a ViT-B/16 based CLIP from scratch on DataCompDR-12M. Training with DataCompDR-1B sets new state-of-the-art performance on several metrics (Fig. 2) while still using a fraction of the training compute budget of previous works. Using DataCompDR, we demonstrate 10x-1000x better learning efficiency compared to DataComp.

## Dataset Structure

```
- <uid>.url.txt: Image URL (string)
- <uid>.syn.json:
  - syn_text: List of synthetic captions (list[string])
- <uid>.paug.json:
  - param_aug: List of augmentation parameters (list[list[Union[int,float]]])
- <uid>.npz:
  - image_emb: List of image embeddings for multiple image augmentations (list[list[float]])
  - text_emb: List of text embeddings for ground-truth/synthetic captions (list[list[float]])
- <uid>.json:
  - uid: UID of image-text sample in DataComp (string)
  - sha256: SHA256 hash of the image (string)
```

## Citation

**[MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://arxiv.org/pdf/2311.17049.pdf). (CVPR 2024)**
*Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.*

```bibtex
@InProceedings{mobileclip2024,
  author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
  title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
}
```
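Returning to the per-sample files listed under Dataset Structure, the sketch below shows one way the files might be grouped by UID and how a `.syn.json` payload could be parsed. It uses only the standard library; the helper names `group_by_uid` and `parse_syn_json`, the example file names, and the assumption that a UID contains no `.` character are all illustrative and not taken from the card.

```python
import json
from collections import defaultdict

# Suffixes listed in the card's Dataset Structure section.
SUFFIXES = ("url.txt", "syn.json", "paug.json", "npz", "json")


def group_by_uid(filenames):
    """Map each UID to {suffix: filename}, assuming names look like '<uid>.<suffix>'."""
    samples = defaultdict(dict)
    for name in filenames:
        # Split at the first dot; assumes the UID itself contains no '.'.
        uid, _, suffix = name.partition(".")
        if suffix in SUFFIXES:
            samples[uid][suffix] = name
    return dict(samples)


def parse_syn_json(text):
    """Return the list of synthetic captions stored under 'syn_text'."""
    captions = json.loads(text)["syn_text"]
    if not all(isinstance(c, str) for c in captions):
        raise TypeError("syn_text must be a list of strings")
    return captions


# Hypothetical file names and payload, for illustration only.
names = [
    "abc123.url.txt", "abc123.syn.json", "abc123.paug.json",
    "abc123.npz", "abc123.json", "README.md",
]
groups = group_by_uid(names)
captions = parse_syn_json('{"syn_text": ["a dog on grass", "a brown dog"]}')
print(sorted(groups["abc123"]), len(captions))
```

The `.npz` member, which holds `image_emb` and `text_emb`, would typically be read with `numpy.load`; that step is left out here to keep the sketch dependency-free.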

Provider:
maas
Created:
2025-07-04