DataCompDR-12M-bf16
Source: ModelScope community (updated 2025-11-27, indexed 2025-07-05)
Download link: https://modelscope.cn/datasets/apple/DataCompDR-12M-bf16
# Dataset Card for DataCompDR-12M-BFloat16
<!-- Provide a quick summary of the dataset. -->
This dataset contains synthetic captions, embeddings, and metadata for DataCompDR-12M.
The metadata has been generated using pretrained image-text models on a 12M subset of [DataComp-1B](https://huggingface.co/datasets/mlfoundations/datacomp_1b).
For details on how to use the metadata, please visit our [github repository](https://github.com/apple/ml-mobileclip).
The dataset with the original captions is now available at [mlfoundations/DataComp-12M](https://huggingface.co/datasets/mlfoundations/DataComp-12M).
The UIDs per shards match between [mlfoundations/DataComp-12M](https://huggingface.co/datasets/mlfoundations/DataComp-12M) and [apple/DataCompDR-12M-bf16](https://huggingface.co/datasets/apple/DataCompDR-12M-bf16).
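Because the UIDs match shard by shard, records from the two datasets can be paired on `uid`. The following is a minimal sketch of such a join, not the loader used in the repository. The records are hand-written stand-ins for one shard of each dataset, and the `text` field name for the original caption is an assumption; only `uid` and `syn_text` are named in this card.

```python
from typing import Dict, List


def join_by_uid(captions: List[dict], reinforcements: List[dict]) -> List[dict]:
    """Merge two record lists on their shared ``uid`` field."""
    by_uid: Dict[str, dict] = {rec["uid"]: rec for rec in reinforcements}
    merged = []
    for rec in captions:
        extra = by_uid.get(rec["uid"])
        if extra is None:
            # Skip samples that have no counterpart in the other shard.
            continue
        merged.append({**rec, **extra})
    return merged


# Hypothetical records; "text" is an assumed field name for the original caption.
datacomp_shard = [
    {"uid": "a1", "text": "a photo of a dog"},
    {"uid": "b2", "text": "a red bicycle"},
]
datacompdr_shard = [
    {"uid": "b2", "syn_text": ["a bike leaning on a wall"]},
    {"uid": "a1", "syn_text": ["a dog sitting on grass"]},
]

merged = join_by_uid(datacomp_shard, datacompdr_shard)
print([m["uid"] for m in merged])
```

The join keeps the order of the first list, so the record order inside the two shards does not need to agree.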
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
DataCompDR is an image-text dataset and an enhancement to the DataComp dataset.
We reinforce the DataComp dataset using our multi-modal dataset reinforcement strategy.
In particular, we create DataCompDR-1B and DataCompDR-12M by reinforcing, respectively, DataComp-1B (BestPool filtering) and DataComp-12M, a uniform subset of 12.8M samples.
We have a one-time generation process, the cost of which is amortized over multiple architectures and extensive ablations.
We generate 5 synthetic captions per image using the `coca_ViT-L-14` model in OpenCLIP, and strong random image augmentations (10 for DataCompDR-1B and 30 for DataCompDR-12M).
We compute embeddings of an ensemble of two strong teachers (`ViT-L-14` with pretrained weights `datacomp_xl_s13b_b90k` and `openai` in OpenCLIP) on augmented images as well as real and synthetic captions.
Embeddings are 1536-D concatenations of 2x768-D vectors.
One seen sample for DataCompDR is a triplet of one randomly augmented image, one ground-truth caption, and one randomly picked synthetic caption.
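The two points above, the 1536-D concatenated embeddings and the triplet that makes up one seen sample, can be illustrated with a short sketch. It is not the repository's sampling code. It assumes that the text embeddings are ordered with the ground-truth caption first and the synthetic captions after it; this card does not state that ordering. The helper names `split_teachers` and `sample_triplet` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

TEACHER_DIM = 768  # each of the two teachers contributes a 768-D vector


def split_teachers(emb: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Split a 1536-D concatenated embedding into its two 768-D halves."""
    if len(emb) != 2 * TEACHER_DIM:
        raise ValueError(f"expected {2 * TEACHER_DIM} values, got {len(emb)}")
    return list(emb[:TEACHER_DIM]), list(emb[TEACHER_DIM:])


def sample_triplet(num_augs: int, num_syn: int, rng: random.Random) -> Tuple[int, int, int]:
    """Return indices for (augmented image, ground-truth caption, synthetic caption).

    ASSUMPTION: text embeddings are stored ground-truth first, so the
    ground-truth caption is index 0 and synthetic caption k is index 1 + k.
    """
    aug_idx = rng.randrange(num_augs)     # one random augmentation
    syn_idx = 1 + rng.randrange(num_syn)  # one random synthetic caption
    return aug_idx, 0, syn_idx


rng = random.Random(0)
# DataCompDR-12M stores 30 augmentations and 5 synthetic captions per image.
aug_idx, gt_idx, syn_idx = sample_triplet(num_augs=30, num_syn=5, rng=rng)

# Dummy embedding standing in for one row of image_emb or text_emb.
dummy_emb = [float(i) for i in range(2 * TEACHER_DIM)]
first_half, second_half = split_teachers(dummy_emb)
```

Which half belongs to which teacher is not specified here, so the sketch only names the halves by position.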
- **Curated by:** Original data by [DataComp](https://www.datacomp.ai/) and metadata by Apple.
- **License:** We distribute our metadata under our [license](https://github.com/apple/ml-mobileclip/blob/main/LICENSE). The original image url-text samples and metadata were released by [DataComp](https://www.datacomp.ai/) under the Creative Commons CC-BY-4.0 license. The individual images are under their own copyrights.
- **Repository:** [ml-mobileclip GitHub](https://github.com/apple/ml-mobileclip)
- **Paper:** [MobileCLIP paper](https://arxiv.org/abs/2311.17049)
- **Demo:** Coming Soon
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
Training with DataCompDR is significantly more learning-efficient than standard CLIP training.
For example, with a single node of 8×A100 GPUs, we achieve 61.7% zero-shot classification accuracy on ImageNet-val in approximately one day when training a ViT-B/16 based CLIP from scratch on DataCompDR-12M.
Training with DataCompDR-1B sets new state-of-the-art performance on several metrics (Fig. 2 of the MobileCLIP paper) while still using a fraction of the training compute budget compared to previous works.
Using DataCompDR, we demonstrate 10x-1000x learning efficiency in comparison to DataComp.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
- <uid>.url.txt: Image URL (string)
- <uid>.syn.json:
  - syn_text: List of synthetic captions (list[string])
- <uid>.paug.json:
  - param_aug: List of augmentation parameters (list[list[Union[int,float]]])
- <uid>.pth.gz:
  - image_emb: List of image embeddings for multiple image augmentations (list[list[Bfloat16]])
  - text_emb: List of text embeddings for ground-truth/synthetic captions (list[list[Bfloat16]])
- <uid>.json:
  - uid: UID of image-text sample in DataComp (string)
  - sha256: SHA256 hash of the image (string)
```
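As a rough illustration of the layout above, the sketch below groups per-sample files by UID and decodes the text and JSON members with the standard library. It assumes the files have already been read into a mapping from file name to raw bytes; how they are packed into shards is not described in this card. The function name `group_sample_files` and the example values are hypothetical. The `.pth.gz` member is kept as raw bytes, since decoding it presumably requires decompression plus a tensor library.

```python
import json
from collections import defaultdict
from typing import Dict


def group_sample_files(files: Dict[str, bytes]) -> Dict[str, dict]:
    """Group ``<uid>.<ext>`` entries by UID and decode text/JSON members."""
    samples: Dict[str, dict] = defaultdict(dict)
    for name, payload in files.items():
        # ASSUMPTION: the UID itself contains no dots.
        uid, _, ext = name.partition(".")
        if ext == "url.txt":
            samples[uid]["url"] = payload.decode("utf-8").strip()
        elif ext in ("syn.json", "paug.json", "json"):
            samples[uid].update(json.loads(payload))
        elif ext == "pth.gz":
            # Kept undecoded; holds image_emb and text_emb.
            samples[uid]["pth_gz"] = payload
    return dict(samples)


# Hypothetical in-memory files for a single sample with UID "abc".
files = {
    "abc.url.txt": b"https://example.com/img.jpg\n",
    "abc.syn.json": json.dumps({"syn_text": ["a cat on a sofa"]}).encode(),
    "abc.paug.json": json.dumps({"param_aug": [[0, 0.5]]}).encode(),
    "abc.json": json.dumps({"uid": "abc", "sha256": "00"}).encode(),
}
sample = group_sample_files(files)["abc"]
print(sorted(sample))
```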
## Citation
**[MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://arxiv.org/pdf/2311.17049.pdf). (CVPR 2024)**
*Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.*
```bibtex
@InProceedings{mobileclip2024,
author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
}
```
Provider: maas
Created: 2025-07-04



