TiC-DataComp
Source: ModelScope community · updated 2025-11-27 · indexed 2025-07-05
Download link: https://modelscope.cn/datasets/apple/TiC-DataComp
Description:
# Dataset Card for TiC-DataComp
<!-- Provide a quick summary of the dataset. -->
This dataset contains metadata for the TiC-DataComp benchmark for time-continual learning of image-text models.
The dataset contains timestamp information for DataComp-1B in the form of UID groupings by year/month, sourced from the original CommonCrawl.
We also release UIDs for our TiC-DataCompNet and TiC-DataComp-Retrieval evaluations for continual learning of CLIP models.
For details on how to use the metadata, please visit our [GitHub repository](https://github.com/apple/ml-tic-clip).
## Dataset Details
### Dataset Description
<!-- Provide a longer summary of what this dataset is. -->
Keeping large foundation models up to date on the latest data is inherently expensive.
To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models.
This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines.
We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models:
TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset,
contains over 12.7B timestamped image-text pairs spanning 9 years (2014-2022).
We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models.
We show that OpenAI's CLIP (trained on data up to 2020) loses ≈8% zero-shot accuracy on our curated retrieval task from 2021-2022 compared with more recently trained models in the OpenCLIP repository.
We then study how to efficiently train models on time-continuous data.
We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by 2.5× when compared to the standard practice of retraining from scratch.
Code is available at [https://github.com/apple/ml-tic-clip](https://github.com/apple/ml-tic-clip).
- **Developed by:** Apple
- **License:** See [LICENSE](https://github.com/apple/ml-tic-clip/blob/main/LICENSE)
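The rehearsal idea described above (continue from the last checkpoint and replay old data) can be illustrated with a minimal sketch of how a training pool for one time step might be assembled. This is not the authors' implementation: the function name `build_replay_pool`, the `replay_fraction` parameter, and uniform sampling over all earlier steps are assumptions made purely for illustration; the actual strategies are in the linked repository.

```python
import random


def build_replay_pool(uids_by_step, step, replay_fraction=0.5, seed=0):
    """Return the UIDs to train on at `step`: all new data plus replayed old data.

    `replay_fraction` is a hypothetical knob: the approximate share of the
    returned pool that is drawn from earlier time steps.
    """
    if not 0 <= replay_fraction < 1:
        raise ValueError("replay_fraction must be in [0, 1)")

    new = list(uids_by_step[step])
    # Everything seen before the current step is eligible for replay.
    old = [uid for s in range(step) for uid in uids_by_step[s]]
    if not old or replay_fraction == 0:
        return new

    # Choose n_old so that n_old / (n_old + len(new)) is close to replay_fraction,
    # capped by how much old data actually exists.
    n_old = round(len(new) * replay_fraction / (1 - replay_fraction))
    n_old = min(n_old, len(old))

    rng = random.Random(seed)
    return new + rng.sample(old, n_old)
```

In a continual-training loop, the pool for step `t` would be passed to a training routine that starts from the checkpoint produced at step `t - 1`, rather than from a fresh initialization.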
## Uses
<!-- Address questions around how the dataset is intended to be used. -->
Researchers can use the TiC-DataComp dataset to design and evaluate continual learning methods at large scale for image-text models.
## Dataset Structure
<!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. -->
```
- tic-datacomp_training_monthly/<YYYYMM>.npy
  - List of UIDs for each month.
- tic-datacomp_training_yearly_noeval/<YYYY>.npy
  - List of UIDs for each year after removing yearly evaluation sets.
- tic-datacomp_retrieval_evals_year2uids: TiC-DataComp-Retrieval evaluation UIDs per year.
- tic-datacompnet_year2uids: TiC-DataCompNet evaluation UIDs per year.
```
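As a rough illustration of the layout above, the following sketch loads the monthly UID files with NumPy and counts the UIDs per month. Several things here are assumptions rather than documented facts: that file names use a four-digit year followed by a two-digit month (for example `201603.npy`), that each file holds an array with at least one dimension, and that the arrays load without pickling. The helper names are hypothetical; the authoritative loading code is in the GitHub repository.

```python
from pathlib import Path

import numpy as np

# Directory name taken from the dataset structure listing.
MONTHLY_DIR = "tic-datacomp_training_monthly"


def monthly_uid_path(root, year, month):
    """Build the path of one monthly file, assuming <YYYYMM>.npy naming."""
    return Path(root) / MONTHLY_DIR / f"{year:04d}{month:02d}.npy"


def load_uid_array(path):
    """Load one UID array without assuming a particular dtype."""
    return np.load(path, allow_pickle=False)


def count_uids_per_month(root):
    """Map each month key (the file stem) to the number of UIDs in its file."""
    counts = {}
    for path in sorted((Path(root) / MONTHLY_DIR).glob("*.npy")):
        counts[path.stem] = len(load_uid_array(path))
    return counts
```

Inspecting `load_uid_array(path).dtype` on a real file is a sensible first step before joining these UIDs against DataComp-1B metadata, since the UID encoding is not stated in this card.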
## Citation
**[TiC-CLIP: Continual Training of CLIP Models](https://arxiv.org/abs/2310.16226). (ICLR 2024)**
*Garg, S., Farajtabar, M., Pouransari, H., Vemulapalli, R., Mehta, S., Tuzel, O., Shankar, V. and Faghri, F.*
```bibtex
@inproceedings{garg2024tic,
title={TiC-CLIP: Continual Training of CLIP Models},
author={Garg, Saurabh and Farajtabar, Mehrdad and Pouransari, Hadi and Vemulapalli, Raviteja and Mehta, Sachin and Tuzel, Oncel and Shankar, Vaishaal and Faghri, Fartash},
booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
year={2024},
url={https://openreview.net/forum?id=TLADT8Wrhn}
}
```
Provider: maas
Created: 2025-07-04



