CLIP Pre-training Data

Name: CLIP Pre-training Data
Creator: OpenAI
License: No description available

Source: arXiv (indexed 2025-09-30)

Download link:

https://github.com/facebookresearch/MetaCLIP/tree/main/mode

Description:

This dataset consists of image-caption pairs collected by web scraping and intended for training CLIP models. The data is noisy, and the proposed method uses clustering to reduce that noise, with a particular focus on reducing false negatives and strengthening hard negatives. The dataset ranges in scale from hundreds of millions to billions of samples, and the task is contrastive image-language pre-training.
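The clustering idea in the description can be pictured with a small sketch. This is an illustration under stated assumptions, not the dataset authors' pipeline: the toy 2-D caption embeddings, the function names (`nearest`, `kmeans`, `group_pairs_by_cluster`), the choice of k, and the first-k centroid initialization are all hypothetical. It groups image-caption pairs by the cluster of their caption embedding, so that negatives drawn from the same group are semantically close and therefore harder.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means. Centroids start at the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = defaultdict(list)
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each non-empty cluster's centroid to the mean of its members.
        for i, members in groups.items():
            centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def group_pairs_by_cluster(pairs, embeddings, k):
    """Map cluster index -> list of (image, caption) pairs assigned to it."""
    centroids = kmeans(embeddings, k)
    clusters = defaultdict(list)
    for pair, emb in zip(pairs, embeddings):
        clusters[nearest(emb, centroids)].append(pair)
    return dict(clusters)


# Toy data: hypothetical file names and hand-made 2-D "caption embeddings".
pairs = [
    ("dog1.jpg", "a brown dog"),
    ("car1.jpg", "a red car"),
    ("dog2.jpg", "a puppy running"),
    ("car2.jpg", "a parked truck"),
]
embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]

clusters = group_pairs_by_cluster(pairs, embeddings, k=2)
for index, members in sorted(clusters.items()):
    print(index, members)
```

In a real setting the embeddings would come from a pretrained text or image encoder and the clustering would run at a much larger scale; the sketch only shows how cluster membership can decide which pairs are contrasted against each other.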

Provider:

OpenAI
