CLIP Pre-training Data

Indexed from arXiv on 2025-09-30
Download link:
https://github.com/facebookresearch/MetaCLIP/tree/main/mode
Description:
This dataset consists of image-caption pairs collected by web scraping and is intended for training CLIP models. The data is noisy, and the proposed method uses clustering to mitigate that noise, with a focus on reducing false negatives and strengthening hard negatives. The dataset ranges in scale from hundreds of millions to billions of samples. The task is contrastive image-language pre-training.
Provider:
OpenAI