
hammh0a/SynthCLIP

Source: Hugging Face. Updated 2024-02-06. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/hammh0a/SynthCLIP
Resource description:
---
license: cc-by-nc-4.0
---

<p style="text-align:center; font-size:2em; font-weight:bold;">SynthCI-30M</p>

<div style="display: flex; justify-content: center; align-items: center; height: 100%;">
<img src="https://i.ibb.co/kDv612p/ef8b63cb-ce63-4246-8aab-6535711f61f5.webp" alt="Alt text" style="max-width:70%; height:auto;">
</div>

This repo contains SynthCI-30M which is the dataset proposed in "SynthCLIP: Are We Ready For a Fully Synthetic CLIP Training?". The dataset contains 30M synthetic text-image pairs covering a wide range of concepts.

<div style="text-align:center;">
<p><em>"We will reach a time where machines will create machines."</em></p>
</div>

## Abstract

We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images.

## Structure

* `SynthCI-30/combined_images_and_captions.csv` contains the image paths with corresponding captions
* `SynthCI-30/data` contains 3039 zip files each containing 10K images.

## Citation

```
@misc{hammoud2024synthclip,
  title={SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?},
  author={Hasan Abed Al Kader Hammoud and Hani Itani and Fabio Pizzati and Philip Torr and Adel Bibi and Bernard Ghanem},
  year={2024},
  eprint={2402.01832},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
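The layout described under Structure (one CSV of image paths and captions, plus zip archives of images) could be read along the following lines. This is a minimal sketch using only the Python standard library, not the authors' loading code. It assumes the CSV starts with a header row and stores the image path in the first column and the caption in the second; the card does not state the column names or how paths inside the archives are laid out, so the function names `iter_pairs` and `read_image_bytes` and the member-name argument are hypothetical.

```python
import csv
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_pairs(csv_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (image_path, caption) pairs from a captions CSV.

    Assumption: the first row is a header, column 0 is the image path
    and column 1 is the caption. Adjust if the real file differs.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        for row in reader:
            if len(row) >= 2:  # ignore blank or malformed rows
                yield row[0], row[1]


def read_image_bytes(zip_path: Path, member: str) -> bytes:
    """Return the raw bytes of one image stored inside a zip archive.

    Assumption: `member` is the file's name as recorded in the archive,
    which may or may not match the path listed in the CSV.
    """
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)
```

Streaming rows with a generator avoids holding all 30M captions in memory at once. A caller would pass the path to its local copy of `SynthCI-30/combined_images_and_captions.csv` to `iter_pairs`, then inspect one archive under `SynthCI-30/data` with `zipfile.ZipFile(...).namelist()` to see how member names relate to the CSV paths before calling `read_image_bytes`.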
Provided by:
hammh0a
Summary of original information

SynthCI-30M dataset overview

Dataset name

  • SynthCI-30M

Dataset size

  • 30M (30,000,000 data points)