fondant-ai/datacomp-small-clip

Name: fondant-ai/datacomp-small-clip
Creator: fondant-ai
Published: 2024-03-07 08:01:04
License: cc-by-4.0

Source: Hugging Face. Updated 2024-03-07; indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/fondant-ai/datacomp-small-clip

Resource description:

---
license: cc-by-4.0
configs:
- config_name: embeddings
  data_files: data/*.parquet
- config_name: id_mapping
  data_files: id_mapping/*.parquet
task_categories:
- image-to-text
- image-to-image
tags:
- images
- CLIP
- embeddings
- FAISS
size_categories:
- 1M<n<10M
---

<a href="https://github.com/ml6team/fondant">
<img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" width="600px"/>
</a>

Production-ready data processing made easy and shareable

<a href="http://fondant.ai">Explore the Fondant docs »</a>
<a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>

# Dataset Card for fondant-ai/datacomp-small-clip

This is a dataset containing image urls and their CLIP embeddings, based on the [datacomp_small](https://huggingface.co/datasets/mlfoundations/datacomp_small) dataset, and processed with [fondant](https://github.com/ml6team/fondant).

## Dataset Details

### Dataset Description

Large (image) datasets are often unwieldy to use due to their sheer size. Assume for instance that we would like to extract all the cat images from such a dataset. We would have to look at every image to classify if it's a cat image or not. And if we want to extract all the dog images next, we again need to look at every image.

Instead, we can look at every image once, and calculate a (CLIP) embedding representing its content. Combining these embeddings into an index, we can efficiently search through the dataset with a query, finding specific images, without having to look at each one.

![CLIP index](https://cdn-uploads.huggingface.co/production/uploads/6454cb0e1a543cf97b1b6fd6/Mgl9UAqiwJrV4WDb8Y2-k.png)

This is what LAION did for their [LAION-5b dataset](https://laion.ai/blog/laion-5b/), which made it possible to use, like we did in our [ControlNet example](https://github.com/ml6team/fondant-usecase-controlnet).
Unfortunately, the LAION-5b dataset and index have been [taken offline](https://laion.ai/notes/laion-maintanence/) (temporarily) and there [aren't any alternatives](https://github.com/rom1504/clip-retrieval/issues/324). This is why we built an index for the Datacomp-12M dataset. While it is a lot smaller than LAION-5b, it should already enable a lot of use cases again, and can hopefully be the start towards building indices for more and larger datasets.

- **License:** cc-by-4.0

### Dataset Sources

- **Original data:** [datacomp_small](https://huggingface.co/datasets/mlfoundations/datacomp_small)
- **Repository:** [fondant-clip-index](https://github.com/ml6team/fondant-clip-index)

## Uses

We provide an [example use case](https://github.com/ml6team/fondant-usecase-controlnet) which uses the FAISS index of this dataset to create a dataset of interior design images, used for the fine-tuning of a ControlNet model.

## Dataset Structure

The data repository is structured as follows:
- [data/](https://huggingface.co/datasets/fondant-ai/datacomp-small-clip/viewer/embeddings): The dataset containing ids, urls, and CLIP embeddings
- [faiss](https://huggingface.co/datasets/fondant-ai/datacomp-small-clip/blob/main/faiss): The faiss index
- [id_mapping/](https://huggingface.co/datasets/fondant-ai/datacomp-small-clip/viewer/id_mapping): The mapping of the faiss ids to the original urls

## Dataset Creation

We leveraged Fondant to generate the CLIP index and published the pipeline as a [git repository](https://github.com/ml6team/fondant-clip-index). The pipeline consists of 4 steps:

- A [`load_from_hf_hub`](https://fondant.ai/en/stable/components/hub/#load_from_hf_hub#description) operation that loads the [datacomp_small](https://huggingface.co/datasets/mlfoundations/datacomp_small) dataset from huggingface into the Fondant workspace and format.
- A [`download_images`](https://fondant.ai/en/stable/components/hub/#download_images#description) operation which downloads the actual images from the urls in the dataset.
- An [`embed_images`](https://fondant.ai/en/stable/components/hub/#embed_images#description) operation which embeds the downloaded images using a CLIP model.
- A [`write_to_file`](https://fondant.ai/en/stable/components/hub/#write_to_file#description) operation which writes the original urls and generated embeddings to the chosen destination.

After running the pipeline, we used [`autofaiss`](https://github.com/criteo/autofaiss) to build the CLIP index.

### Execution details

#### Download images

We downloaded the images with 32 cores in parallel, each opening up to 25 concurrent connections, and achieved a success rate of 72%, resulting in 9,251,172 images.

The downloading was executed on a VM on GCP using the Fondant Docker runner. We originally planned to run this on Vertex AI, but moved to a VM when noticing lower network bandwidth on Vertex.

The success rate can probably be further improved by setting up a faster DNS resolver.

#### Embed images

We leveraged the [`laion/CLIP-ViT-B-32-laion2B-s34B-b79K`](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K) CLIP model. We chose this model for a few reasons: it is popular, which makes it easy to use with existing embeddings; it is small, which makes it cheap to run; and it is an open model trained on open data.

We appreciate any feedback on our choice of model, so we can take this into account if we generate indices for larger datasets in the future.

The embedding was executed on 4 T4 GPUs on Google Cloud using our Vertex AI runner, with a batch size of 32. The execution took 8 hours and 15 minutes.
## Terms and Conditions

Under no circumstances can Fondant be held liable by a third party for (i) the accuracy or correctness of the content, (ii) an alleged infringement of intellectual property rights or (iii) any other alleged claim, action, injunction or suit resulting from the publication or use of the dataset.

## Dataset Card Contact

- Email: [info@fondant.ai](mailto:info@fondant.ai)
- Discord: [https://discord.gg/HnTdWhydGp](https://discord.gg/HnTdWhydGp)

This dataset contains image URLs and their CLIP embeddings, based on the datacomp_small dataset and processed with Fondant. Its purpose is to make a large image dataset searchable: each image is embedded once with CLIP, and the embeddings are combined into an index that can be queried without inspecting every image. The repository contains the data files, a FAISS index, and an ID mapping. The dataset was created by loading datacomp_small from the Hugging Face Hub, downloading the images, embedding them with a CLIP model, and writing the URLs and embeddings to files; the FAISS index was then built with autofaiss. The card also documents the download and embedding runs, including concurrency settings, success rate, hardware, batch size, and run time.

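Provider:

fondant-ai

Summary of original information

Dataset Card for fondant-ai/datacomp-small-clip

Dataset overview

This is a dataset containing image URLs and their CLIP embeddings, based on the datacomp_small dataset and processed with fondant.

Dataset details

Dataset description

Large (image) datasets are often unwieldy to use due to their sheer size. For example, to extract all the cat images from such a dataset, we would have to look at every image and classify whether it shows a cat. If we then wanted to extract all the dog images, we would have to look at every image again.

Instead, we can look at every image once and calculate a (CLIP) embedding representing its content. Combining these embeddings into an index, we can efficiently search the dataset with a query and find specific images without having to look at each one.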
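
To make the idea of searching by embedding concrete, here is a minimal sketch in plain Python. It is not the FAISS index shipped with this dataset: the three-dimensional vectors, the example URLs, and the `search` helper are all made up for illustration, and the search is an exhaustive comparison, whereas a FAISS index typically trades some accuracy for speed with an approximate search.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query, k=2):
    """Return the k (url, score) pairs whose embeddings are closest to the query."""
    scored = [(url, cosine_similarity(vec, query)) for url, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical); CLIP ViT-B/32 embeddings
# have 512 dimensions.
toy_index = {
    "https://example.com/cat.jpg": [0.9, 0.1, 0.0],
    "https://example.com/dog.jpg": [0.1, 0.9, 0.0],
    "https://example.com/car.jpg": [0.0, 0.1, 0.9],
}

# Stands in for the CLIP embedding of a query such as "a cat".
cat_query = [1.0, 0.0, 0.0]
results = search(toy_index, cat_query, k=2)
```

Each image is embedded once; any number of queries can then be answered by comparing vectors instead of re-inspecting the images.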

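Dataset sources

Original data: datacomp_small
Repository: fondant-clip-index

Uses

We provide an example use case that uses the FAISS index of this dataset to create a dataset of interior design images, used for fine-tuning a ControlNet model.

Dataset structure

The data repository is structured as follows:

data/: the dataset containing ids, urls, and CLIP embeddings
faiss: the FAISS index
id_mapping/: the mapping of the FAISS ids to the original urls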
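
The two parquet folders are exposed as the configs `embeddings` and `id_mapping` in the card's metadata, so they can presumably be read with the Hugging Face `datasets` library. The sketch below assumes that library is installed and that each config exposes a single `train` split (the usual default for a plain `data_files` pattern). The helper names are hypothetical, and the column names are not stated here, so the code peeks at rows rather than assuming them.

```python
REPO_ID = "fondant-ai/datacomp-small-clip"
CONFIGS = ("embeddings", "id_mapping")  # config names from the card's metadata


def load_config(name, streaming=True):
    """Load one config of the dataset from the Hugging Face Hub.

    Streaming avoids downloading every parquet file up front.
    """
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the config exposes a single "train" split.
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


def first_rows(name, n=3):
    """Peek at the first n rows of a config to discover its column names."""
    rows = []
    for row in load_config(name):
        rows.append(row)
        if len(rows) == n:
            break
    return rows
```

Calling `first_rows("id_mapping")` requires network access. The `faiss` index itself is a separate file in the repository and is not covered by either config.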

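Dataset creation

We used Fondant to generate the CLIP index and published the pipeline as a git repository. The pipeline consists of the following steps:

Load the datacomp_small dataset from Hugging Face into the Fondant workspace and format.
Download the actual images from the urls in the dataset.
Embed the downloaded images using a CLIP model.
Write the original urls and generated embeddings to the chosen destination.

After running the pipeline, we used autofaiss to build the CLIP index.

Execution details

Download images

We downloaded the images with 32 cores in parallel, each opening up to 25 concurrent connections, and achieved a success rate of 72%, resulting in 9,251,172 images. The download was executed on a VM on GCP using the Fondant Docker runner.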
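
With 32 workers each holding up to 25 connections, at most 32 × 25 = 800 downloads are in flight at once. The sketch below illustrates the per-worker part of that pattern, a cap on concurrent connections plus a success-rate count, using `asyncio`. It is not the code of Fondant's `download_images` component: `download_all` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in so the example runs without network access.

```python
import asyncio


async def download_all(urls, fetch, max_connections=25):
    """Fetch all urls with at most `max_connections` requests in flight.

    Returns (results, success_rate), where results maps url -> payload
    for the downloads that succeeded.
    """
    semaphore = asyncio.Semaphore(max_connections)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception:
                return url, None  # dead link, timeout, and so on

    pairs = await asyncio.gather(*(guarded(url) for url in urls))
    results = {url: data for url, data in pairs if data is not None}
    success_rate = len(results) / len(urls) if urls else 0.0
    return results, success_rate


# Stand-in for a real HTTP client; it also records peak concurrency.
in_flight = 0
peak_in_flight = 0


async def fake_fetch(url):
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    try:
        await asyncio.sleep(0)  # yield control, as real network I/O would
        if url.endswith("/dead"):
            raise ConnectionError(url)
        return b"image-bytes"
    finally:
        in_flight -= 1


urls = [f"https://example.com/{i}" for i in range(90)]
urls += [f"https://example.com/{i}/dead" for i in range(10)]
results, rate = asyncio.run(download_all(urls, fake_fetch, max_connections=25))
```

In a real run, `fetch` would wrap an HTTP client call, and the same coroutine would be started once per worker process.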

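Embed images

We used the laion/CLIP-ViT-B-32-laion2B-s34B-b79K CLIP model. We chose this model because it is popular, small, and an open model trained on open data. The embedding was executed on 4 T4 GPUs with a batch size of 32 and took 8 hours and 15 minutes.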
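
The reported figures imply a rough throughput for the embedding step. The calculation below assumes that all 9,251,172 downloaded images were embedded and that the 8 hours and 15 minutes cover the whole step, so the result is an end-to-end average rather than a benchmark of the model.

```python
images = 9_251_172               # successfully downloaded images
duration_s = 8 * 3600 + 15 * 60  # 8 h 15 min = 29,700 s
gpus = 4
batch_size = 32

total_rate = images / duration_s                      # images/s across all GPUs
per_gpu_rate = total_rate / gpus                      # images/s per T4
batches_per_gpu_per_s = per_gpu_rate / batch_size     # batches/s per T4

print(f"{total_rate:.0f} images/s total, {per_gpu_rate:.0f} images/s per GPU")
# prints "311 images/s total, 78 images/s per GPU"
```

That works out to roughly 2.4 batches of 32 images per second on each GPU.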

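Terms and conditions

Under no circumstances can Fondant be held liable by a third party for (i) the accuracy or correctness of the content, (ii) an alleged infringement of intellectual property rights, or (iii) any other alleged claim, action, injunction, or suit resulting from the publication or use of the dataset.

Dataset card contact

Email: info@fondant.ai
Discord: https://discord.gg/HnTdWhydGp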
