LAION-5B: A Large-Scale Image-Text Dataset

Name: LAION-5B: A Large-Scale Image-Text Dataset
Creator: Payititi
License: No description available

Indexed by Payititi on 2024-03-04

Download link:

https://www.payititi.com/opendatasets/show-26684.html

Description:

1. Overview of LAION-5B

LAION-5B consists of 5.85 billion image-text pairs filtered using CLIP: 2.32 billion pairs with English text, 2.26 billion pairs with text in more than 100 other languages, and 1.27 billion pairs whose text cannot be assigned to a specific language (e.g., proper nouns). In a statement released with the dataset, the LAION research team noted that although image-text models trained on billions of pairs have shown strong performance, training datasets of this scale are typically not publicly available.

To build the dataset, LAION parses Common Crawl files, selects candidate image-text pairs, and computes the similarity between each image and its text with OpenAI's CLIP. Pairs whose similarity falls below a preset threshold (0.28 for English pairs, 0.26 for all others) are discarded. Overly short text, excessively high-resolution images, duplicates, and illegal content are also removed as far as possible. Of roughly 50 billion candidate images, fewer than 6 billion were retained, yielding the final 5.85 billion image-text pairs.

LAION-5B is the largest publicly available dataset of its kind to date. It makes it possible to train, with good results, many multimodal models that had not previously been openly available, and the release included the first open-source CLIP model. The dataset spans many domains and image types, which opens further research directions such as data overlap, image noise, filtering of inappropriate content, low-resource languages, the role of natural language in multimodal learning, and model bias.
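The similarity filter described above can be illustrated with a minimal sketch. It assumes image and text embeddings have already been produced by a CLIP model and are passed in as plain lists of floats. The function names, the record layout, and the `"en"` language tag are hypothetical choices for illustration, not LAION's actual pipeline code. Only the two thresholds come from the text.

```python
import math

# Thresholds quoted in the description: 0.28 for English pairs, 0.26 for all others.
ENGLISH_THRESHOLD = 0.28
OTHER_THRESHOLD = 0.26


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(image_emb, text_emb, language):
    """Return True if the pair's similarity reaches the threshold for its language."""
    threshold = ENGLISH_THRESHOLD if language == "en" else OTHER_THRESHOLD
    return cosine_similarity(image_emb, text_emb) >= threshold


def filter_pairs(pairs):
    """Keep records (dicts with hypothetical keys) whose similarity passes the threshold."""
    return [
        p for p in pairs
        if keep_pair(p["image_emb"], p["text_emb"], p["language"])
    ]
```

Under these assumptions, a pair with similarity 0.27 would be dropped if its text is English and kept otherwise, which is the asymmetry the two thresholds introduce.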
Applying LAION-5B directly in industrial settings requires careful image cleaning, because the dataset contains watermarked and inappropriate images that can bias trained models.

2. Composition of LAION-5B Data

1. laion2B-en: 2.32 billion pairs with English text
2. laion2B-multi: 2.26 billion pairs with text in more than 100 other languages
3. laion1B-nolang: 1.27 billion pairs whose text has no clearly detectable language

LAION provides large-scale image-text data suitable for most multimodal and computer vision (CV) work. Multimodal uses include large-scale pre-training, image-text matching, image generation (including image inpainting and editing), and text generation (image-to-text generation, VQA, etc.); CV uses include classification. LAION also provides models trained on its datasets for reference.

Tasks include but are not limited to: multimodal pre-training, image-text matching, image-text retrieval.
CLIP uses contrastive learning to embed images and text in the same space, a major advance in image-text multimodal learning, and is used for image-text matching and retrieval, zero-shot classification, and related areas. OpenAI did not release CLIP's training data, so LAION retrained CLIP models on LAION-400M and LAION-2B, reaching accuracy comparable to the OpenAI version.

● Image generation
Tasks include but are not limited to: high-resolution image generation, image inpainting/editing, text-to-image generation, conditional image generation.
LAION provides subsets with inappropriate and watermarked images filtered out, which further supports image generation. Many generative models can be trained on LAION subsets, whether autoregressive models in the style of DALL-E or diffusion models in the style of GLIDE. Several examples:
- Stable Diffusion uses a subset of LAION-5B and reconstructs images in a compressed latent space; it can generate megapixel high-resolution images and is used for image inpainting, image generation, and similar tasks.
- VQ-Diffusion uses a vector-quantized variational autoencoder and is trained on LAION-400M for text-to-image generation, achieving higher image quality.
- Imagen[15] is trained on data that includes LAION-400M; it uses a powerful language model to extract text features that guide the generation of high-quality images matching the text, outperforming DALL-E 2[20] to reach state-of-the-art (SOTA) results.
- Images from a specific domain can also be selected for generation, for example faces (FaRL).

● Text generation
Tasks include but are not limited to: image-to-text generation, VQA, Visual Entailment.
- BLIP is trained on a 115M-pair subset of LAION-400M and then ranks candidate captions with CLIP; in evaluation it outperforms other models on caption generation and image-text matching.
- MAGMA[19] is trained on a LAION subset and uses adapter-based fine-tuning to strengthen a language model's generation, producing answers to visual questions. It achieves good results with only 0.2% of the data volume used by SimVLM.

These resources support zero-shot use, fine-tuning, and training from scratch. With subsets obtained through web search or the officially provided subsets, users can build classifiers and recognizers for tasks such as watermark detection, pornographic content detection, and facial feature learning. The large-scale pre-trained models can also be applied to downstream tasks zero-shot or after fine-tuning.

Figure 5: Zero-shot performance on downstream datasets of CLIP models trained on WIT (OpenAI's official model), LAION-400M, and LAION-2B-en; the LAION-trained models perform strongly.
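Zero-shot classification with a CLIP-style model, as compared in Figure 5, can be sketched as follows. The sketch assumes the image embedding and one text embedding per class label (for instance from a prompt such as "a photo of a cat") have already been computed by the model. The function names and the `scale` factor of 100 are illustrative assumptions, not taken from the source or from any particular library.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, class_text_embs, scale=100.0):
    """Pick the label whose text embedding is most similar to the image embedding.

    class_text_embs maps each label to its (precomputed) text embedding.
    scale is an assumed temperature-like factor applied to cosine similarities.
    Returns (best_label, {label: probability}).
    """
    image = normalize(image_emb)
    labels = list(class_text_embs)
    logits = []
    for label in labels:
        text = normalize(class_text_embs[label])
        logits.append(scale * sum(i * t for i, t in zip(image, text)))
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], dict(zip(labels, probs))
```

No class-specific training is involved: the label set can be changed at inference time simply by supplying different text embeddings, which is what makes the approach zero-shot.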
LAION's breadth also lets users pick out targeted data for other work, for example filtering specific languages from laion2B-multi for low-resource language tasks, or studying how data overlap affects models and model bias. Researchers with abundant GPU resources can train at scale on the full dataset or on subsets. Those with limited resources can still use LAION's pre-trained models for zero-shot and fine-tuning research, or treat the dataset as an image pool from which to retrieve the images they need.

Training on the full dataset or on subsets covers multimodal and vision tasks but usually demands substantial resources.
● Full dataset: 5.85 billion image-text pairs, filtered with CLIP, still containing a small amount of noise and inappropriate data.
● Subsets: see the subsets listed in Section 2.1, including but not limited to subsets without inappropriate images, watermark-free subsets, super-resolution subsets, and aesthetic subsets.
● Custom subsets: if no provided subset fits, users can search the web retrieval page for suitable data, download it, and build their own image subset for training, choosing an image resolution suited to training as well. The advantage is that images can be selected for a custom scenario.

Engineers with limited resources can select the data they need from LAION-5B and train starting from the pre-trained models LAION provides.
● Data: train at small scale on part of LAION-5B, for example custom-scenario images found through the web retrieval interface, or images chosen by watermark presence, high resolution, or high aesthetic score.
● Models: apply the pre-trained models LAION provides to downstream tasks via zero-shot, few-shot, or fine-tuning.
- Zero-shot/few-shot: large-scale pre-trained open-source models such as CLIP and BLIP are available and perform well. CLIP models trained on LAION perform comparably to the original OpenAI models; see Figure 6 for CLIP trained on LAION-400M.
- Fine-tuning: a fine-tuning method is provided for reference at https://github.com/mlfoundations/wise-ft; conventional fine-tuning also works.

Figure 6: CLIP trained on LAION-400M, evaluated on ImageNet, ImageNetV2, Birdsnap, Country211, Flowers102, GTSRB, Stanford Cars, UCF101, and other datasets, performs comparably to OpenAI's CLIP. Data source: https://github.com/mlfoundations/open_clip
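The data-side options above (watermark-free, high-resolution, or high-aesthetic-score images) amount to filtering per-pair metadata before download or training. The following is a minimal sketch under stated assumptions: the `PairMetadata` fields, the `select_subset` function, and all default cut-offs are hypothetical and chosen only for illustration. They are not the actual LAION metadata schema or recommended values.

```python
from dataclasses import dataclass


@dataclass
class PairMetadata:
    """Hypothetical per-pair metadata record; field names are illustrative."""
    url: str
    caption: str
    width: int
    height: int
    watermark_prob: float   # assumed probability that the image is watermarked
    unsafe_prob: float      # assumed probability that the image is inappropriate
    aesthetic_score: float  # assumed predicted aesthetic score


def select_subset(records, min_side=512, max_watermark=0.5,
                  max_unsafe=0.1, min_aesthetic=5.0):
    """Return the records that satisfy every cut-off (all cut-offs are assumptions)."""
    selected = []
    for r in records:
        if min(r.width, r.height) < min_side:
            continue  # too small for the target training resolution
        if r.watermark_prob > max_watermark:
            continue  # likely watermarked
        if r.unsafe_prob > max_unsafe:
            continue  # likely inappropriate
        if r.aesthetic_score < min_aesthetic:
            continue  # below the aesthetic cut-off
        selected.append(r)
    return selected
```

Tightening or loosening the cut-offs trades subset size against cleanliness, so in practice they would be tuned to the target task and the available compute.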

Provider:

Payititi

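Collected Summary

Dataset Introduction

Background and Challenges

Background Overview

LAION-5B is a large-scale multimodal dataset of 5.85 billion image-text pairs, filtered and cleaned with a CLIP model and covering many languages and domains. It is suited to multimodal pre-training, image generation, text generation, and classification tasks, and is one of the largest public image-text datasets to date.

The content above was collected and summarized by 遇见数据集.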