LAION-5B

OpenDataLab | Updated 2026-04-05 | Indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/LAION-5B


Resource description:

LAION-5B is a large-scale image-text dataset intended for research purposes. It consists of 5.85 billion CLIP-filtered image-text pairs: 2.32 billion English pairs, 2.26 billion pairs from more than 100 other languages, and 1.27 billion pairs whose language is unknown. The dataset publishers also provide several nearest-neighbor indexes, an improved web interface for exploration and subset creation, and watermark and NSFW detection scores. The OpenDataLab website hosts the preprocessed Parquet metadata files; researchers can download them and then retrieve the corresponding image files based on that metadata. OpenDataLab has also open-sourced its LAION-5B image download code on GitHub: https://github.com/opendatalab/laion5b-downloader
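The metadata-then-images workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code in the laion5b-downloader repository. The column names (`URL`, `TEXT`, `pwatermark`, `punsafe`), the 0.5 thresholds, and the helper names `select_rows`, `target_path`, and `download_image` are all assumptions made for the example; check the actual Parquet schema before relying on them.

```python
import http.client
import os
import urllib.request
from urllib.parse import urlparse

# Assumed column names; verify them against the real Parquet schema.
URL_COL = "URL"
WATERMARK_COL = "pwatermark"  # assumed: watermark probability in [0, 1]
UNSAFE_COL = "punsafe"        # assumed: unsafe-content probability in [0, 1]


def select_rows(rows, max_watermark=0.5, max_unsafe=0.5):
    """Keep rows that have a string URL and scores at or below the thresholds."""
    kept = []
    for row in rows:
        url = row.get(URL_COL)
        if not isinstance(url, str) or not url:
            continue
        watermark = row.get(WATERMARK_COL)
        unsafe = row.get(UNSAFE_COL)
        # Rows with a missing score (None or NaN) are kept, not dropped.
        if watermark is not None and watermark > max_watermark:
            continue
        if unsafe is not None and unsafe > max_unsafe:
            continue
        kept.append(row)
    return kept


def target_path(out_dir, index, url):
    """Build a local file name from a running index and the URL's extension."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in {".jpg", ".jpeg", ".png", ".webp", ".gif"}:
        ext = ".jpg"  # fallback when the URL carries no recognizable extension
    return os.path.join(out_dir, f"{index:09d}{ext}")


def download_image(url, path, timeout=10):
    """Fetch one image; return True on success, False on a network or I/O error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = resp.read()
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(data)
        return True
    except (OSError, ValueError, http.client.HTTPException):
        return False
```

With pandas and a Parquet engine installed, the rows could be loaded with something like `pandas.read_parquet(path).to_dict("records")` and passed to `select_rows`, then each kept row downloaded with `download_image(row["URL"], target_path("images", i, row["URL"]))`. Many URLs in a web-crawled dataset are likely to be dead, so a boolean return is used here instead of raising on failure.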

Provided by:

OpenDataLab

Created:

2022-10-08

AI-compiled summary

Dataset introduction