dalle-mini/YFCC100M_OpenAI_subset

Name: dalle-mini/YFCC100M_OpenAI_subset
Creator: dalle-mini
Published: 2021-08-26 17:56:01
License: not specified

Source: Hugging Face. Updated 2021-08-26, indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/dalle-mini/YFCC100M_OpenAI_subset


Overview:

This dataset is a subset of YFCC100M: the portion OpenAI used for the CLIP project, filtered to keep only the images that could still be retrieved. It is divided into a training split with 14,808,859 samples (1.9 TB) and a validation split with 16,374 samples (2.1 GB). Each record keeps the fields of the original dataset (such as title, description, and photo ID) and adds new ones: the image content and cleaned versions of the title and description.

Provider:

dalle-mini

Original information summary

YFCC100M Subset from OpenAI

Dataset overview

Purpose: used for OpenAI's CLIP project.
Filter: contains only images that could be retrieved.

Dataset structure

Split	Training	Validation
Number of samples	14,808,859	16,374
Size	1.9 TB	2.1 GB
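The figures in the table above allow a rough storage estimate for a partial download. The following is a minimal sketch, assuming decimal units (1 TB = 10**12 bytes) and treating the split sizes as if they were spread evenly over the samples; the function names are hypothetical and only serve the illustration.

```python
# Sample counts and sizes are taken from the split table.
# Assumption: decimal units, i.e. 1 TB = 10**12 bytes and 1 GB = 10**9 bytes.
SPLITS = {
    "train": {"samples": 14_808_859, "bytes": 1.9e12},
    "validation": {"samples": 16_374, "bytes": 2.1e9},
}


def average_sample_kb(split):
    """Average stored size of one sample in kilobytes."""
    info = SPLITS[split]
    return info["bytes"] / info["samples"] / 1_000


def estimated_gb(split, n_samples):
    """Rough size in gigabytes of n_samples records from the given split."""
    return average_sample_kb(split) * n_samples / 1_000_000


for name in SPLITS:
    print(f"{name}: ~{average_sample_kb(name):.0f} KB per sample")
print(f"100,000 training samples: ~{estimated_gb('train', 100_000):.1f} GB")
```

Under these assumptions both splits come out at roughly 128 KB per sample, so a 100,000-sample slice of the training split would need on the order of 13 GB.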

Dataset features

Original dataset features: title, description, photoid, uid, unickname, datetaken, dateuploaded, capturedevice, usertags, machinetags, longitude, latitude, accuracy, pageurl, downloadurl, licensename, licenseurl, serverid, farmid, secret, secretoriginal, ext, marker, key
Additional features:
- img: the image content, which can be loaded with PIL.Image.open(io.BytesIO(item['img'])).
- title_clean and description_clean: derived from title and description by cleaning them with the clean_text function.
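Decoding the img field can be sketched as follows. This is an illustration rather than code from the dataset card: it assumes a record is a dict whose img value holds the encoded image bytes, it requires the third-party Pillow package, and both load_image and make_fake_item are hypothetical helpers. The fake record stands in for a real one so the snippet runs without downloading anything.

```python
import io

from PIL import Image  # Pillow, a third-party package


def load_image(item):
    """Decode the raw bytes stored in item['img'] into a PIL image."""
    image = Image.open(io.BytesIO(item["img"]))
    image.load()  # Image.open is lazy, so force decoding here
    return image


def make_fake_item(width=4, height=3):
    """Build a stand-in record (hypothetical, for illustration only)."""
    buffer = io.BytesIO()
    Image.new("RGB", (width, height), color=(255, 0, 0)).save(buffer, format="PNG")
    return {"img": buffer.getvalue(), "title_clean": "a red rectangle"}


item = make_fake_item()
image = load_image(item)
print(image.size, image.mode)  # (4, 3) RGB
```

Calling image.load() right after opening is a design choice: it surfaces corrupt or truncated images at the point of loading instead of later in a training loop.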

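Text cleaning function

```python
import re
import urllib.parse


def clean_text(text):
    # Decode URL-encoded text
    text = urllib.parse.unquote_plus(text)
    # Remove HTML tags
    text = re.sub('<[^<]+?>', '', text)
    # Collapse runs of whitespace (spaces, \r, \n, \t) into single spaces
    text = " ".join(text.split())
    return text
```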
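To make the effect of the three cleaning steps concrete, here is a small self-contained run on a made-up title. The input string is invented for illustration and is not taken from the dataset. The function is restated so the snippet runs on its own; the quoting of the regular expression and the empty-string replacement are assumptions, since the listing above lost its quotation marks.

```python
import re
import urllib.parse


def clean_text(text):
    # Step 1: decode URL encoding ('+' becomes a space, %XX becomes a character)
    text = urllib.parse.unquote_plus(text)
    # Step 2: strip HTML tags (assumed pattern and empty replacement)
    text = re.sub('<[^<]+?>', '', text)
    # Step 3: collapse all whitespace runs into single spaces
    text = " ".join(text.split())
    return text


# Hypothetical raw title: URL-encoded, with HTML tags and a CRLF line break
raw_title = "Sunset+at+the+%3Cb%3Ebeach%3C%2Fb%3E%0D%0A++Day+1"
print(clean_text(raw_title))  # Sunset at the beach Day 1
```

The order of the steps matters: the tags only become visible to the regular expression after URL decoding, and the whitespace collapse runs last so that gaps left by removed tags and line breaks are normalized together.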
