datacomp_large_vie_filtered2
Source: Hugging Face | Updated: 2026-03-16 | Indexed: 2026-03-20
Download link:
https://huggingface.co/datasets/minhnguyent546/datacomp_large_vie_filtered2
Description:
This dataset contains 6,793,921 training samples. The original data went through multi-step filtering: starting from 9,451,518 samples, images smaller than 200 pixels were removed first, followed by images with an aspect ratio of 3 or greater. Each sample has the following fields: a unique identifier (uid), image URL (url), text content (text), original image dimensions (original_width, original_height), similarity scores from two CLIP models (clip_b32_similarity_score, clip_l14_similarity_score), face bounding box coordinates (face_bboxes), SHA256 hash (sha256), language identifier (lang) with its confidence score (lang_score), multilingual CLIP score (mclip_score), and a key (key). The mclip_score is computed with the clip-ViT-B-32-multilingual-v1 model. The total dataset size is 2.58 GB, and the download size is 1.81 GB.
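The two filtering steps in the description can be sketched in a few lines of Python. This is only an illustration, not the dataset author's actual pipeline: the description does not say which side the 200-pixel threshold applies to or how aspect ratio is measured, so the sketch assumes the threshold applies to the shorter side and that aspect ratio means longer side divided by shorter side. The helper names `aspect_ratio` and `passes_filters` are hypothetical.

```python
def aspect_ratio(width: int, height: int) -> float:
    """Ratio of the longer side to the shorter side (always >= 1)."""
    return max(width, height) / min(width, height)


def passes_filters(width: int, height: int,
                   min_side: int = 200, max_aspect: float = 3.0) -> bool:
    """Return True if an image would survive both filtering steps.

    Assumptions (not confirmed by the dataset description):
    - "smaller than 200 pixels" refers to the shorter side;
    - aspect ratio is the longer side divided by the shorter side.
    """
    if width <= 0 or height <= 0:
        return False  # treat missing or invalid dimensions as filtered out
    if min(width, height) < min_side:
        return False  # step 1: drop small images
    return aspect_ratio(width, height) < max_aspect  # step 2: drop ratio >= 3


# Toy rows that reuse the dataset's dimension field names.
rows = [
    {"original_width": 640, "original_height": 480},  # kept
    {"original_width": 150, "original_height": 400},  # dropped: shorter side < 200
    {"original_width": 900, "original_height": 300},  # dropped: ratio is exactly 3
    {"original_width": 200, "original_height": 599},  # kept: ratio just under 3
]
kept = [r for r in rows
        if passes_filters(r["original_width"], r["original_height"])]
```

Because the published dataset is already filtered, a check like this is mainly useful for verifying the stated thresholds against the original_width and original_height columns, or for applying stricter thresholds of your own.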
Created:
2026-03-11



