minhnguyent546/datacomp_large_vie_filtered2
Hugging Face · updated 2026-03-18 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/minhnguyent546/datacomp_large_vie_filtered2
Description:
---
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: url
    dtype: string
  - name: text
    dtype: string
  - name: original_width
    dtype: int64
  - name: original_height
    dtype: int64
  - name: clip_b32_similarity_score
    dtype: float32
  - name: clip_l14_similarity_score
    dtype: float32
  - name: face_bboxes
    list:
      list: float64
  - name: sha256
    dtype: string
  - name: lang
    dtype: string
  - name: lang_score
    dtype: float32
  - name: mclip_score
    dtype: float64
  - name: key
    dtype: string
  splits:
  - name: train
    num_bytes: 2580257090
    num_examples: 6793921
  download_size: 1813967425
  dataset_size: 2580257090
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
| Process step | # Samples (remaining) | % of original |
| --- | ----: | ---: |
| <em>[Original filtered split](https://huggingface.co/datasets/minhnguyent546/datacomp_large_vie_filtered)</em> | 9,451,518 | 100% |
| Remove images whose smaller dimension is below 200 px | 6,817,062 | 72.13% |
| Remove images with aspect ratio >= 3 | 6,793,921 | 71.88% |
| Rows that have an `mclip_score` | 3,950,377 | 41.80% |
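
The two image-size filters in the table can be sketched as a single row predicate. This is a minimal illustration, not the author's actual script: it assumes that the aspect ratio is the longer side divided by the shorter side, and that the dimensions come from the `original_width` and `original_height` columns. The names `MIN_SIDE`, `MAX_ASPECT_RATIO`, and `keep_image` are hypothetical.

```python
from typing import Mapping

MIN_SIDE = 200        # table: images whose smaller dimension is below 200 are removed
MAX_ASPECT_RATIO = 3  # table: images with aspect ratio >= 3 are removed


def keep_image(row: Mapping[str, int]) -> bool:
    """Return True if a row passes both size filters from the table."""
    width = row["original_width"]
    height = row["original_height"]
    shorter, longer = min(width, height), max(width, height)

    # Step 1: drop images whose smaller dimension is below the minimum.
    if shorter < MIN_SIDE:
        return False

    # Step 2: drop very elongated images.
    # Assumption: aspect ratio means longer side / shorter side.
    return longer / shorter < MAX_ASPECT_RATIO
```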

| Percentile | mclip_score |
| :---: | ---: |
| 5th | 0.20242923 |
| 10th | 0.21163453 |
| 15th | 0.21813008 |
| 20th | 0.22349254 |
| 25th | 0.22828135 |
| 30th | 0.23274314 |
| 40th | 0.24120911 |
| 50th | 0.24967791 |
| 60th | 0.25872927 |
| 75th | 0.27487311 |
| 85th | 0.28946092 |
| 90th | 0.29953519 |
| 95th | 0.31465694 |

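
One way to use these percentiles is as score cutoffs when subsetting the data. The sketch below copies the thresholds from the table and keeps only rows at or above a chosen percentile. It assumes that rows without a score carry a missing or null `mclip_score`, which the card does not state; `MCLIP_PERCENTILES` and `filter_by_mclip_percentile` are hypothetical names.

```python
from typing import Any, Iterable, Iterator, Mapping

# Thresholds copied from the percentile table above.
MCLIP_PERCENTILES = {
    5: 0.20242923, 10: 0.21163453, 15: 0.21813008, 20: 0.22349254,
    25: 0.22828135, 30: 0.23274314, 40: 0.24120911, 50: 0.24967791,
    60: 0.25872927, 75: 0.27487311, 85: 0.28946092, 90: 0.29953519,
    95: 0.31465694,
}


def filter_by_mclip_percentile(
    rows: Iterable[Mapping[str, Any]], percentile: int
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose mclip_score is at or above the given percentile."""
    threshold = MCLIP_PERCENTILES[percentile]  # KeyError if not in the table
    for row in rows:
        score = row.get("mclip_score")
        # Assumption: unscored rows hold None (or lack the key entirely).
        # A NaN score also fails the comparison and is skipped.
        if score is not None and score >= threshold:
            yield row


rows = [
    {"key": "a", "mclip_score": 0.30},
    {"key": "b", "mclip_score": None},
    {"key": "c", "mclip_score": 0.21},
]
top_half = list(filter_by_mclip_percentile(rows, 50))  # keeps only row "a"
```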
**Notes:**
- `mclip_score` is computed using [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).
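
The card does not say how the score is derived from the model's output. CLIP-style scores are commonly the cosine similarity between an image embedding and a text embedding, so the sketch below assumes that definition. It also assumes the two embeddings were already produced, for example by this multilingual text encoder and the matching `clip-ViT-B-32` image encoder it is aligned with. The function `cosine_similarity` is a hypothetical helper, not part of any library used by the author.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 2-d vectors standing in for real image and text embeddings.
image_embedding = [3.0, 4.0]
text_embedding = [6.0, 8.0]
score = cosine_similarity(image_embedding, text_embedding)
```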
Provider:
minhnguyent546



