minhnguyent546/datacomp_large_vie
Source: Hugging Face · Updated 2026-04-04 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/minhnguyent546/datacomp_large_vie
Resource description:
---
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: url
    dtype: string
  - name: text
    dtype: string
  - name: original_width
    dtype: int64
  - name: original_height
    dtype: int64
  - name: clip_b32_similarity_score
    dtype: float32
  - name: clip_l14_similarity_score
    dtype: float32
  - name: face_bboxes
    list:
      list: float64
  - name: sha256
    dtype: string
  - name: lang
    dtype: string
  - name: lang_score
    dtype: float32
  - name: mclip_score
    dtype: float64
  - name: key
    dtype: string
  splits:
  - name: train
    num_bytes: 2580257090
    num_examples: 6793921
  download_size: 1813967425
  dataset_size: 2580257090
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
language:
- vi
---
# Dataset Overview
This is the filtered Vietnamese subset of the [DataComp Large Pool](https://huggingface.co/datasets/mlfoundations/datacomp_large).
The table below lists the processing steps applied to obtain this subset. Each row reports the number of rows remaining after that step, with the percentage given relative to the Vietnamese-filtered set. The last row counts the samples that have an `mclip_score`.

| Processing step | # Rows remaining | % of Vietnamese-filtered rows |
| --- | ----: | ---: |
| [DataComp Large Pool](https://huggingface.co/datasets/mlfoundations/datacomp_large) | 1,280,000,000 | - |
| Filtered for Vietnamese text using the [fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification) model (score threshold > 0.7) | 26,537,765 | 100% |
| Removed samples with captions <= 10 characters or <= 3 words | 9,451,518 | 35.6% |
| Removed images whose smaller dimension is below 200 pixels | 6,817,062 | 25.7% |
| Removed images with an aspect ratio >= 3 | 6,793,921 | 25.6% |
| Number of samples with an `mclip_score` | 3,950,377 | 14.9% |
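
The caption, size, and aspect-ratio filters in the table can be sketched as a single predicate. This is a minimal illustration, not the author's actual pipeline: the function name `keep_sample`, whitespace-based word counting, and defining the aspect ratio as longer side over shorter side are all assumptions, and the sample rows are made up.

```python
def keep_sample(text: str, width: int, height: int) -> bool:
    """Return True if a sample passes the three structural filters from the table."""
    # Caption filter: drop captions with <= 10 characters or <= 3 words.
    # Assumption: words are counted by splitting on whitespace.
    if len(text) <= 10 or len(text.split()) <= 3:
        return False

    shorter, longer = min(width, height), max(width, height)

    # Size filter: drop images whose smaller dimension is below 200 pixels.
    if shorter < 200:
        return False

    # Aspect-ratio filter: drop images with a ratio >= 3.
    # Assumption: the ratio is the longer side divided by the shorter side.
    if longer / shorter >= 3:
        return False

    return True


# Made-up rows using the dataset's column names.
rows = [
    {"text": "Một bức ảnh chụp phố cổ Hà Nội", "original_width": 640, "original_height": 480},
    {"text": "ảnh đẹp", "original_width": 640, "original_height": 480},
    {"text": "Một bức ảnh chụp phố cổ Hà Nội", "original_width": 150, "original_height": 480},
    {"text": "Một bức ảnh chụp phố cổ Hà Nội", "original_width": 1200, "original_height": 300},
]
kept = [keep_sample(r["text"], r["original_width"], r["original_height"]) for r in rows]
print(kept)  # [True, False, False, False]
```

Only the first row survives: the second has too short a caption, the third is too small, and the fourth is too elongated.
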

The following table shows percentiles of the `mclip_score`, which is computed using [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).

| Percentile | mclip_score |
| :---: | ---: |
| 5th | 0.20242923 |
| 10th | 0.21163453 |
| 15th | 0.21813008 |
| 20th | 0.22349254 |
| 25th | 0.22828135 |
| 30th | 0.23274314 |
| 40th | 0.24120911 |
| 50th | 0.24967791 |
| 60th | 0.25872927 |
| 75th | 0.27487311 |
| 85th | 0.28946092 |
| 90th | 0.29953519 |
| 95th | 0.31465694 |
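
One possible use of these percentiles is as cut-offs for further quality filtering. The sketch below keeps only samples whose `mclip_score` is at or above a chosen percentile. The helper name `passes_mclip`, the representation of missing scores as `None`, and the decision to drop unscored samples are assumptions of this example, not something the dataset card prescribes.

```python
from typing import Optional

# Percentile -> mclip_score threshold, copied from the table above.
MCLIP_PERCENTILES = {
    5: 0.20242923, 10: 0.21163453, 15: 0.21813008, 20: 0.22349254,
    25: 0.22828135, 30: 0.23274314, 40: 0.24120911, 50: 0.24967791,
    60: 0.25872927, 75: 0.27487311, 85: 0.28946092, 90: 0.29953519,
    95: 0.31465694,
}


def passes_mclip(score: Optional[float], percentile: int) -> bool:
    """Return True if the score is at or above the given percentile threshold."""
    # Not every sample has an mclip_score; this sketch drops unscored samples.
    if score is None:
        return False
    return score >= MCLIP_PERCENTILES[percentile]


# Made-up scores, including one missing value.
scores = [0.19, 0.25, None, 0.31]
kept = [s for s in scores if passes_mclip(s, 50)]
print(kept)  # [0.25, 0.31]
```

The same predicate could be applied to the `mclip_score` column of rows loaded with the `datasets` library, for example through a filter over the `train` split.
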

Provider:
minhnguyent546



