
minhnguyent546/datacomp_large_vie

Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/minhnguyent546/datacomp_large_vie
Description:
---
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: url
    dtype: string
  - name: text
    dtype: string
  - name: original_width
    dtype: int64
  - name: original_height
    dtype: int64
  - name: clip_b32_similarity_score
    dtype: float32
  - name: clip_l14_similarity_score
    dtype: float32
  - name: face_bboxes
    list:
      list: float64
  - name: sha256
    dtype: string
  - name: lang
    dtype: string
  - name: lang_score
    dtype: float32
  - name: mclip_score
    dtype: float64
  - name: key
    dtype: string
  splits:
  - name: train
    num_bytes: 2580257090
    num_examples: 6793921
  download_size: 1813967425
  dataset_size: 2580257090
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc-by-4.0
language:
- vi
---

# Dataset Overview

This is the filtered Vietnamese subset of the [DataComp Large Pool](https://huggingface.co/datasets/mlfoundations/datacomp_large). The table below shows the processing steps applied to achieve this subset.

| Processing step | # Rows | % |
| --- | ----: | ---: |
| [DataComp Large Pool](https://huggingface.co/datasets/mlfoundations/datacomp_large) | 1,280,000,000 | - |
| Filtered for Vietnamese text using the [fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification) model (score threshold > 0.7) | 26,537,765 | 100% |
| Removed samples with captions <= 10 characters or <= 3 words | 9,451,518 | 35.6% |
| Removed images with a smaller dimension below 200 pixels | 6,817,062 | 25.7% |
| Removed images with an aspect ratio >= 3 | 6,793,921 | 25.6% |
| Number of samples with an mclip_score | 3,950,377 | 14.9% |

The following table shows the percentiles based on the `mclip_score` computed using [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).

| Percentile | mclip_score |
| :---: | ---: |
| 5th | 0.20242923 |
| 10th | 0.21163453 |
| 15th | 0.21813008 |
| 20th | 0.22349254 |
| 25th | 0.22828135 |
| 30th | 0.23274314 |
| 40th | 0.24120911 |
| 50th | 0.24967791 |
| 60th | 0.25872927 |
| 75th | 0.27487311 |
| 85th | 0.28946092 |
| 90th | 0.29953519 |
| 95th | 0.31465694 |
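
The row-level filters in the processing table, plus a cut on `mclip_score`, can be sketched as plain Python predicates. This is a minimal illustration and not the author's pipeline. The function names `passes_card_filters` and `above_mclip` are hypothetical, words are assumed to be whitespace-separated, and the aspect ratio is assumed to be the longer side divided by the shorter side. The language label check is left out because the card does not state the format of the `lang` column. The example rows are made up for demonstration.

```python
from typing import Optional, TypedDict


class Row(TypedDict):
    # Subset of the dataset columns used by the filters below.
    text: str
    original_width: int
    original_height: int
    lang_score: float
    mclip_score: Optional[float]  # only some samples carry a score


# 50th percentile of mclip_score, copied from the percentile table.
MCLIP_MEDIAN = 0.24967791


def passes_card_filters(row: Row) -> bool:
    """Re-apply the thresholds listed in the processing table (hypothetical helper)."""
    text = row["text"]
    # Language-ID score must exceed 0.7.
    if row["lang_score"] <= 0.7:
        return False
    # Drop captions with <= 10 characters or <= 3 words
    # (assumption: words are split on whitespace).
    if len(text) <= 10 or len(text.split()) <= 3:
        return False
    short, long = sorted((row["original_width"], row["original_height"]))
    # Drop images whose smaller dimension is below 200 pixels.
    if short < 200:
        return False
    # Drop images with aspect ratio >= 3
    # (assumption: ratio = longer side / shorter side).
    if long / short >= 3:
        return False
    return True


def above_mclip(row: Row, threshold: float = MCLIP_MEDIAN) -> bool:
    """Keep rows that have an mclip_score at or above the threshold."""
    score = row["mclip_score"]
    return score is not None and score >= threshold


# Made-up rows for illustration only; they are not taken from the dataset.
rows: list[Row] = [
    {"text": "Một con mèo đang ngủ trên ghế sofa", "original_width": 640,
     "original_height": 480, "lang_score": 0.95, "mclip_score": 0.26},
    {"text": "Mèo con", "original_width": 640,
     "original_height": 480, "lang_score": 0.95, "mclip_score": 0.30},
    {"text": "Một con mèo đang ngủ trên ghế sofa", "original_width": 150,
     "original_height": 400, "lang_score": 0.95, "mclip_score": 0.30},
    {"text": "Một con mèo đang ngủ trên ghế sofa", "original_width": 1200,
     "original_height": 300, "lang_score": 0.95, "mclip_score": 0.30},
    {"text": "Một con mèo đang ngủ trên ghế sofa", "original_width": 640,
     "original_height": 480, "lang_score": 0.95, "mclip_score": None},
]

kept = [r for r in rows if passes_card_filters(r)]
high_quality = [r for r in kept if above_mclip(r)]
```

Since the published split is already filtered, `passes_card_filters` mainly serves as a sanity check or as a template for stricter thresholds. The same predicates could be applied to records streamed with the Hugging Face `datasets` library, for example from `load_dataset("minhnguyent546/datacomp_large_vie", split="train", streaming=True)`, assuming the column names match the `dataset_info` block above.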
Provided by:
minhnguyent546