Spravil/g400m

Name: Spravil/g400m
Creator: Spravil
Published: 2026-01-15 13:31:21
License: No description available

Hugging Face · Updated 2026-01-15 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Spravil/g400m

Description:

---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- de
- en
task_ids:
- image-captioning
pretty_name: G400M
size_categories:
- 100M<n<1B
source_datasets:
- mlfoundations/datacomp_xlarge
---

# Dataset Card for G400M

G400M is a German-language image-text dataset with 400M image-text pairs extracted from the [xlarge pool of DataComp](https://huggingface.co/datasets/mlfoundations/datacomp_xlarge). The data is filtered and balanced using the algorithm applied by [MetaCLIP](https://github.com/facebookresearch/MetaCLIP), namely:

1. Build a collection of 500k strings (the metadata) from the German Wikipedia.
2. Filter the data pool for German and English data using [fastText](https://github.com/facebookresearch/fastText).
3. Apply substring matching between the captions and the metadata.
4. Sample the image-text pairs using the MetaCLIP algorithm, with a (magic) target of 20k pairs per metadata entry.

We follow [DataComp](https://huggingface.co/datasets/mlfoundations/datacomp_xlarge) and distribute the image URL-text samples and metadata under a standard Creative Commons CC-BY-4.0 license. The individual images are under their own copyrights.
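To make the substring-matching and balancing steps concrete, here is a minimal sketch in Python. It is not the code used to build G400M. The function names (`match_entries`, `balance`), the toy captions, and the tiny threshold `t=2` are illustrative assumptions, and the sampling rule (keep a pair if any matched entry passes a coin flip with probability `min(1, t / count)`) is our reading of the MetaCLIP balancing idea rather than a verified copy of it.

```python
import random
from collections import Counter


def match_entries(caption, metadata):
    """Return indices of metadata entries that occur as substrings of the caption."""
    text = caption.lower()
    return [i for i, entry in enumerate(metadata) if entry.lower() in text]


def balance(captions, metadata, t, seed=0):
    """Keep captions so that frequent metadata entries are down-sampled toward t.

    Assumption: an entry matched by n captions gets keep-probability
    min(1, t / n); a caption is kept if any of its matched entries
    passes its coin flip. Captions with no match are dropped.
    """
    rng = random.Random(seed)
    matches = [match_entries(c, metadata) for c in captions]
    counts = Counter(i for m in matches for i in m)
    prob = {i: min(1.0, t / n) for i, n in counts.items()}
    kept = []
    for caption, matched in zip(captions, matches):
        if any(rng.random() < prob[i] for i in matched):
            kept.append(caption)
    return kept


# Toy data: one very frequent entry ("Hund") and one rare entry ("Berlin").
metadata = ["Hund", "Berlin"]
captions = (
    ["ein Hund im Park"] * 100
    + ["Berlin bei Nacht"] * 2
    + ["Sonnenuntergang am Meer"]  # matches no metadata entry
)
kept = balance(captions, metadata, t=2)
```

With these toy inputs, the two rare "Berlin" captions are always kept, the unmatched caption is dropped, and the 100 "Hund" captions are thinned out to roughly `t` on average. The naive scan over all metadata entries is only workable at toy scale; with 500k entries and a pool of this size, a multi-pattern string matcher would presumably be needed.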

Provider:

Spravil
