
Spravil/g400m

Hugging Face | Updated 2026-01-15 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Spravil/g400m
Description:

---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- de
- en
task_ids:
- image-captioning
pretty_name: G400M
size_categories:
- 100M<n<1B
source_datasets:
- mlfoundations/datacomp_xlarge
---

# Dataset Card for G400M

G400M is a German-language image-text dataset with 400M image-text pairs extracted from the [xlarge pool of DataComp](https://huggingface.co/datasets/mlfoundations/datacomp_xlarge). The data is filtered and balanced using the algorithm applied by [MetaCLIP](https://github.com/facebookresearch/MetaCLIP), that is:

1. Build a collection of 500k strings (the metadata) from the German Wikipedia.
2. Filter the data pool for German and English data using [fastText](https://github.com/facebookresearch/fastText).
3. Apply substring matching to the captions with the metadata.
4. Sample the image-text pairs using the MetaCLIP algorithm with a (magic) target number of 20k pairs per metadata entry.

We follow [DataComp](https://huggingface.co/datasets/mlfoundations/datacomp_xlarge) and distribute the image URL-text samples and metadata under a standard Creative Commons CC-BY-4.0 license. The individual images are under their own copyrights.
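
Steps 3 and 4 can be illustrated with a minimal Python sketch. It is not the code used to build G400M: the function names `match_entries` and `balance` are hypothetical, the matching is a naive case-insensitive scan rather than an optimized matcher, and the keep rule assumes the MetaCLIP-style scheme in which a pair is sampled through each matched entry with probability `min(1, target / count)`.

```python
import random
from collections import Counter


def match_entries(caption, metadata):
    """Return indices of metadata entries occurring as substrings of the caption.

    Naive case-insensitive scan, for illustration only.
    """
    text = caption.lower()
    return [i for i, entry in enumerate(metadata) if entry.lower() in text]


def balance(captions, metadata, target, seed=0):
    """Return indices of captions kept after entry-level balancing.

    Assumed rule: an entry matched by `count` captions keeps each of them
    with probability min(1, target / count); a caption survives if at
    least one of its matched entries keeps it.
    """
    rng = random.Random(seed)
    matches = [match_entries(c, metadata) for c in captions]
    # How many captions each metadata entry matches across the pool.
    counts = Counter(i for matched in matches for i in matched)
    # Rare (tail) entries get probability 1; frequent (head) entries are downsampled.
    prob = {i: min(1.0, target / n) for i, n in counts.items()}
    kept = []
    for idx, matched in enumerate(matches):
        if any(rng.random() < prob[i] for i in matched):
            kept.append(idx)
    return kept


if __name__ == "__main__":
    metadata = ["Hund", "Berlin"]
    captions = ["Ein Hund im Park", "Der Hund schläft", "Berlin bei Nacht", "xyz"]
    # The caption without any metadata match is dropped.
    print(balance(captions, metadata, target=2))
```

With a target at or above every entry's match count, all matched captions survive and only unmatched ones are dropped. Lowering the target thins out captions that match only frequent entries, which is what flattens the distribution.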
Provider:
Spravil