bethgelab/dataconcept_128M
Source: Hugging Face. Updated 2026-02-15; indexed 2026-01-03.
Download link:
https://hf-mirror.com/datasets/bethgelab/dataconcept_128M
Description:
DataConcept-128M is a multimodal pretraining dataset comprising 128M web-crawled image-text pairs derived from DataComp-CLIP and annotated with fine-grained details about their concept composition. The dataset is designed to enable Concept-Aware Batch Sampling (CABS), a flexible batch sampling framework that constructs batches on the fly to match a specific target distribution for vision-language pretraining.
Unlike traditional offline, concept-agnostic data curation methods, DataConcept enables:
- **Task-adaptive online concept-based curation** - flexible data sampling tailored to specific downstream tasks
- **Fine-grained concept annotations** - each image includes bounding boxes, object classes, confidence scores, and synthetic captions that are alt-text- and concept-aware
This dataset currently delivers significant improvements in CLIP and SigLIP model performance across 28 benchmarks and serves as a strong open-source alternative to proprietary online data curation algorithms.
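
To make the idea of building a batch against a target concept distribution concrete, here is a minimal sketch. It is not the CABS implementation: the record layout (`id`, `concepts`), the function name `sample_batch`, and the reweighting rule (target frequency divided by pool frequency, averaged over a sample's concepts) are all assumptions chosen for illustration, and the real dataset schema may differ.

```python
import random
from collections import Counter


def sample_batch(pool, target, batch_size, rng):
    """Pick up to batch_size samples from pool, favouring concepts that the
    target distribution weights more heavily than the pool does.

    pool:   list of dicts with a hypothetical "concepts" list per sample
    target: dict mapping concept name -> desired relative frequency
    """
    # Empirical frequency of each concept in the candidate pool.
    counts = Counter(c for s in pool for c in s["concepts"])
    total = sum(counts.values())

    keyed = []
    for i, s in enumerate(pool):
        # Importance ratio target/pool, averaged over the sample's concepts.
        ratios = [target.get(c, 0.0) / (counts[c] / total) for c in s["concepts"]]
        w = sum(ratios) / len(ratios) if ratios else 0.0
        if w > 0:
            # Efraimidis-Spirakis key: weighted sampling without replacement.
            keyed.append((rng.random() ** (1.0 / w), i))

    # Samples with zero weight are never selected, so the batch may come back
    # smaller than batch_size if too few samples match the target.
    keyed.sort(reverse=True)
    return [pool[i] for _, i in keyed[:batch_size]]


# Toy pool: 90 "dog" samples and 10 "cat" samples, balanced target.
pool = [{"id": i, "concepts": ["dog"]} for i in range(90)]
pool += [{"id": 90 + i, "concepts": ["cat"]} for i in range(10)]
batch = sample_batch(pool, {"dog": 0.5, "cat": 0.5}, batch_size=8, rng=random.Random(0))
```

In this toy pool each "cat" sample receives a weight nine times that of a "dog" sample, so rare concepts are upsampled toward the balanced target. Because the weights are computed per call, the target can be swapped between batches without re-filtering the pool, which is the contrast with offline curation drawn above.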
Provided by:
bethgelab



