bethgelab/dataconcept_128M

Name: bethgelab/dataconcept_128M
Creator: bethgelab
Published: 2026-02-15 19:14:39
License: No description available

Hugging Face | Updated 2026-02-15 | Indexed 2026-01-03

Download link:

https://hf-mirror.com/datasets/bethgelab/dataconcept_128M

Description:

DataConcept-128M is a multimodal pretraining dataset comprising 128M web-crawled image-text pairs, derived from DataComp-CLIP and annotated with fine-grained details about their concept composition. The dataset is designed to enable Concept-Aware Batch Sampling (CABS), a flexible batch sampling framework that constructs batches on the fly according to specific target distributions for vision-language pretraining. Unlike traditional offline, concept-agnostic data curation methods, DataConcept enables:

- **Task-adaptive online concept-based curation** - flexible data sampling tailored to specific downstream tasks
- **Fine-grained concept annotations** - each image includes bounding boxes, object classes, confidence scores, and synthetic, alt-text, and concept-aware captions

The dataset currently yields significant improvements in CLIP and SigLIP model performance across 28 benchmarks and serves as a strong open-source alternative to proprietary online data curation algorithms.
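
To make the idea of building a batch against a target concept distribution more concrete, here is a toy sketch in plain Python. It is not the authors' CABS algorithm: the greedy selection rule, the function name `sample_concept_aware_batch`, and the `(sample_id, concepts)` schema are all assumptions made for illustration, and the real dataset's fields may differ.

```python
from collections import Counter


def sample_concept_aware_batch(pool, target, batch_size):
    """Greedily pick up to `batch_size` samples whose concept counts track `target`.

    pool:   list of (sample_id, set_of_concepts) pairs (hypothetical schema).
    target: dict mapping concept -> desired share of the batch (hypothetical).
    Returns the chosen sample ids and the resulting concept counts.
    """
    counts = Counter()          # how often each concept appears in the batch so far
    chosen = []
    remaining = list(pool)

    for _ in range(min(batch_size, len(remaining))):
        def deficit(item):
            # Total shortfall, relative to each concept's quota, that this
            # sample would help fill. Concepts outside `target` have quota 0.
            _, concepts = item
            return sum(target.get(c, 0.0) * batch_size - counts[c] for c in concepts)

        best = max(remaining, key=deficit)  # ties go to the earliest sample
        remaining.remove(best)
        chosen.append(best[0])
        counts.update(best[1])

    return chosen, counts


if __name__ == "__main__":
    # Toy pool: each sample is tagged with the object classes detected in it.
    pool = [
        ("a", {"dog"}), ("b", {"dog"}), ("c", {"cat"}),
        ("d", {"cat"}), ("e", {"car"}), ("f", {"dog"}),
    ]
    batch, concept_counts = sample_concept_aware_batch(
        pool, target={"dog": 0.5, "cat": 0.5}, batch_size=4
    )
    print(batch, dict(concept_counts))
```

With the 50/50 dog/cat target above, the toy pool yields two dog samples and two cat samples, and the off-target "car" sample is skipped. Changing `target` changes which samples are drawn without touching the pool, which is the sense in which curation happens online rather than in a fixed offline filtering pass.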

Provider:

bethgelab
