
bethgelab/dataconcept_128M

Source: Hugging Face. Updated: 2026-02-15. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/bethgelab/dataconcept_128M
Description:

DataConcept-128M is a multimodal pretraining dataset comprising 128M web-crawled image-text pairs derived from DataComp-CLIP and annotated with fine-grained details about their concept composition. The dataset is designed to enable Concept-Aware Batch Sampling (CABS), a flexible batch sampling framework that constructs batches on the fly according to specific target distributions for vision-language pretraining.

Unlike traditional offline, concept-agnostic data curation methods, DataConcept enables:

- **Task-adaptive online concept-based curation** - flexible data sampling tailored to specific downstream tasks
- **Fine-grained concept annotations** - each image includes bounding boxes, object classes, confidence scores, and synthetic captions that are alt-text- and concept-aware

The dataset currently yields significant improvements in CLIP and SigLIP model performance across 28 benchmarks and serves as a strong open-source alternative to proprietary online data curation algorithms.
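The idea of building a batch on the fly to match a target concept distribution can be illustrated with a small sketch. This is not the CABS implementation from the dataset authors; it is a simple quota-based approximation written for illustration. The function name `concept_aware_batch`, the record keys `"id"` and `"concepts"`, and the toy pool are all assumptions and do not reflect the dataset's actual schema.

```python
import random
from collections import Counter


def concept_aware_batch(pool, target_dist, batch_size, seed=0):
    """Pick `batch_size` samples whose concepts roughly follow `target_dist`.

    pool: list of dicts with hypothetical keys "id" and "concepts" (list of str).
    target_dist: dict mapping concept name to its desired share of the batch.
    Illustrative quota-based heuristic only, not the authors' CABS algorithm.
    """
    rng = random.Random(seed)
    # Per-concept quotas: how many batch slots each concept should fill.
    quotas = {c: round(p * batch_size) for c, p in target_dist.items()}

    candidates = list(pool)
    rng.shuffle(candidates)

    batch, leftovers = [], []
    for sample in candidates:
        if len(batch) == batch_size:
            break
        # Concepts in this sample that still have unfilled quota.
        needed = [c for c in sample["concepts"] if quotas.get(c, 0) > 0]
        if needed:
            # Charge the sample to the concept with the largest remaining quota.
            chosen = max(needed, key=lambda c: quotas[c])
            quotas[chosen] -= 1
            batch.append(sample)
        else:
            leftovers.append(sample)

    # If the quotas could not fill the batch, top it up with unused samples.
    batch.extend(leftovers[: batch_size - len(batch)])
    return batch


# Toy pool (made-up data): "dog" is nine times more common than "cat".
pool = [{"id": i, "concepts": ["dog"]} for i in range(90)]
pool += [{"id": 90 + i, "concepts": ["cat"]} for i in range(10)]

batch = concept_aware_batch(pool, {"dog": 0.5, "cat": 0.5}, batch_size=8)
print(Counter(c for s in batch for c in s["concepts"]))
```

Even though the toy pool is heavily skewed toward one concept, the batch follows the requested target shares rather than the pool's natural frequencies. Changing `target_dist` per task is what makes this kind of curation task-adaptive.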
Provider:
bethgelab