Recap-DataComp-1B

Name: Recap-DataComp-1B
Creator: maas
Published: 2026-05-15 17:41:01
License: No description provided

ModelScope community; updated 2026-05-15, indexed 2024-06-25

Download link:

https://modelscope.cn/datasets/AI-ModelScope/Recap-DataComp-1B

Resource description:

# Dataset Card for Recap-DataComp-1B

Recap-DataComp-1B is a large-scale image-text dataset that has been recaptioned using an advanced LLaVA-1.5-LLaMA3-8B model to enhance the alignment and detail of textual descriptions.

## Dataset Details

### Dataset Description

Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries.

- **Curated by:** Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
- **License:** cc-by-4.0

### Dataset Sources

- **Repository:** [https://github.com/UCSC-VLAA/Recap-DataComp-1B](https://github.com/UCSC-VLAA/Recap-DataComp-1B)
- **Paper:** [https://arxiv.org/abs/2406.08478](https://arxiv.org/abs/2406.08478)

## Uses

### Direct Use

Recap-DataComp-1B is intended for training advanced vision-language models, including discriminative models like CLIP and generative models such as text-to-image Diffusion Transformers. It can be used for tasks such as zero-shot classification, cross-modal retrieval, and text-to-image generation.

### Out-of-Scope Use

The dataset is not suitable for applications requiring highly accurate and sensitive personal data, as the recaptioned data may still contain noise and inaccuracies from the original web-crawled data.

## Dataset Structure

The dataset contains fields for image URLs, original captions, recaptioned text, and other metadata such as sha256 hashes. It is structured to facilitate easy access and use for training vision-language models.

## Dataset Creation

### Curation Rationale

The dataset was created to address the noise and misalignment issues present in web-crawled image-text pairs, aiming to improve the performance of vision-language models by providing more semantically rich and well-aligned captions.

### Source Data

The source data is web-crawled image-text pairs from the DataComp-1B dataset, which has been curated from a larger collection of 12.8 billion image-text pairs.

#### Data Collection and Processing

Data was collected through web crawling and subjected to rigorous preprocessing, including safety checks, deduplication, and filtering based on CLIP scores and image-based criteria. The recaptioning was done using a fine-tuned LLaMA-3-8B powered LLaVA-1.5 model.

### Annotations

#### Annotation process

Annotations in the form of recaptioned text were generated using an advanced language model, LLaVA-1.5-LLaMA3-8B. The recaptioning process involved auto-regressive generation with greedy decoding, aimed at producing detailed and semantically rich captions.

#### Who are the annotators?

The annotations were generated by the LLaVA-1.5-LLaMA3-8B model.

#### Personal and Sensitive Information

The dataset has undergone safety checks to filter out harmful content, but users should still exercise caution as some personal or sensitive information may be present due to the nature of web-crawled data.

## Bias, Risks, and Limitations

While the recaptioned dataset aims to improve data quality, it may still contain biases and inaccuracies inherent in the original web-crawled data. Users should be aware of these limitations and the potential for misalignment or noise in the captions.

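The record layout described under Dataset Structure, and the role of the sha256 field, can be sketched roughly as follows. This is a minimal illustration and not the dataset's actual schema: the field names (`url`, `org_caption`, `re_caption`, `sha256`), the `RecapRecord` class, and the `image_matches_hash` helper are all assumptions introduced here, as is the premise that the hash is computed over the raw image bytes.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RecapRecord:
    """One row of the dataset. Field names are illustrative assumptions."""
    url: str          # image URL
    org_caption: str  # original web-crawled caption
    re_caption: str   # caption generated by the recaptioning model
    sha256: str       # hex digest, assumed here to cover the raw image bytes


def image_matches_hash(record: RecapRecord, image_bytes: bytes) -> bool:
    """Return True if the given bytes hash to the digest stored in the record."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    return digest == record.sha256.lower()


if __name__ == "__main__":
    # Stand-in payload; a real pipeline would download the bytes from record.url.
    payload = b"placeholder image bytes"
    record = RecapRecord(
        url="https://example.com/image.jpg",
        org_caption="short alt text",
        re_caption="a longer, more detailed description",
        sha256=hashlib.sha256(payload).hexdigest(),
    )
    print(image_matches_hash(record, payload))
```

Because the dataset stores image URLs, a check along these lines may help flag images whose content has changed since the captions were generated.
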
### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation

**BibTeX:**

```
@article{li2024recaption,
  title={What If We Recaption Billions of Web Images with LLaMA-3?},
  author={Xianhang Li and Haoqin Tu and Mude Hui and Zeyu Wang and Bingchen Zhao and Junfei Xiao and Sucheng Ren and Jieru Mei and Qing Liu and Huangjie Zheng and Yuyin Zhou and Cihang Xie},
  journal={arXiv preprint arXiv:2406.08478},
  year={2024}
}
```

## Acknowledgements

This work is partially supported by a gift from Adobe, TPU Research Cloud (TRC) program, Google Cloud Research Credits program, AWS Cloud Credit for Research program, Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.

## Dataset Card Authors

Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie

## Dataset Card Contact

xli421@ucsc.edu

Provider: maas

Created: 2024-06-21
