
Recap-DataComp-1B

ModelScope community · Updated 2026-05-15 · Indexed 2024-06-25
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Recap-DataComp-1B
# Dataset Card for Recap-DataComp-1B

Recap-DataComp-1B is a large-scale image-text dataset that has been recaptioned using an advanced LLaVA-1.5-LLaMA3-8B model to enhance the alignment and detail of textual descriptions.

## Dataset Details

### Dataset Description

Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries.

- **Curated by:** Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie
- **License:** cc-by-4.0

### Dataset Sources

- **Repository:** [https://github.com/UCSC-VLAA/Recap-DataComp-1B](https://github.com/UCSC-VLAA/Recap-DataComp-1B)
- **Paper:** [https://arxiv.org/abs/2406.08478](https://arxiv.org/abs/2406.08478)

## Uses

### Direct Use

Recap-DataComp-1B is intended for training advanced vision-language models, including discriminative models like CLIP and generative models such as text-to-image Diffusion Transformers.
It can be used for tasks such as zero-shot classification, cross-modal retrieval, and text-to-image generation.

### Out-of-Scope Use

The dataset is not suitable for applications requiring highly accurate and sensitive personal data, as the recaptioned data may still contain noise and inaccuracies from the original web-crawled data.

## Dataset Structure

The dataset contains fields for image URLs, original captions, recaptioned text, and other metadata such as sha256 hashes. It is structured to facilitate easy access and use for training vision-language models.

## Dataset Creation

### Curation Rationale

The dataset was created to address the noise and misalignment issues present in web-crawled image-text pairs, aiming to improve the performance of vision-language models by providing more semantically rich and well-aligned captions.

### Source Data

The source data is web-crawled image-text pairs from the DataComp-1B dataset, which has been curated from a larger collection of 12.8 billion image-text pairs.

#### Data Collection and Processing

Data was collected through web crawling and subjected to rigorous preprocessing, including safety checks, deduplication, and filtering based on CLIP scores and image-based criteria. The recaptioning was done using a fine-tuned LLaMA-3-8B powered LLaVA-1.5 model.

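As an illustration of the record layout described under Dataset Structure and of the deduplication step mentioned above, the following is a minimal sketch rather than the authors' pipeline. The field names (`url`, `original_caption`, `recaption`, `sha256`) are hypothetical stand-ins for the released columns, and the sketch assumes that the `sha256` field is the hex digest of the raw image bytes, which the card does not state explicitly.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Record:
    """One image-text pair. Field names are illustrative, not the official schema."""
    url: str
    original_caption: str
    recaption: str
    sha256: str  # assumed: hex digest of the image bytes


def matches_hash(image_bytes: bytes, record: Record) -> bool:
    """Return True if the downloaded bytes hash to the value stored in the record."""
    return hashlib.sha256(image_bytes).hexdigest() == record.sha256.lower()


def dedupe_by_hash(records: Iterable[Record]) -> Iterator[Record]:
    """Yield only the first record seen for each sha256 value (case-insensitive)."""
    seen = set()
    for record in records:
        key = record.sha256.lower()
        if key in seen:
            continue
        seen.add(key)
        yield record
```

A check like `matches_hash` can be useful after downloading an image from its URL, since web-hosted images may change or disappear after the crawl; a mismatch suggests the caption no longer describes the fetched file.
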
### Annotations

#### Annotation process

Annotations in the form of recaptioned text were generated using an advanced language model, LLaVA-1.5-LLaMA3-8B. The recaptioning process involved auto-regressive generation with greedy decoding, aimed at producing detailed and semantically rich captions.

#### Who are the annotators?

The annotations were generated by the LLaVA-1.5-LLaMA3-8B model.

#### Personal and Sensitive Information

The dataset has undergone safety checks to filter out harmful content, but users should still exercise caution as some personal or sensitive information may be present due to the nature of web-crawled data.

## Bias, Risks, and Limitations

While the recaptioned dataset aims to improve data quality, it may still contain biases and inaccuracies inherent in the original web-crawled data. Users should be aware of these limitations and the potential for misalignment or noise in the captions.

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset. More information needed for further recommendations.

## Citation

**BibTeX:**

```
@article{li2024recaption,
  title={What If We Recaption Billions of Web Images with LLaMA-3?},
  author={Xianhang Li and Haoqin Tu and Mude Hui and Zeyu Wang and Bingchen Zhao and Junfei Xiao and Sucheng Ren and Jieru Mei and Qing Liu and Huangjie Zheng and Yuyin Zhou and Cihang Xie},
  journal={arXiv preprint arXiv:2406.08478},
  year={2024}
}
```

## Acknowledgements

This work is partially supported by a gift from Adobe, TPU Research Cloud (TRC) program, Google Cloud Research Credits program, AWS Cloud Credit for Research program, Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.

## Dataset Card Authors

Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie

## Dataset Card Contact

xli421@ucsc.edu

Provider: maas
Created: 2024-06-21