
CaptionEmporium/conceptual-captions-cc12m-llavanext

Hugging Face, updated 2024-06-30, indexed 2024-06-15
Download link:
https://hf-mirror.com/datasets/CaptionEmporium/conceptual-captions-cc12m-llavanext
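Summary:
This dataset contains 21,930,344 synthetic captions for 10,965,172 images. The captions were generated with the llama3-llava-next-8b model, then cleaned and shortened with Meta-Llama-3-8B. The captions are in English, and the data instance shows the structure of a single row. Specific prompts and cleanup methods were used during dataset creation to keep caption quality high. The source data comes from the cc12m-wds dataset, and the card discusses possible biases and known limitations.
Provider:
CaptionEmporium
Original information summary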

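Dataset card for conceptual-captions-cc12m-llavanext

Dataset description

  • Contact: Caption Emporium

Dataset overview

This is a dataset of 21,930,344 synthetic captions for 10,965,172 images from conceptual_12m. For reproducibility, the archive hosted on Huggingface (cc12m-wds) was used. The captions were produced with https://huggingface.co/lmms-lab/llama3-llava-next-8b running inference in float16, then cleaned up and shortened with Meta-Llama-3-8B.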
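As a quick sanity check on the two counts above, the ratio works out to exactly two captions per image. That is consistent with each image carrying one long and one short synthetic caption, although that reading is an inference from the numbers and not something the card states outright.

```python
# Counts quoted in the overview above.
NUM_CAPTIONS = 21_930_344
NUM_IMAGES = 10_965_172

# Ratio of synthetic captions to images.
captions_per_image = NUM_CAPTIONS / NUM_IMAGES
print(captions_per_image)
```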

Languages

The captions are in English.

Data instances

An example of a single data row:

```json
{
  "caption_llava": "Two loaves of golden brown Cuban bread, one slightly overlapping the other, resting on a white surface, with a focus on the crusts texture and the hint of a tropical setting.",
  "caption_llava_short": "Golden brown Cuban bread loaves rest on a white surface, showcasing their textured crust and hinting at a tropical setting. ",
  "caption": "This is the best recipe I have ever tried for Cuban bread. I lived in Key West... Cuban Recipes, Bread Recipes, Cooking Recipes, Cuban Desserts, Pan Cubano Recipe, Cuban Bread, Cuban Sandwich, Sandwiches, Recipe From Scratch",
  "url": "https://i.pinimg.com/originals/da/5e/76/da5e7622c119c4c96b9e42e7e2a667a0.jpg",
  "key": "000000001",
  "status": "success",
  "error_message": "None",
  "width": 555,
  "height": 416,
  "exif": "{}",
  "original_width": 555
}
```
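A minimal sketch of reading one such row with Python's standard `json` module. The row below is an abbreviated copy of the example (some fields are left out for brevity). Treating `exif` as a JSON-encoded string is an assumption based on its `"{}"` value in the example, and stripping the trailing space from the short caption is an illustrative choice, not a documented step.

```python
import json

# Abbreviated copy of the example row above; values are taken from the example.
row_json = """
{
  "caption_llava_short": "Golden brown Cuban bread loaves rest on a white surface, showcasing their textured crust and hinting at a tropical setting. ",
  "url": "https://i.pinimg.com/originals/da/5e/76/da5e7622c119c4c96b9e42e7e2a667a0.jpg",
  "key": "000000001",
  "status": "success",
  "width": 555,
  "height": 416,
  "exif": "{}"
}
"""

row = json.loads(row_json)

# Assumption: the exif field is itself a JSON document stored as a string.
exif = json.loads(row["exif"])

# The example short caption ends with a space, so strip before use.
short_caption = row["caption_llava_short"].strip()

# Width and height allow simple filtering, for example by aspect ratio.
aspect_ratio = row["width"] / row["height"]
```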

Data splits

Split: train
conceptual-captions-cc12m-llavanext: 10,965,172

Dataset creation

Producing the captions

https://huggingface.co/lmms-lab/llama3-llava-next-8b was prompted with the following to produce a caption:

```py
# Builds the captioning prompt; txt is the image's alt-text.
prompt_gen = lambda txt: f"""
Please make a detailed but succinct caption of this image. If you see text or objects, be sure to describe them in detail along with any other aspects of the foreground and background. As a hint, here is the alt-text attribute of the image, which may or may not have to do with the image:

Hint:

{txt}

"""
```

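This produced roughly 2.6% failed captions. A caption counted as a failure if it met either of these conditions:

  1. It contained one of the following repetitive pieces of text: to_reformats = ["no text", "other objects", "additional objects", "no objects", "alt-text"]
  2. It contained a repetitive sequence.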
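The two failure criteria could be sketched roughly as follows. The phrase list comes from the text above, but everything else is an assumption: the card does not say how matching was done (case-insensitive substring matching is used here), nor how a "repetitive sequence" was detected. The n-gram heuristic and its `n` and `max_repeats` parameters are hypothetical stand-ins.

```python
from collections import Counter

# Phrases listed in the failure definition above.
TO_REFORMATS = ["no text", "other objects", "additional objects", "no objects", "alt-text"]


def has_flagged_phrase(caption: str) -> bool:
    """Criterion 1: the caption mentions one of the listed phrases.

    Assumption: case-insensitive substring matching.
    """
    lowered = caption.lower()
    return any(phrase in lowered for phrase in TO_REFORMATS)


def has_repeated_ngram(caption: str, n: int = 4, max_repeats: int = 2) -> bool:
    """Criterion 2 (hypothetical heuristic): an n-word sequence occurs
    more than max_repeats times in the caption."""
    words = caption.lower().split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count > max_repeats for count in ngrams.values())


def is_failed_caption(caption: str) -> bool:
    """A caption fails if either criterion applies."""
    return has_flagged_phrase(caption) or has_repeated_ngram(caption)
```

Captions flagged this way would then be candidates for the reformatting pass; the thresholds would need tuning against real model output.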

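These captions were reformatted with Meta-Llama-3-8B to fix the repetitions or remove these mentions. Prefixes were then pruned as described in anime-caption-danbooru-2021-sfw-5m-hq.

Short captions were produced with the following prompt in Meta-Llama-3-8B: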

```py
# Builds the shortening prompt; img_prompt is the long caption to distill.
prompt = lambda img_prompt: f"""
Please take the following image caption and attempt to distill it into a single sentence. Remove any redundant lines or descriptions and make it a maximum of 30 words in length.

{img_prompt}

Please only write the caption and no other text.
"""
```
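Because the 30-word limit is only a request to the model, a downstream user may want to verify it. The sketch below applies a simple whitespace-based word count to the two captions from the example row. The helper name `within_word_limit` is hypothetical, and the card does not describe any such post-check.

```python
MAX_WORDS = 30  # limit stated in the prompt above


def within_word_limit(caption: str, max_words: int = MAX_WORDS) -> bool:
    """Hypothetical post-check: count whitespace-separated words."""
    return len(caption.split()) <= max_words


# Captions copied from the example data row.
long_caption = (
    "Two loaves of golden brown Cuban bread, one slightly overlapping the "
    "other, resting on a white surface, with a focus on the crusts texture "
    "and the hint of a tropical setting."
)
short_caption = (
    "Golden brown Cuban bread loaves rest on a white surface, showcasing "
    "their textured crust and hinting at a tropical setting. "
)

print(within_word_limit(long_caption))
print(within_word_limit(short_caption))
```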

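Source data

cc12m-wds

Discussion of biases

Please refer to the original conceptual_12m repository. The captions are likely to depend heavily on the image alt-text and on the training data of the vision-language model.

Known limitations

A very small number of erroneous captions may remain, but the vast majority have been eliminated. A tiny fraction of images (about 0.0018%) could not be captioned; these images have the caption "An image".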
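For scale, 0.0018% of 10,965,172 images is roughly 200 rows. The sketch below works out that estimate and shows one way to drop the placeholder rows. The card does not say which field holds the placeholder, so checking both `caption_llava` and `caption_llava_short` is an assumption, and `drop_placeholder_rows` is a hypothetical helper.

```python
NUM_IMAGES = 10_965_172
PLACEHOLDER = "An image"
FAILED_FRACTION = 0.0018 / 100  # "about 0.0018%" from the text above

# Rough number of images that carry the placeholder caption.
approx_failed = round(NUM_IMAGES * FAILED_FRACTION)
print(approx_failed)


def drop_placeholder_rows(rows):
    """Keep rows whose synthetic captions are not the placeholder.

    Assumption: the placeholder may appear in either synthetic caption field.
    """
    fields = ("caption_llava", "caption_llava_short")
    return [
        row for row in rows
        if all(row.get(field, "").strip() != PLACEHOLDER for field in fields)
    ]
```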

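Additional information

Dataset curators

Caption Emporium

Licensing information

The dataset is available under the Creative Commons ShareAlike license (CC BY-SA 4.0). As per the original dataset repository, Google LLC ("Google") is acknowledged as the aggregator of the original dataset.

Special thanks

The following people provided compute resources to assist with caption generation:

Citation information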

@misc{conceptual-captions-cc12m-llavanext,
  author = { Caption Emporium },
  title = { conceptual-captions-cc12m-llavanext },
  year = { 2024 },
  publisher = { Huggingface },
  journal = { Huggingface repository },
  howpublished = {\url{https://huggingface.co/datasets/CaptionEmporium/conceptual-captions-cc12m-llavanext}},
}
