CaptionEmporium/conceptual-captions-cc12m-llavanext
Dataset Card for conceptual-captions-cc12m-llavanext
Dataset Description
- Point of Contact: Caption Emporium
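Dataset Summary
This is a dataset of 21,930,344 synthetic captions for 10,965,172 images (two captions per image) from conceptual_12m. For reproducibility, the archive hosted on Huggingface (cc12m-wds) was used. The captions were generated with https://huggingface.co/lmms-lab/llama3-llava-next-8b running inference in float16, then cleaned up and shortened with Meta-Llama-3-8B.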
Languages
The captions are in English.
Data Instances
An example of a row:

    {
      "caption_llava": "Two loaves of golden brown Cuban bread, one slightly overlapping the other, resting on a white surface, with a focus on the crusts texture and the hint of a tropical setting.",
      "caption_llava_short": "Golden brown Cuban bread loaves rest on a white surface, showcasing their textured crust and hinting at a tropical setting. ",
      "caption": "This is the best recipe I have ever tried for Cuban bread. I lived in Key West... Cuban Recipes, Bread Recipes, Cooking Recipes, Cuban Desserts, Pan Cubano Recipe, Cuban Bread, Cuban Sandwich, Sandwiches, Recipe From Scratch",
      "url": "https://i.pinimg.com/originals/da/5e/76/da5e7622c119c4c96b9e42e7e2a667a0.jpg",
      "key": "000000001",
      "status": "success",
      "error_message": "None",
      "width": 555,
      "height": 416,
      "exif": "{}",
      "original_width": 555
    }
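As a minimal sketch of how such a row might be consumed, the snippet below picks one of the synthetic captions and decodes the `exif` field, which appears in the example as a JSON-encoded string. The helper name `pick_caption` and the fallback to the original alt-text are assumptions made for illustration, not part of the dataset tooling.

```python
import json

# A trimmed copy of the example row; only fields shown in the card are used.
row = {
    "caption_llava": "Two loaves of golden brown Cuban bread, one slightly overlapping the other, resting on a white surface, with a focus on the crusts texture and the hint of a tropical setting.",
    "caption_llava_short": "Golden brown Cuban bread loaves rest on a white surface, showcasing their textured crust and hinting at a tropical setting. ",
    "caption": "This is the best recipe I have ever tried for Cuban bread. I lived in Key West...",
    "status": "success",
    "exif": "{}",
}


def pick_caption(row, prefer_short=False):
    """Hypothetical helper: return a synthetic caption, else the alt-text."""
    key = "caption_llava_short" if prefer_short else "caption_llava"
    text = (row.get(key) or "").strip()
    # Fall back to the original alt-text if the synthetic caption is empty.
    return text if text else (row.get("caption") or "").strip()


# In the example row, "exif" is a string holding JSON, so it needs decoding.
exif = json.loads(row["exif"])

print(pick_caption(row, prefer_short=True))
```

Note that the short caption in the example carries a trailing space, so stripping whitespace before use is probably worthwhile.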
Data Splits
| | train |
|---|---|
| conceptual-captions-cc12m-llavanext | 10965172 |
Dataset Creation
Producing the Captions
https://huggingface.co/lmms-lab/llama3-llava-next-8b was prompted with the following to produce a caption:

    prompt_gen = lambda txt: f"""
    Please make a detailed but succinct caption of this image. If you see text or objects, be sure to describe them in detail along with any other aspects of the foreground and background. As a hint, here is the alt-text attribute of the image, which may or may not have to do with the image:

    Hint:
    {txt}
    """
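This produced failed captions approximately 2.6% of the time. A failure was defined as either of the following:
- Containing one of the following repetitive pieces of text: `to_reformats = [' no text', ' other objects', ' additional objects', ' no objects ', 'alt-text']`.
- Containing a repetitive sequence.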
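The two failure criteria can be sketched as a simple filter. The phrase list comes from the card; the repetition check is an assumption, since the card does not say how repetitive sequences were detected. Here it is approximated with a hypothetical word n-gram counter (`has_repeated_ngram`, with illustrative defaults `n=4` and `min_repeats=3`).

```python
import re

# Phrases listed in the card, with surrounding spaces kept as reconstructed there.
TO_REFORMATS = [" no text", " other objects", " additional objects", " no objects ", "alt-text"]


def has_flagged_phrase(caption):
    """True if the caption contains any of the listed phrases."""
    return any(phrase in caption for phrase in TO_REFORMATS)


def has_repeated_ngram(caption, n=4, min_repeats=3):
    """Hypothetical heuristic: some word n-gram occurs at least min_repeats times."""
    words = re.findall(r"\w+", caption.lower())
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= min_repeats:
            return True
    return False


def needs_reformat(caption):
    """A caption is treated as failed if either criterion fires."""
    return has_flagged_phrase(caption) or has_repeated_ngram(caption)


print(needs_reformat("A red car on a street. There is no text in the image."))
```

The thresholds would need tuning against real captions; they are placeholders rather than values used by the dataset authors.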
These captions were then reformatted with Meta-Llama-3-8B to fix the repetitions or remove these mentions. Prefixes were then pruned as described in anime-caption-danbooru-2021-sfw-5m-hq.
Short captions were produced with the following prompt in Meta-Llama-3-8B:

    prompt = lambda img_prompt: f"""
    Please take the following image caption and attempt to distill it into a single sentence. Remove any redundant lines or descriptions and make it a maximum of 30 words in length.

    {img_prompt}

    Please only write the caption and no other text.
    """
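Because the 30-word limit is only requested in the prompt, a consumer of the data may want to verify it. The following is a minimal sketch that counts whitespace-separated words; the helper names `word_count` and `within_limit` are hypothetical, and whitespace splitting is just one reasonable definition of a word.

```python
def word_count(caption):
    """Count whitespace-separated tokens in a caption."""
    return len(caption.split())


def within_limit(caption, max_words=30):
    """True if the caption respects the word limit requested in the prompt."""
    return word_count(caption) <= max_words


# The short caption from the example row in this card.
short = (
    "Golden brown Cuban bread loaves rest on a white surface, showcasing "
    "their textured crust and hinting at a tropical setting. "
)

print(word_count(short), within_limit(short))
```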
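Source Data
Discussion of Biases
Please refer to the original conceptual_12m repository. The captions are likely to depend heavily on the alt-text of each image and on the training data of the vision-language model.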
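Known Limitations
A very small number of erroneous captions may remain, but the vast majority have been eliminated. A very small fraction of images (about 0.0018%) could not be captioned; these images have the caption "An image".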
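For a sense of scale, 0.0018% of 10,965,172 images is roughly 200 rows. The sketch below works out that estimate and shows one way such rows might be dropped. It assumes the placeholder appears verbatim in the `caption_llava` column, which the card does not state explicitly, and the helper `is_placeholder` is hypothetical.

```python
TOTAL_IMAGES = 10_965_172
FAILED_FRACTION = 0.0018 / 100  # "about 0.0018%" from the card

# Rough estimate of how many rows carry the placeholder caption.
approx_failed = round(TOTAL_IMAGES * FAILED_FRACTION)

PLACEHOLDER = "An image"


def is_placeholder(row):
    """Assumption: the placeholder is stored verbatim in caption_llava."""
    return (row.get("caption_llava") or "").strip() == PLACEHOLDER


# Tiny illustrative rows, not real dataset entries.
rows = [
    {"key": "a", "caption_llava": "An image"},
    {"key": "b", "caption_llava": "Two loaves of golden brown Cuban bread."},
]
kept = [r for r in rows if not is_placeholder(r)]

print(approx_failed, [r["key"] for r in kept])
```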
Additional Information
Dataset Curators
Caption Emporium
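Licensing Information
The dataset is available under the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0). Per the original dataset repository, Google LLC ("Google") is acknowledged as having aggregated the original dataset.
Special Thanks
The following people provided compute to assist with the generation of the captions: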
Citation Information

    @misc{conceptual-captions-cc12m-llavanext,
      author = { Caption Emporium },
      title = { conceptual-captions-cc12m-llavanext },
      year = { 2024 },
      publisher = { Huggingface },
      journal = { Huggingface repository },
      howpublished = {\url{https://huggingface.co/datasets/CaptionEmporium/conceptual-captions-cc12m-llavanext}},
    }



