CaptionEmporium/coyo-hd-11m-llavanext
Dataset Card for coyo-hd-11m-llavanext
Dataset Description
Dataset Summary
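This is a dataset of 22,794,288 synthetic captions for 11,397,144 images from coyo-700m. The "hd" in the title stands for both high density and high definition. Although large alt-text image-pair datasets contain many images, only a small fraction of them have high resolution and rich concept density. For example, more than 50% of the images in these datasets are thumbnail-sized or very small images that contain only some text or a single product. To mitigate this problem of low-definition, low-concept-density images, the first 450 million rows of the coyo-700m dataset were prefiltered to a minimum of 512 pixels on the shortest edge and then run through two multi-label classifiers.
The captions were produced using https://huggingface.co/lmms-lab/llama3-llava-next-8b and then cleaned up and shortened with Meta-Llama-3-8B.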
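The resolution prefilter described above can be sketched as a single comparison on the `width` and `height` of each image. This is a minimal illustration, not the curators' actual code: the function name `passes_size_prefilter` is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption, since the card does not say whether an image with a shortest edge of exactly 512 pixels was kept.

```python
def passes_size_prefilter(width: int, height: int, min_short_edge: int = 512) -> bool:
    """Return True when the shortest edge meets the minimum size.

    Assumption: the bound is inclusive; the card does not specify this.
    """
    return min(width, height) >= min_short_edge


# Dimensions taken from the sample row in this card (1280 x 789).
print(passes_size_prefilter(1280, 789))
# A hypothetical wide banner whose short edge is too small.
print(passes_size_prefilter(1024, 300))
```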
Languages
The captions are in English.
Data Instances
An example of a row:
```json
{
  "url": "https://images.nintendolife.com/cd4b7518ec8c2/large.jpg",
  "caption_llava": "A figurine of a character with green hair, wearing a white shirt, a black vest, and a gray cap, sitting with one hand on their knee and the other hand making a peace sign. The character is wearing a blue pendant and has a gold bracelet. In the background, there are green plants and a tree branch.",
  "caption_llava_short": "A green-haired character sits with a peace sign, wearing a blue pendant and gold bracelet, surrounded by green plants and a tree branch.",
  "caption": "Pokémon Center Reveals Official N And Zorua Figure, Pre-Orders Have Gone Live",
  "tags_open_images": "[\"Black\", \"Green\", \"White\", \"Animation\"]",
  "tags_booru": "[\"bangs\", \"long_hair\", \"solo\", \"hat\", \"sitting\", \"jewelry\", \"necklace\", \"smile\", \"green_hair\", \"1boy\", \"tree\", \"pants\", \"shirt\", \"male_focus\", \"white_shirt\", \"bracelet\", \"ponytail\", \"baseball_cap\", \"black_shirt\", \"bangle\", \"branch\", \"index_finger_raised\", \"closed_mouth\", \"blurry\", \"blurry_background\"]",
  "key": 25,
  "clip_similarity_vitb32": 0.1964111328125,
  "clip_similarity_vitl14": 0.259033203125,
  "nsfw_score_opennsfw2": 0.0290679931640625,
  "nsfw_score_gantman": 0.036349426954984665,
  "watermark_score": 0.0038619472179561853,
  "aesthetic_score_laion_v2": 5.079052925109863,
  "num_faces": 0,
  "width": 1280,
  "height": 789,
  "exif": "{}",
  "sha256": "dbec63de854341a189ba87d27dc04945e3d4fef0b0275f496ae16c79b723a157"
}
```
Data Splits
| | train |
|---|---|
| coyo-hd-11m-llavanext | 11397144 |
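Dataset Creation
High Concept Filtering
The images were tagged with two multi-label classifiers, ML_Decoder TResNet-M Open Images and mldanbooru, and then selected according to the following criteria: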
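```py
def image_excluded(oi_tags, booru_tags):
    """Return True if the image should be dropped as low-concept."""
    if (
        # Product shot with no people in it
        ('Product' in oi_tags and 'no_humans' in booru_tags)
        # Text-focused image with no people in it
        or ('Text' in oi_tags and 'no_humans' in booru_tags and 'text_focus' in booru_tags)
        # Too few tags from either classifier
        or len(oi_tags) < 2
        or len(booru_tags) < 3
        or 'text-only_page' in booru_tags
    ):
        return True
    return False
```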
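This simple filter removed most of the low-quality images, namely product images with no background and text-only pages such as PowerPoint slides. Of the 23 million candidate images in the dataset larger than 512 pixels, only 11 million remained after filtering.
The multi-label classifier results are embedded in each row as `tags_open_images` and `tags_booru`, which makes them easy to use for class-specific downstream tasks. For example, to fine-tune your model on baseball caps, you could look for the "baseball_cap" tag.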
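As a rough sketch of such a tag lookup, the snippet below selects rows whose `tags_booru` column contains a given tag. It assumes the tag columns are stored as JSON-encoded strings, as the sample row suggests. The helper `has_tag` and the second row (key 26) are hypothetical and exist only for illustration.

```python
import json


def has_tag(row: dict, tag: str, column: str = "tags_booru") -> bool:
    """Check whether a tag appears in a JSON-encoded tag column of a row.

    Assumption: the column holds a JSON list serialized as a string.
    """
    tags = json.loads(row[column])
    return tag in tags


# Two abbreviated rows; the second one is made up for this example.
rows = [
    {"key": 25, "tags_booru": '["hat", "baseball_cap", "tree"]'},
    {"key": 26, "tags_booru": '["outdoors", "sky", "cloud"]'},
]

baseball_cap_rows = [row for row in rows if has_tag(row, "baseball_cap")]
print([row["key"] for row in baseball_cap_rows])
```

The same predicate could be passed to whatever row-filtering facility your data loader provides.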
Producing the Captions
https://huggingface.co/lmms-lab/llama3-llava-next-8b was prompted with the following to produce a caption:
```py
prompt_gen = lambda txt: f"""
Please make a detailed but succinct caption of this image. If you see text or objects, be sure to describe them in detail along with any other aspects of the foreground and background. As a hint, here is the alt-text attribute of the image, which may or may not have to do with the image:
Hint:
{txt}
"""
```
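This produced a failure roughly 2.7% of the time. A failure was defined as either:
- Containing one of the following repetitive pieces of text: `to_reformats = [' no text', ' other objects', ' additional objects', ' no objects ', 'alt-text']`.
- Containing a repetitive sequence.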
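These captions were reformatted with Meta-Llama-3-8B to fix the repetitions or to remove the mentions of these things. Prefixes were then pruned as in anime-caption-danbooru-2021-sfw-5m-hq.
Short captions were produced from the resulting captions using the following prompt in Meta-Llama-3-8B: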
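The two failure criteria above could be checked along the following lines. This is only a sketch under stated assumptions. The phrase list mirrors `to_reformats`, with the surrounding whitespace of each string being a best guess. The card does not say how repetitive sequences were detected, so `has_repeated_ngram` is a hypothetical stand-in that flags any four-word sequence occurring more than twice.

```python
# Phrases from the card's to_reformats list (exact whitespace is an assumption).
TO_REFORMATS = [" no text", " other objects", " additional objects", " no objects ", "alt-text"]


def has_repeated_ngram(caption: str, n: int = 4, max_repeats: int = 2) -> bool:
    """Hypothetical heuristic: True if any n-word sequence occurs more than max_repeats times."""
    words = caption.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        ngram = tuple(words[i:i + n])
        counts[ngram] = counts.get(ngram, 0) + 1
        if counts[ngram] > max_repeats:
            return True
    return False


def caption_failed(caption: str) -> bool:
    """Flag a caption that contains a listed phrase or a repeated word sequence."""
    return any(phrase in caption for phrase in TO_REFORMATS) or has_repeated_ngram(caption)


print(caption_failed("A dog on a beach. There are no objects in the background."))
print(caption_failed("A red car parked beside a brick wall."))
```

Captions flagged this way would be the ones sent for reformatting; the thresholds `n` and `max_repeats` are illustrative and would need tuning on real data.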
```py
prompt = lambda img_prompt: f"""
Please take the following image caption and attempt to distill it into a single sentence. Remove any redundant lines or descriptions and make it a maximum of 30 words in length.
{img_prompt}
Please only write the caption and no other text.
"""
```
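The prompt asks for at most 30 words, but the card does not state whether the limit was verified afterwards. If you want to check it yourself, a whitespace-based word count is a simple approximation. The helper `within_word_limit` is hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
def within_word_limit(caption: str, max_words: int = 30) -> bool:
    """Return True if the caption has at most max_words whitespace-separated words."""
    return len(caption.split()) <= max_words


# The caption_llava_short value from the sample row in this card.
short_caption = (
    "A green-haired character sits with a peace sign, wearing a blue pendant "
    "and gold bracelet, surrounded by green plants and a tree branch."
)
print(within_word_limit(short_caption))
```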
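Source Data
Obtained by accessing the URLs from coyo-700m.
Discussion of Biases
The dataset will be biased towards the concepts recognized by the multi-label classifiers.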
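Known Limitations
A very small number of erroneous captions may still be present, but the vast majority have been eliminated. The dataset has not been evaluated for safety and instead relies on Kakao Brain's NSFW filtering scheme.
The `blurry` and `watermark` tags are not filtered out. In the first case, images with bokeh often trigger the `blurry` tag and should not be excluded. In the second case, many machine learning tasks are indifferent to the presence of watermarks, and the tags provided in the dataset offer an easy way to filter them out.
Hint: If you are training text-to-image diffusion models, use only the images with watermarks for your unconditional training. Classifier-free guidance will then preferentially create images without watermarks.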
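One way to act on this hint is to partition rows by the `watermark_score` column shown in the sample row. The sketch below is illustrative only. The helper `split_by_watermark`, the 0.5 threshold, and the second row (key 99) are hypothetical, and it assumes that a higher score means a watermark is more likely.

```python
def split_by_watermark(rows, threshold=0.5):
    """Partition rows into (clean, watermarked) lists.

    Assumptions: higher watermark_score means a watermark is more likely,
    and the threshold value is a placeholder to be tuned.
    """
    clean, watermarked = [], []
    for row in rows:
        if row["watermark_score"] >= threshold:
            watermarked.append(row)
        else:
            clean.append(row)
    return clean, watermarked


rows = [
    {"key": 25, "watermark_score": 0.0038619472179561853},  # from the sample row
    {"key": 99, "watermark_score": 0.93},  # made-up row for illustration
]
clean, watermarked = split_by_watermark(rows)
print([r["key"] for r in clean], [r["key"] for r in watermarked])
```

Under the hint above, the `watermarked` rows would be the ones trained with an empty (unconditional) caption, while the `clean` rows keep their captions.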
Additional Information
Dataset Curators
Caption Emporium
Licensing Information
The dataset is available under the Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) license.
Citation Information
```
@misc{coyo-hd-11m-llavanext,
    author = { Caption Emporium },
    title = { coyo-hd-11m-llavanext },
    year = { 2024 },
    publisher = { Huggingface },
    journal = { Huggingface repository },
    howpublished = {\url{https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext}},
}
```



