
CaptionEmporium/coyo-hd-11m-llavanext

Hugging Face. Updated 2024-07-06; indexed 2024-06-25.
Download link:
https://hf-mirror.com/datasets/CaptionEmporium/coyo-hd-11m-llavanext
Summary:
This dataset contains 11,397,144 images filtered from [coyo-700m](https://huggingface.co/datasets/kakaobrain/coyo-700m) and 22,794,288 corresponding synthetic captions. The captions were generated with the [llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b) model, then cleaned up and shortened. The images were filtered for high density and high definition, so that the set favors image quality and concept density. The dataset is intended primarily for text-to-image and image-to-text tasks, and it includes tags and classifier results to support downstream use.
Provider:
CaptionEmporium
Original information summary

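Dataset Card for coyo-hd-11m-llavanext

Dataset Description

Dataset Summary

This is a dataset of 22,794,288 synthetic captions for 11,397,144 images from coyo-700m. The "hd" in the title stands for high density and high definition. Although large alt-text image pair datasets contain many images, only a very small proportion of those images have a higher resolution and substantial concept density. For example, more than 50% of the images in these datasets are thumbnail-sized or very small images that contain only some text or a single product. To mitigate this problem of low-definition, low-concept-density images, the first 450 million rows of the coyo-700m dataset were prefiltered to 512 pixels on the shortest edge and then run through two multi-label classifiers.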
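The resolution prefilter described above can be sketched as a simple check on image dimensions. This is a minimal illustration, not the curators' code: the function name `passes_resolution_prefilter` is hypothetical, and treating the 512-pixel threshold as inclusive is an assumption.

```python
def passes_resolution_prefilter(width, height, min_shortest_edge=512):
    """Return True if the image's shortest edge is at least `min_shortest_edge` pixels.

    Assumption: the threshold is inclusive; the dataset card does not say.
    """
    return min(width, height) >= min_shortest_edge


# The example row in this card is 1280 x 789, so its shortest edge is 789.
keeps_example = passes_resolution_prefilter(1280, 789)
# A hypothetical banner-shaped image whose shortest edge is only 300 pixels.
keeps_banner = passes_resolution_prefilter(1024, 300)
print(keeps_example, keeps_banner)
```

Checking the shortest edge rather than the pixel count rejects long, thin banners that would pass an area-based threshold.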

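The captions were generated with https://huggingface.co/lmms-lab/llama3-llava-next-8b, then cleaned up and shortened with Meta-Llama-3-8B.

Languages

The captions are in English.

Data Instances

An example of a row: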

```json
{
  "url": "https://images.nintendolife.com/cd4b7518ec8c2/large.jpg",
  "caption_llava": "A figurine of a character with green hair, wearing a white shirt, a black vest, and a gray cap, sitting with one hand on their knee and the other hand making a peace sign. The character is wearing a blue pendant and has a gold bracelet. In the background, there are green plants and a tree branch.",
  "caption_llava_short": "A green-haired character sits with a peace sign, wearing a blue pendant and gold bracelet, surrounded by green plants and a tree branch.",
  "caption": "Pokémon Center Reveals Official N And Zorua Figure, Pre-Orders Have Gone Live",
  "tags_open_images": "[\"Black\", \"Green\", \"White\", \"Animation\"]",
  "tags_booru": "[\"bangs\", \"long_hair\", \"solo\", \"hat\", \"sitting\", \"jewelry\", \"necklace\", \"smile\", \"green_hair\", \"1boy\", \"tree\", \"pants\", \"shirt\", \"male_focus\", \"white_shirt\", \"bracelet\", \"ponytail\", \"baseball_cap\", \"black_shirt\", \"bangle\", \"branch\", \"index_finger_raised\", \"closed_mouth\", \"blurry\", \"blurry_background\"]",
  "key": 25,
  "clip_similarity_vitb32": 0.1964111328125,
  "clip_similarity_vitl14": 0.259033203125,
  "nsfw_score_opennsfw2": 0.0290679931640625,
  "nsfw_score_gantman": 0.036349426954984665,
  "watermark_score": 0.0038619472179561853,
  "aesthetic_score_laion_v2": 5.079052925109863,
  "num_faces": 0,
  "width": 1280,
  "height": 789,
  "exif": "{}",
  "sha256": "dbec63de854341a189ba87d27dc04945e3d4fef0b0275f496ae16c79b723a157"
}
```
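The example row suggests that the tag lists and the EXIF data are stored as JSON-encoded strings rather than as nested values. Assuming that holds for every row, a minimal sketch of decoding them might look like the following. The helper name `decode_row` and the constant `JSON_STRING_FIELDS` are hypothetical, and the sample row is abbreviated from the example above.

```python
import json

# Fields that appear to hold JSON-encoded strings in the example row (an assumption).
JSON_STRING_FIELDS = ("tags_open_images", "tags_booru", "exif")


def decode_row(row):
    """Return a copy of `row` with the JSON-encoded string fields parsed."""
    decoded = dict(row)
    for field in JSON_STRING_FIELDS:
        value = decoded.get(field)
        if isinstance(value, str):
            decoded[field] = json.loads(value)
    return decoded


# Abbreviated version of the example row.
example = {
    "key": 25,
    "tags_open_images": '["Black", "Green", "White", "Animation"]',
    "tags_booru": '["green_hair", "baseball_cap", "blurry"]',
    "exif": "{}",
    "width": 1280,
    "height": 789,
}
row = decode_row(example)
print(row["tags_open_images"])
```

Copying the row before decoding leaves the original mapping untouched, which helps when the same row is reused elsewhere in a pipeline.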

Data Splits

| | train |
|---|---|
| coyo-hd-11m-llavanext | 11397144 |

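Dataset Creation

High Concept Filtering

The images were tagged with two multi-label classifiers, ML_Decoder TResNet-M Open Images and mldanbooru, and then selected according to the following criteria: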

```py
def image_excluded(oi_tags, booru_tags):
    # Exclude product shots, text-only images, and images with too few tags.
    if (
        ('Product' in oi_tags and 'no_humans' in booru_tags)
        or ('Text' in oi_tags and 'no_humans' in booru_tags and 'text_focus' in booru_tags)
        or len(oi_tags) < 2
        or len(booru_tags) < 3
        or 'text-only_page' in booru_tags
    ):
        return True
    return False
```

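This simple filter removed most of the low-quality images, namely product shots with no background and text-only pages such as PowerPoint slides. Of the 23 million candidate images in the dataset that were larger than 512 pixels, only 11 million remained.

The results of the multi-label classifiers are embedded in each row as tags_open_images and tags_booru, which makes them easy to use for class-specific downstream tasks. For example, if you want to fine-tune your model on baseball caps, you can look for the "baseball_cap" tag.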
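Selecting rows by tag, as in the baseball cap example, could be sketched as below. This assumes the tag fields are JSON-encoded strings, as they appear in the example row. The helper `rows_with_tag` is hypothetical, and the second sample row (key 26) is made up for illustration.

```python
import json


def rows_with_tag(rows, tag, field="tags_booru"):
    """Yield the rows whose tag list in `field` contains `tag`."""
    for row in rows:
        tags = row[field]
        if isinstance(tags, str):  # assumed to be a JSON-encoded list
            tags = json.loads(tags)
        if tag in tags:
            yield row


sample_rows = [
    {"key": 25, "tags_booru": '["green_hair", "baseball_cap", "blurry"]'},
    {"key": 26, "tags_booru": '["tree", "outdoors", "sky"]'},  # made-up row
]
caps = list(rows_with_tag(sample_rows, "baseball_cap"))
print([row["key"] for row in caps])
```

Writing the helper as a generator means it can also be applied to a streamed dataset without loading every row into memory.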

Producing the Captions

https://huggingface.co/lmms-lab/llama3-llava-next-8b was prompted with the following to generate the captions:

```py
prompt_gen = lambda txt: f"""
Please make a detailed but succinct caption of this image. If you see text or objects, be sure to describe them in detail along with any other aspects of the foreground and background. As a hint, here is the alt-text attribute of the image, which may or may not have to do with the image:

Hint:

{txt}

"""
```

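This produced a failure approximately 2.7% of the time. A failure was defined as either of the following:

  1. Containing one of the following repetitive pieces of text: to_reformats = [' no text', ' other objects', ' additional objects', ' no objects ', 'alt-text']
  2. Containing a repetitive sequence.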
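The two failure criteria above can be sketched as a single check. The source does not define "repetitive sequence", so the n-gram heuristic here (`has_repeated_ngram`, with its `n` and `min_repeats` defaults) is purely an assumption. The exact whitespace inside each phrase of `TO_REFORMATS` is also reconstructed, not confirmed.

```python
# Phrases from the dataset card; the surrounding whitespace is an assumption.
TO_REFORMATS = [" no text", " other objects", " additional objects", " no objects ", "alt-text"]


def has_repeated_ngram(text, n=4, min_repeats=3):
    """Heuristic: True if any run of `n` words occurs at least `min_repeats` times."""
    words = text.lower().split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] >= min_repeats:
            return True
    return False


def caption_failed(caption):
    """True if the caption mentions a flagged phrase or repeats itself."""
    return any(phrase in caption for phrase in TO_REFORMATS) or has_repeated_ngram(caption)


print(caption_failed("A dog on a beach. There is no text in the image."))
print(caption_failed("A dog runs on a beach at sunset."))
```

Captions flagged this way would then go to the reformatting step, not be dropped outright.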

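These captions were reformatted with Meta-Llama-3-8B to fix the repetitions or remove the mentions listed above. Prefixes were then pruned, as in anime-caption-danbooru-2021-sfw-5m-hq.

Short captions were generated with Meta-Llama-3-8B using the following prompt: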

```py
prompt = lambda img_prompt: f"""
Please take the following image caption and attempt to distill it into a single sentence. Remove any redundant lines or descriptions and make it a maximum of 30 words in length.

{img_prompt}

Please only write the caption and no other text.
"""
```
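The prompt asks for at most 30 words, but the source does not say whether the limit was checked afterwards. A hypothetical post-check could look like the sketch below. The helper `within_word_limit` and the choice to count whitespace-separated tokens as words are both assumptions.

```python
def within_word_limit(caption, max_words=30):
    """True if the caption has at most `max_words` whitespace-separated tokens."""
    return len(caption.split()) <= max_words


# The short caption from the example row in this card.
short_caption = (
    "A green-haired character sits with a peace sign, wearing a blue pendant "
    "and gold bracelet, surrounded by green plants and a tree branch."
)
print(within_word_limit(short_caption))
```

A caption that fails the check could be regenerated or truncated; which of these is appropriate depends on the downstream task.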

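Source Data

Obtained by accessing the URLs from coyo-700m.

Discussion of Biases

The dataset will be biased towards the concepts recognized by the multi-label classifiers.

Known Limitations

A very small number of erroneous captions may still be present, but the vast majority have been eliminated. The dataset has not been evaluated for safety and instead relies on Kakao Brain's NSFW filtering scheme.

Neither the blurry tag nor the watermark tag was filtered out. In the first case, images with bokeh often trigger the blurry tag and should not be excluded. In the second case, many machine learning tasks are indifferent to the presence of watermarks, and the tags supplied in the dataset provide an easy way to filter them out.

Hint: If you are training a text-to-image diffusion model, use images with watermarks only for unconditional training. Classifier-free guidance will then preferentially create images without watermarks.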
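One way to act on this hint is to drop the caption for watermarked images so that they only contribute to the unconditional branch. The sketch below is an illustration under several assumptions: that the watermark tag appears in `tags_booru` under the name "watermark", that tag fields are JSON-encoded strings, and that the trainer treats an empty string as the null condition. The helper `training_caption` and the sample rows are hypothetical.

```python
import json


def training_caption(row, caption_field="caption_llava"):
    """Return the caption to condition on, or "" to train unconditionally."""
    tags = row["tags_booru"]
    if isinstance(tags, str):  # assumed to be a JSON-encoded list
        tags = json.loads(tags)
    if "watermark" in tags:
        return ""  # empty prompt: the sample only trains the unconditional branch
    return row[caption_field]


# Made-up rows for illustration.
clean_row = {"caption_llava": "A figurine on a shelf.", "tags_booru": '["solo", "hat"]'}
marked_row = {"caption_llava": "A stock photo of a city.", "tags_booru": '["building", "watermark"]'}
print(repr(training_caption(clean_row)), repr(training_caption(marked_row)))
```

Because classifier-free guidance steers samples away from the unconditional prediction, associating watermarks only with the empty prompt is what would push guided outputs away from them.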

Additional Information

Dataset Curators

Caption Emporium

Licensing Information

The dataset is available under the Creative Commons ShareAlike (CC BY-SA 4.0) license.

Citation Information

```
@misc{coyo-hd-11m-llavanext,
  author = { Caption Emporium },
  title = { coyo-hd-11m-llavanext },
  year = { 2024 },
  publisher = { Huggingface },
  journal = { Huggingface repository },
  howpublished = {\url{https://huggingface.co/datasets/CaptionEmporium/coyo-hd-11m-llavanext}},
}
```
