CaptionEmporium/anime-caption-danbooru-2021-sfw-5m-hq

Name: CaptionEmporium/anime-caption-danbooru-2021-sfw-5m-hq
Creator: CaptionEmporium
Published: 2024-06-09 22:28:57
License: no description provided

Hugging Face. Updated 2024-06-09. Indexed 2024-05-25.

Download link:

https://hf-mirror.com/datasets/CaptionEmporium/anime-caption-danbooru-2021-sfw-5m-hq


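Resource overview:

This dataset contains 5.71 million captions for 1.43 million images from the Danbooru 2021 dataset. Each image has four captions, generated respectively by CogVLM, llava-v1.6-34b, a cleaned version of the llava-v1.6-34b output, and a shortened version of the llava-v1.6-34b output. The captions are in English and have been safety-filtered so that the content is safe for work (SFW). The dataset is intended mainly for image-to-text tasks and falls in the 1M to 10M size category.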

Provider:

CaptionEmporium

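Summary of original information

Dataset Card for anime-caption-danbooru-2021-sfw-5m-hq

Dataset Description

Point of Contact: Caption Emporium

Dataset Summary

This is a dataset of 5.71M captions for 1.43M images drawn from a safe-for-work (SFW) filtered subset of the Danbooru 2021 dataset. Each image has four captions: one generated by CogVLM, one generated by llava-v1.6-34b, one that is the llava-v1.6-34b caption after cleaning, and one that is the cleaned caption after shortening. The generation methods are described in detail below.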

Most captions are substantially longer than 77 tokens, so they are unsuitable for discrimination with current CLIP-based approaches.
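The 77-token figure matches the context length of CLIP's text encoder. As a rough, tokenizer-free sketch of how one might flag captions that cannot fit: every whitespace-separated word becomes at least one BPE token, and CLIP adds a start and an end token, so the word count gives a conservative lower bound. The function and constant names are illustrative and not part of the dataset.

```python
CLIP_CONTEXT_LENGTH = 77  # context length of CLIP's text encoder
SPECIAL_TOKENS = 2        # start-of-text and end-of-text markers


def certainly_exceeds_clip_context(caption: str) -> bool:
    """Return True when the caption cannot fit in CLIP's context window.

    Each whitespace-separated word maps to at least one BPE token, so the
    word count is only a lower bound on the real token count. A result of
    False therefore means "unknown", not "fits".
    """
    min_tokens = len(caption.split()) + SPECIAL_TOKENS
    return min_tokens > CLIP_CONTEXT_LENGTH
```

For an exact count, the caption would need to be run through the actual CLIP tokenizer; this check only avoids that cost for captions that are obviously too long.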

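Languages

The captions are in English.

SFW Filtering

The SFW portion of the original dataset was filtered with Falconsai/nsfw_image_detection. ML-Danbooru was then used to detect tags that commonly depict sexual material (for example, panties), and any matches on these adult-themed tags were removed.

Data Instances

An example of a row: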


```json
{
  "id": 2341041,
  "md5": "d1d9046c08cffa243d629d68a238fffa",
  "caption_cogvlm": "The image showcases an anime-style illustration of a young woman with short brown hair, wearing a black skirt, a white blouse, and a yellow neckwear. She has a cheerful expression, with her brown eyes looking directly at the viewer. The woman is equipped with a large, intricate piece of machinery on her right side, which appears to be a part of a military or combat vehicle. The background is a clear blue sky, suggesting a daytime setting. The image also contains textual elements, including the character's name st. coochacha ueg6, the title Central Collection Agency Illustration 0006, and the artist's name TWOBAN.",
  "caption_llava_34b": "The image depicts an animated character, a girl with brown hair styled in twin tails, wearing a school uniform with a white blouse, a black pleated skirt, and a yellow neckerchief. She has a headband with a bow and is smiling at the viewer. The character is also wearing a pair of gloves and has a propeller hair ornament.\n\nThe character is standing in front of a large piece of machinery, which appears to be a naval turret, suggesting a military or naval theme. The machinery is detailed with visible ammunition belts and shells, indicating that it is a weapon system.\n\nThe background of the image shows a clear blue sky with a few clouds, suggesting that the scene is set outdoors during the day. The overall style of the image is reminiscent of anime or manga, with vibrant colors and detailed character design.\n\nThe text \"stomachache\" and \"Kantai Collection\" are visible in the image, which could be part of the title or a reference to the series or game the character is from. The text \"Kantai Collection\" is a well-known Japanese video game and media franchise that features anthropomorphic naval ships as characters.\n\nThe character is labeled as \"Teruzuki\" from \"Kantai Collection,\" and the object she is holding is identified as \"Chou-10cm-hou-chan,\" which is likely a reference to a specific weapon or item within the context of the series.\n\nThe image is a digital illustration, and the character is drawn in a style that is typical for anime and manga, with exaggerated features and a focus on the character's expression and attire. The artwork is detailed and colorful, with a strong emphasis on the character's pose and the machinery in the background.",
  "caption_llava_34b_no_tags": "The image depicts an animated character, a girl with brown hair styled in twin tails, wearing a school uniform with a white blouse, a black pleated skirt, and a yellow neckerchief. She has a headband with a bow and is smiling at the viewer. The character is also wearing a pair of gloves and has a propeller hair ornament.\n\nThe text \"stomachache\" and \"Kantai Collection\" are visible in the image, which could be part of the title or a reference to the series or game the character is from. The character is labeled as Teruzuki from the Kantai Collection, a well-known Japanese video game and media franchise that features anthropomorphic naval ships as characters.\n\nThe character is holding an object identified as Chou-10cm-hou-chan, which is likely a reference to a specific weapon or item within the context of the series. ",
  "caption_llava_34b_no_tags_short": "Teruzuki, a girl with brown hair styled in twin tails, stands in front of a naval turret, wearing a school uniform and a propeller hair ornament. She smiles at the viewer, her gloves and bow-adorned headband adding to her charm. The background features a clear blue sky with clouds, while the machinery behind her is detailed with ammunition belts and shells. The image is a digital illustration, blending anime and manga styles with vibrant colors and exaggerated features. ",
  "mldanbooru_tag_caption": "anime style picture of a woman or girl, brown hair, long hair, solo, black skirt, blue eyes, skirt, neckerchief, braid, headband, breasts, day, sky, smile, gloves, looking at viewer, thighhighs, twin braids, school uniform, serafuku, cowboy shot, hair ornament, hairband, medium breasts, machinery, pleated skirt, turret, grey eyes, cannon, miniskirt, black gloves, hachimaki, character name, artist name, clothes writing, light brown hair, yellow neckwear, corset, propeller hair ornament",
  "wd_swinv2_tagger_v3_tags": "{\"ratings\": {\"general\": 0.0654296875, \"sensitive\": 0.92578125, \"questionable\": 0.00136566162109375, \"explicit\": 0.00012302398681640625}, \"character\": {\"teruzuki_(kancolle)\": 0.9921875}, \"general\": {\"1girl\": 0.99609375, \"skirt\": 0.953125, \"school_uniform\": 0.91796875, \"serafuku\": 0.90234375, \"smile\": 0.8671875, \"ammunition_belt\": 0.8046875, \"solo\": 0.7734375, \"hairband\": 0.76953125, \"gloves\": 0.7578125, \"day\": 0.734375, \"breasts\": 0.73046875, \"braid\": 0.73046875, \"neckerchief\": 0.71875, \"brown_hair\": 0.70703125, \"miniskirt\": 0.6953125, \"sky\": 0.66015625, \"pleated_skirt\": 0.6484375, \"looking_at_viewer\": 0.64453125, \"clothes_writing\": 0.64453125, \"bullet\": 0.640625, \"blue_eyes\": 0.62890625, \"long_hair\": 0.60546875, \"propeller_hair_ornament\": 0.5859375, \"machinery\": 0.5390625, \"hair_ornament\": 0.50390625, \"blue_sky\": 0.498046875, \"twin_braids\": 0.494140625, \"black_skirt\": 0.484375, \"cloud\": 0.46875, \"headband\": 0.458984375, \"light_brown_hair\": 0.45703125, \"medium_breasts\": 0.44921875, \"short_sleeves\": 0.431640625, \"corset\": 0.431640625, \"blush\": 0.423828125, \"cowboy_shot\": 0.3984375, \"turret\": 0.3828125, \"outdoors\": 0.357421875, \"shell_casing\": 0.35546875}}"
}
```
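In the example row, `wd_swinv2_tagger_v3_tags` appears as a quoted value, which suggests it is stored as a JSON-encoded string and not as a nested object. A minimal sketch of decoding it and keeping only the higher-scoring general tags, under that assumption; `confident_tags` and the 0.5 threshold are hypothetical choices, and the row below is an abbreviated stand-in for the example.

```python
import json


def confident_tags(row: dict, threshold: float = 0.5) -> list[str]:
    """Return general tags scoring at least `threshold`, highest score first."""
    tagger_output = json.loads(row["wd_swinv2_tagger_v3_tags"])
    general = tagger_output["general"]
    kept = [(tag, score) for tag, score in general.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [tag for tag, _ in kept]


# Abbreviated stand-in for the example row (only a few tags kept).
row = {
    "wd_swinv2_tagger_v3_tags": json.dumps({
        "ratings": {"general": 0.0654296875, "sensitive": 0.92578125},
        "character": {"teruzuki_(kancolle)": 0.9921875},
        "general": {"1girl": 0.99609375, "skirt": 0.953125, "turret": 0.3828125},
    })
}
print(confident_tags(row))  # ['1girl', 'skirt']
```

If the field turns out to be a parsed object in a given loader, the `json.loads` call can simply be dropped.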

LLaVA-derived Captions

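First, a tag JSON was generated with the wd-swinv2-tagger-v3 multi-label classifier. This tag JSON is included in each row as wd_swinv2_tagger_v3_tags.

Character and series tags were taken from the Danbooru2021-SQLite dataset, because ground-truth tags were judged to be more accurate than synthetic ones for these categories.

Captions were then generated with llava-v1.6-34b in a distributed setup, using the following code: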

```py
# Entry, session, entry_id, md5, and anime_tags_swinv2 come from the
# author's pipeline and are not defined in this excerpt.
tags = Entry.get_tags_from_id(session, entry_id)  # GT tags
tag_dict = anime_tags_swinv2[md5]  # Predicted tags
caption = ', '.join(tag_dict['general'])

# Each ground-truth tag is a (name, category) pair: 4 = character, 3 = series.
character_tags = list(filter(lambda _t: _t[1] == 4, tags))
char_tag_s = ', '.join([_t[0] for _t in character_tags])
series_tags = list(filter(lambda _t: _t[1] == 3, tags))
series_tag_s = ', '.join([_t[0] for _t in series_tags])
if len(character_tags) > 0 and len(series_tags) > 0:
    prompt = f'This image is labeled with the series tag(s) {series_tag_s} and character tag(s) {char_tag_s}. It is also labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that.'
elif len(character_tags) > 0 and len(series_tags) == 0:
    prompt = f'This image is labeled with the character tag(s) {char_tag_s}. It is also labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that.'
elif len(character_tags) == 0 and len(series_tags) > 0:
    prompt = f'This image is labeled with the series tag(s) {series_tag_s}. It is also labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that.'
else:
    prompt = f'This image is labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that.'
```
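For readers without the author's `Entry` model or tagger cache, the same four-way branching can be exercised with plain tuples. This is a minimal sketch, assuming ground-truth tags arrive as `(name, category)` pairs with category 4 for characters and 3 for series, as the snippet implies; `build_prompt` and the sample tag values are hypothetical.

```python
CHARACTER, SERIES = 4, 3  # category ids used in the author's snippet

SUFFIX = (
    "Please explain the image with these tags considered. Go into details only "
    "about the contents of the scene and do not make suppositions outside of that."
)


def build_prompt(gt_tags, general_tags):
    """Assemble the captioning prompt from ground-truth and predicted tags."""
    chars = ", ".join(name for name, cat in gt_tags if cat == CHARACTER)
    series = ", ".join(name for name, cat in gt_tags if cat == SERIES)
    visual = ", ".join(general_tags)

    # Series is mentioned before character, matching the original wording.
    parts = []
    if series:
        parts.append(f"series tag(s) {series}")
    if chars:
        parts.append(f"character tag(s) {chars}")

    if parts:
        intro = (
            f"This image is labeled with the {' and '.join(parts)}. "
            f"It is also labeled with the visual aspect tags of {visual}."
        )
    else:
        intro = f"This image is labeled with the visual aspect tags of {visual}."
    return f"{intro} {SUFFIX}"
```

Building the sentence from parts keeps the shared instruction text in one place, so a wording change does not have to be repeated across four branches.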

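As a result, 53.91% of the captions explicitly mention tags, while the rest describe the image in natural language. These preliminary captions are stored in caption_llava_34b. The 53.91% of captions that explicitly reference tags were then rewritten with Meta-Llama-3-8B-Instruct using the following prompt: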

```py
prompt = "You will assist me into removing references to tags in the caption below. Those tags are Danbooru (anime imageboard) tags. For example, you must replace tag references of 1girl to a sentence that refers to one girl, anime franchise names or character names (eg son_goku, sasuke_uchicha) in clear references, like Son Goku and Sasuke Uchicha. Tags refering to franchises names, like for example, boku_no_hero, dragon_ball etc should be presented as Boku no Hero and Dragon Ball. When you see a tag discussed that is not noted elsewhere in natural language, try to extract the relevant meaning of the tag and rewrite the sentence as it applies to the description. After you are done, the new description should not contain the word tags or any explicit reference to underscore-containing tags. Please write only the new caption below:"
```

Any caption that this pass failed to clean was rewritten again with Meta-Llama-3-70B-Instruct.
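The source does not say how a failed cleanup was detected. One plausible check, sketched here purely as an assumption, follows the prompt's own success criterion: the rewritten caption should contain neither the word "tags" nor any underscore-joined token. `still_mentions_tags` is a hypothetical helper, not the author's code.

```python
import re

# A token with an underscore between word characters, e.g. "son_goku".
UNDERSCORE_TAG = re.compile(r"\b\w+_\w+\b")
# The standalone word "tag" or "tags", case-insensitive.
TAG_WORD = re.compile(r"\btags?\b", re.IGNORECASE)


def still_mentions_tags(caption: str) -> bool:
    """Return True if the caption still looks like it references Danbooru tags."""
    return bool(UNDERSCORE_TAG.search(caption) or TAG_WORD.search(caption))
```

A check like this will also flag innocent uses such as "price tag", so it is better suited to routing captions to a second pass than to discarding them.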

The fully cleaned captions are stored in each row as caption_llava_34b_no_tags.

Next, the captions were shortened with Meta-Llama-3-8B-Instruct using the following prompt:

```py
prompt = lambda img_caption: f"""
Please take the following image caption and attempt to distill it into a single paragraph. Remove any redundant lines or descriptions and make it a maximum of 200 words in length, while preserving all details about characters, series, scenes, and depictions.

{img_caption}

Please only write the caption and no other text.
"""
```

These short captions are stored in each row as caption_llava_34b_no_tags_short.
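The shortening prompt asks for a single paragraph of at most 200 words, but a language model will not always comply. A small sketch of a conformance check one could run over `caption_llava_34b_no_tags_short`; the helper name is hypothetical, and treating a blank line as a paragraph break is an assumption.

```python
MAX_WORDS = 200  # limit stated in the shortening prompt


def meets_shortening_spec(caption: str) -> bool:
    """True if the caption is one non-empty paragraph of at most MAX_WORDS words."""
    text = caption.strip()
    single_paragraph = "\n\n" not in text
    return bool(text) and single_paragraph and len(text.split()) <= MAX_WORDS
```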

CogVLM-derived Captions

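The ML-Danbooru multi-label classifier was used to generate a "tag" caption for each image. These captions are provided in each row as mldanbooru_tag_caption. They do not include character or series tags.

The caption_cogvlm field was generated with the CogVLM weights using the following prompt: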

```py
query = lambda tags_caption: f'''
The following image is described by this list of visual tags:

{tags_caption}

Using these tags and the image above, please create a long and exact description of the image that is at most one paragraph. Avoid describing things that are not in the scene or which describe interpretations, such as "the atmosphere exudes confidence", but be sure to describe every element you see in detail and any objects, characters, or interactions you see.
'''
```

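Cleaning up caption prefixes

Captions often begin with one of a small set of repeated prefixes. These can be removed with the following: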

```py
REPEATED_OPENINGS = [
    ('The image showcases ', ''),
    ('The image portrays ', ''),
    ('The image appears to be ', ''),
    ('The image is ', ''),
    ('The image depicts ', ''),
    ('The image features ', ''),
    ('The image captures ', ''),
    ('The image shows ', ''),
    ('The image displays ', ''),
    ('The image presents ', ''),
    ('This image showcases ', ''),
    ('This image portrays ', ''),
    ('This image appears to be ', ''),
    ('This image is ', ''),
    ('This image depicts ', ''),
    ('This image features ', ''),
    ('This image captures ', ''),
    ('This image shows ', ''),
    # The listing is truncated at this point in the source.
]
```
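The source listing breaks off before showing how the pairs are applied. A minimal sketch of one way to use them, assuming each pair is `(prefix, replacement)` and that the first letter of what remains should be capitalised; `strip_repeated_opening` is a hypothetical name, and only three of the pairs are repeated here to keep the example short.

```python
REPEATED_OPENINGS = [
    ("The image showcases ", ""),
    ("The image depicts ", ""),
    ("This image shows ", ""),
]


def strip_repeated_opening(caption: str, openings=REPEATED_OPENINGS) -> str:
    """Remove the first matching stock opening and re-capitalise the rest."""
    for prefix, replacement in openings:
        if caption.startswith(prefix):
            rest = replacement + caption[len(prefix):]
            return rest[:1].upper() + rest[1:]
    return caption  # no stock opening found; leave the caption untouched


print(strip_repeated_opening("The image depicts an animated character."))
# An animated character.
```

Matching only at the start of the string leaves later occurrences of the same phrase in place, which matters for the multi-paragraph LLaVA captions.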

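Collected summary

Dataset introduction

Construction

The dataset was built by combining several multimodal models. The pipeline starts with safe-content screening of the Danbooru 2021 dataset: images pass a two-stage filter, Falconsai/nsfw_image_detection followed by ML-Danbooru, so that the material meets SFW requirements. Four captions are then produced for each image. CogVLM writes a detailed paragraph guided by visual tags generated by ML-Danbooru. LLaVA-v1.6-34b generates an initial description from a dynamic prompt template that combines tags predicted by the wd-swinv2-tagger-v3 classifier with ground-truth character and series tags from Danbooru2021-SQLite. To improve the language, Meta-Llama-3 models rewrite the descriptions that reference tags and then condense them into a single paragraph, yielding original, cleaned, and shortened versions of the LLaVA caption. The whole process ran on distributed compute and covers 1.43 million images and 5.71 million captions.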

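Features

The dataset's main value is that each image comes with four heterogeneous captions: a CogVLM caption grounded in visual tags and three versions of the LLaVA output, which together span several levels of granularity and style. Most captions exceed 77 tokens, beyond what standard CLIP text encoders can process, which makes the dataset a comparatively rare resource for long-text image understanding. The dataset follows SFW content requirements, with a two-stage filter that removes adult-themed material, which makes it suitable for academic research. Each record also carries tags generated by ML-Danbooru and wd-swinv2-tagger-v3, together with ground-truth metadata from Danbooru2021-SQLite, giving several complementary views of image semantics. This structure provides a useful comparison baseline for research on cross-modal retrieval and caption quality evaluation.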

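Usage

The dataset supports several research workflows in anime image understanding. Researchers can load it directly from the Hugging Face platform and use its image identifiers and multiple caption fields for model training and evaluation. For image captioning, the CogVLM captions can be compared with the different LLaVA versions to study stylistic differences and the behavior of vision-language models. For cross-modal retrieval, the tag fields and long captions can be used to build fine-grained image-text matching benchmarks. The prefix-cleaning snippet supplied with the dataset helps make caption openings more uniform. Because the captions contain model hallucinations, it is advisable to combine them with human evaluation or a correction mechanism. The dataset is released under the CC BY-SA 4.0 license, which permits derivative academic and commercial use.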
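A hedged sketch of the loading step, assuming the Hugging Face `datasets` library is installed and that the data is exposed as a `train` split (the source does not name the split). `pick_caption` is a hypothetical helper that falls back through the four caption fields in a caller-chosen order, which is useful when a shorter caption is preferred but may be missing.

```python
CAPTION_FIELDS = (
    "caption_llava_34b_no_tags_short",
    "caption_llava_34b_no_tags",
    "caption_llava_34b",
    "caption_cogvlm",
)


def pick_caption(row: dict, order=CAPTION_FIELDS) -> str:
    """Return the first non-empty caption among the preferred fields."""
    for field in order:
        text = (row.get(field) or "").strip()
        if text:
            return text
    return ""


def stream_rows(split: str = "train"):
    """Stream rows from the Hub without downloading the whole dataset."""
    # Imported lazily so pick_caption stays usable without the library.
    from datasets import load_dataset

    return load_dataset(
        "CaptionEmporium/anime-caption-danbooru-2021-sfw-5m-hq",
        split=split,  # assumed split name
        streaming=True,
    )
```

With network access, `for row in stream_rows(): print(row["id"], pick_caption(row)); break` would print the first row's id and its preferred caption.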

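Background and challenges

Background

In anime image understanding and generation, high-quality image-text pairs at scale are essential for advancing multimodal models. The anime-caption-danbooru-2021-sfw-5m-hq dataset, released by CaptionEmporium in 2024, was built to meet this need. It is derived from the safe subset of the Danbooru2021 image collection and uses vision-language models such as CogVLM and LLaVA-v1.6-34b to generate about 5.71 million varied English captions for more than 1.4 million anime-style images. Its central question is how to build a large, fine-grained, semantically rich caption corpus for the anime domain in support of image-to-text generation, retrieval, and understanding tasks.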

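Current challenges

The dataset targets automatic captioning of anime images. The core difficulty is that anime images are highly stylized and symbolic and often depend on subculture-specific knowledge, so general-purpose vision-language models struggle to capture their aesthetics, character attributes, and narrative context. Construction raised several technical challenges. First, adult content had to be filtered out of the original Danbooru data using automated models such as Falconsai/nsfw_image_detection, which can produce both false positives and misses. Second, the pipeline generates initial tags with multi-label classifiers (ML-Danbooru, wd-swinv2-tagger-v3) and uses them to guide large models in writing captions, a process prone to hallucination, especially for obscure series or complex manga panel layouts. Third, most captions exceed 77 tokens, which makes them poorly compatible with current CLIP-based discriminative methods and limits their direct use in some downstream tasks.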

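Common scenarios

Classic use cases

The dataset's large set of anime-style images with multiple generated captions makes it a useful research resource. Its classic use is training and evaluating vision-language models, in particular for captioning images in the anime art style. With the varied captions produced by CogVLM and LLaVA, researchers can build more precise cross-modal alignment systems and study how well models understand complex visual scenes and generate natural language about them.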

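Derived and related work

According to this summary, a body of work on anime multimodal learning has grown up around the dataset. Such work typically uses the caption variants (original, cleaned, and short) to tune model architectures or training strategies, for example by improving how a visual encoder extracts anime-specific features or by studying how prompt engineering affects caption quality. These efforts have fed into directions such as style-adaptive image captioning and cross-modal retrieval, with ongoing model iteration and benchmarking in the open-source community.

Recent research on the dataset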