CaptionEmporium/anime-caption-danbooru-2021-sfw-5m-hq
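Dataset Card for anime-caption-danbooru-2021-sfw-5m-hq
Dataset Description
- Point of Contact: Caption Emporium
Dataset Summary
This is a caption dataset of 5.71M captions describing 1.43M images, drawn from a safe-for-work (SFW) filtered subset of the Danbooru 2021 dataset. Each image has four captions: one generated by CogVLM, one generated by llava-v1.6-34b, one llava-v1.6-34b caption after cleaning, and one llava-v1.6-34b caption after shortening. The generation method for each is detailed below.
Most captions are substantially longer than 77 tokens, the context length of CLIP's text encoder, so they are unsuitable for classification with current CLIP-based methods.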
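The 77-token limit can be screened for without a tokenizer. The sketch below is a conservative check, not part of the dataset's tooling: it assumes every whitespace-separated word produces at least one CLIP token and that two positions are taken by start and end markers, so the word count is only a lower bound. The function name is hypothetical.

```python
# CLIP's text encoder accepts 77 tokens; two are start/end markers.
CLIP_CONTEXT_LENGTH = 77
SPECIAL_TOKENS = 2


def surely_exceeds_clip_context(caption: str) -> bool:
    """Return True when the caption cannot fit in CLIP's context window.

    The word count is a lower bound on the token count, so a False result
    is inconclusive: the caption may still be too long once tokenized.
    """
    word_count = len(caption.split())
    return word_count + SPECIAL_TOKENS > CLIP_CONTEXT_LENGTH


short_caption = "A girl with brown hair stands in front of a naval turret."
long_caption = " ".join(["word"] * 120)
print(surely_exceeds_clip_context(short_caption))
print(surely_exceeds_clip_context(long_caption))
```

For an exact count, the captions would need to be run through the actual CLIP tokenizer.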
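Languages
The captions are in English.
SFW Filtering
The safe-for-work portion of the original dataset was first filtered with Falconsai/nsfw_image_detection. ML-Danbooru was then used to detect tags that commonly accompany pornographic material (such as panties), and anything matching those adult-themed tags was removed.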
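The tag-based stage of this filter can be sketched as a blocklist check over predicted tags. This is an illustration only. The blocklist contents (beyond the one example named above), the confidence threshold, and the tag-to-score mapping are all assumptions, and ML-Danbooru's real output format is not shown in this card.

```python
# Hypothetical blocklist; the card names only "panties" as an example.
ADULT_TAGS = {"panties"}
# Hypothetical confidence cutoff for treating a predicted tag as present.
TAG_THRESHOLD = 0.5


def passes_tag_filter(predicted_tags: dict) -> bool:
    """Keep an image only if no blocklisted tag reaches the threshold."""
    return not any(
        score >= TAG_THRESHOLD
        for tag, score in predicted_tags.items()
        if tag in ADULT_TAGS
    )


safe_image = {"1girl": 0.99, "school_uniform": 0.91, "panties": 0.02}
flagged_image = {"1girl": 0.98, "panties": 0.87}
print(passes_tag_filter(safe_image))
print(passes_tag_filter(flagged_image))
```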
Data Instances
An example row:

```json
{
  "id": 2341041,
  "md5": "d1d9046c08cffa243d629d68a238fffa",
  "caption_cogvlm": "The image showcases an anime-style illustration of a young woman with short brown hair, wearing a black skirt, a white blouse, and a yellow neckwear. She has a cheerful expression, with her brown eyes looking directly at the viewer. The woman is equipped with a large, intricate piece of machinery on her right side, which appears to be a part of a military or combat vehicle. The background is a clear blue sky, suggesting a daytime setting. The image also contains textual elements, including the characters name st. coochacha ueg6, the title Central Collection Agency Illustration 0006, and the artists name TWOBAN.",
  "caption_llava_34b": "The image depicts an animated character, a girl with brown hair styled in twin tails, wearing a school uniform with a white blouse, a black pleated skirt, and a yellow neckerchief. She has a headband with a bow and is smiling at the viewer. The character is also wearing a pair of gloves and has a propeller hair ornament.\nThe character is standing in front of a large piece of machinery, which appears to be a naval turret, suggesting a military or naval theme. The machinery is detailed with visible ammunition belts and shells, indicating that it is a weapon system.\nThe background of the image shows a clear blue sky with a few clouds, suggesting that the scene is set outdoors during the day. The overall style of the image is reminiscent of anime or manga, with vibrant colors and detailed character design.\nThe text \"stomachache\" and \"Kantai Collection\" are visible in the image, which could be part of the title or a reference to the series or game the character is from. The text \"Kantai Collection\" is a well-known Japanese video game and media franchise that features anthropomorphic naval ships as characters.\nThe character is labeled as \"Teruzuki\" from \"Kantai Collection,\" and the object she is holding is identified as \"Chou-10cm-hou-chan,\" which is likely a reference to a specific weapon or item within the context of the series.\nThe image is a digital illustration, and the character is drawn in a style that is typical for anime and manga, with exaggerated features and a focus on the characters expression and attire. The artwork is detailed and colorful, with a strong emphasis on the characters pose and the machinery in the background.",
  "caption_llava_34b_no_tags": "The image depicts an animated character, a girl with brown hair styled in twin tails, wearing a school uniform with a white blouse, a black pleated skirt, and a yellow neckerchief. She has a headband with a bow and is smiling at the viewer. The character is also wearing a pair of gloves and has a propeller hair ornament.\nThe character is standing in front of a large piece of machinery, which appears to be a naval turret, suggesting a military or naval theme. The machinery is detailed with visible ammunition belts and shells, indicating that it is a weapon system.\nThe background of the image shows a clear blue sky with a few clouds, suggesting that the scene is set outdoors during the day. The overall style of the image is reminiscent of anime or manga, with vibrant colors and detailed character design.\nThe text \"stomachache\" and \"Kantai Collection\" are visible in the image, which could be part of the title or a reference to the series or game the character is from. The character is labeled as Teruzuki from the Kantai Collection, a well-known Japanese video game and media franchise that features anthropomorphic naval ships as characters.\nThe image is a digital illustration, and the character is drawn in a style that is typical for anime and manga, with exaggerated features and a focus on the characters expression and attire. The artwork is detailed and colorful, with a strong emphasis on the characters pose and the machinery in the background.\nThe character is holding an object identified as Chou-10cm-hou-chan, which is likely a reference to a specific weapon or item within the context of the series. ",
  "caption_llava_34b_no_tags_short": "Teruzuki, a girl with brown hair styled in twin tails, stands in front of a naval turret, wearing a school uniform and a propeller hair ornament. She smiles at the viewer, her gloves and bow-adorned headband adding to her charm. The background features a clear blue sky with clouds, while the machinery behind her is detailed with ammunition belts and shells. The image is a digital illustration, blending anime and manga styles with vibrant colors and exaggerated features. ",
  "mldanbooru_tag_caption": "anime style picture of a woman or girl, brown hair, long hair, solo, black skirt, blue eyes, skirt, neckerchief, braid, headband, breasts, day, sky, smile, gloves, looking at viewer, thighhighs, twin braids, school uniform, serafuku, cowboy shot, hair ornament, hairband, medium breasts, machinery, pleated skirt, turret, grey eyes, cannon, miniskirt, black gloves, hachimaki, character name, artist name, clothes writing, light brown hair, yellow neckwear, corset, propeller hair ornament",
  "wd_swinv2_tagger_v3_tags": "{\"ratings\": {\"general\": 0.0654296875, \"sensitive\": 0.92578125, \"questionable\": 0.00136566162109375, \"explicit\": 0.00012302398681640625}, \"character\": {\"teruzuki_(kancolle)\": 0.9921875}, \"general\": {\"1girl\": 0.99609375, \"skirt\": 0.953125, \"school_uniform\": 0.91796875, \"serafuku\": 0.90234375, \"smile\": 0.8671875, \"ammunition_belt\": 0.8046875, \"solo\": 0.7734375, \"hairband\": 0.76953125, \"gloves\": 0.7578125, \"day\": 0.734375, \"breasts\": 0.73046875, \"braid\": 0.73046875, \"neckerchief\": 0.71875, \"brown_hair\": 0.70703125, \"miniskirt\": 0.6953125, \"sky\": 0.66015625, \"pleated_skirt\": 0.6484375, \"looking_at_viewer\": 0.64453125, \"clothes_writing\": 0.64453125, \"bullet\": 0.640625, \"blue_eyes\": 0.62890625, \"long_hair\": 0.60546875, \"propeller_hair_ornament\": 0.5859375, \"machinery\": 0.5390625, \"hair_ornament\": 0.50390625, \"blue_sky\": 0.498046875, \"twin_braids\": 0.494140625, \"black_skirt\": 0.484375, \"cloud\": 0.46875, \"headband\": 0.458984375, \"light_brown_hair\": 0.45703125, \"medium_breasts\": 0.44921875, \"short_sleeves\": 0.431640625, \"corset\": 0.431640625, \"blush\": 0.423828125, \"cowboy_shot\": 0.3984375, \"turret\": 0.3828125, \"outdoors\": 0.357421875, \"shell_casing\": 0.35546875}}"
}
```
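In the example row, `wd_swinv2_tagger_v3_tags` appears to be a JSON document stored as a string, so it needs a second decoding step before the scores can be used. The sketch below shows that step on an abbreviated stand-in row. The helper name and the 0.5 threshold are hypothetical choices, not part of the dataset.

```python
import json

# Abbreviated stand-in for one row; only the fields used below are included.
row = {
    "id": 2341041,
    "wd_swinv2_tagger_v3_tags": json.dumps({
        "ratings": {"general": 0.0654296875, "sensitive": 0.92578125},
        "character": {"teruzuki_(kancolle)": 0.9921875},
        "general": {"1girl": 0.99609375, "skirt": 0.953125, "turret": 0.3828125},
    }),
}


def confident_general_tags(row: dict, threshold: float = 0.5) -> list:
    """Decode the nested tag JSON and keep general tags at or above threshold."""
    tags = json.loads(row["wd_swinv2_tagger_v3_tags"])
    return [tag for tag, score in tags["general"].items() if score >= threshold]


print(confident_general_tags(row))
```

Lowering the threshold admits more speculative tags such as `turret`, which the tagger scored at about 0.38 for this image.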
LLaVA-derived Captions
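First, a tag JSON was generated with the wd-swinv2-tagger-v3 multi-label classifier model. This tag JSON is included in each row as wd_swinv2_tagger_v3_tags.
Character and series tags were taken from the Danbooru2021-SQLite dataset, because these ground-truth tags were considered more accurate than synthetic ones.
Captions were generated with llava-v1.6-34b in a distributed setup using the following code: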
```py
# Entry, session, entry_id, anime_tags_swinv2, and md5 come from the
# surrounding captioning pipeline, which is not shown here.
tags = Entry.get_tags_from_id(session, entry_id)  # GT tags
tag_dict = anime_tags_swinv2[md5]  # Predicted tags
caption = ", ".join(tag_dict["general"])

# Each ground-truth tag is a (name, category) pair: 4 = character, 3 = series.
character_tags = list(filter(lambda _t: _t[1] == 4, tags))
char_tag_s = ", ".join([_t[0] for _t in character_tags])
series_tags = list(filter(lambda _t: _t[1] == 3, tags))
series_tag_s = ", ".join([_t[0] for _t in series_tags])

if len(character_tags) > 0 and len(series_tags) > 0:
    prompt = f"This image is labeled with the series tag(s) {series_tag_s} and character tag(s) {char_tag_s}. It is also labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that."
elif len(character_tags) > 0 and len(series_tags) == 0:
    prompt = f"This image is labeled with the character tag(s) {char_tag_s}. It is also labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that."
elif len(character_tags) == 0 and len(series_tags) > 0:
    prompt = f"This image is labeled with the series tag(s) {series_tag_s}. It is also labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that."
else:
    prompt = f"This image is labeled with the visual aspect tags of {caption}. Please explain the image with these tags considered. Go into details only about the contents of the scene and do not make suppositions outside of that."
```
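As a result, 53.91% of the captions explicitly mentioned tags, while the rest described the image in natural language. These preliminary captions are included as caption_llava_34b. The 53.91% of captions that explicitly referenced tags were then recaptioned with Meta-Llama-3-8B-Instruct using the following prompt: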
```py
prompt = "You will assist me into removing references to tags in the caption below. Those tags are Danbooru (anime imageboard) tags. For example, you must replace tag references of 1girl to a sentence that refers to one girl, anime franchise names or character names (eg son_goku, sasuke_uchicha) in clear references, like Son Goku and Sasuke Uchicha. Tags refering to franchises names, like for example, boku_no_hero, dragon_ball etc should be presented as Boku no Hero and Dragon Ball. When you see a tag discussed that is not noted elsewhere in natural language, try to extract the relevant meaning of the tag and rewrite the sentence as it applies to the description. After you are done, the new description should not contain the word tags or any explicit reference to underscore-containing tags. Please write only the new caption below:"
```
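Any captions that could not be cleaned with this method were then recaptioned again using Meta-Llama-3-70B-Instruct.
The fully cleaned captions are stored in each row as caption_llava_34b_no_tags.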
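The card does not say how captions that still referenced tags were identified. The sketch below is one plausible heuristic based on the wording of the cleaning prompt (the word "tags" or an underscore-joined token), followed by an escalation loop that mirrors the smaller-model-then-larger-model order. Both functions are hypothetical. The cleaners are stand-in callables rather than real model calls, and passing the original caption to each cleaner is an assumption.

```python
import re

# Matches underscore-joined tokens such as "propeller_hair_ornament"
# or "teruzuki_(kancolle)".
UNDERSCORE_TAG = re.compile(r"[A-Za-z0-9]+_[A-Za-z0-9(]+")
# Matches "tag" or "tags" as a whole word, but not words like "stage".
TAG_WORD = re.compile(r"\btags?\b", re.IGNORECASE)


def mentions_tags(caption: str) -> bool:
    """Heuristically detect leftover tag references in a caption."""
    return bool(UNDERSCORE_TAG.search(caption) or TAG_WORD.search(caption))


def clean_with_fallback(caption: str, cleaners: list) -> str:
    """Apply cleaners in order until the result no longer mentions tags."""
    result = caption
    for clean in cleaners:
        if not mentions_tags(result):
            break
        result = clean(caption)
    return result


raw = "The tags 1girl and propeller_hair_ornament describe the image."
# Stand-ins for the smaller and larger models, in that order.
stub_small = lambda c: "One girl wears a propeller_hair_ornament."
stub_large = lambda c: "One girl wears a propeller hair ornament."
print(clean_with_fallback(raw, [stub_small, stub_large]))
```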
The captions were then shortened with Meta-Llama-3-8B-Instruct using the following prompt:
```py
prompt = lambda img_caption: f"""
Please take the following image caption and attempt to distill it into a single paragraph. Remove any redundant lines or descriptions and make it a maximum of 200 words in length, while preserving all details about characters, series, scenes, and depictions.

{img_caption}

Please only write the caption and no other text.
"""
```
These short captions are stored in each row as caption_llava_34b_no_tags_short.
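Because the shortening prompt asks for a single paragraph of at most 200 words, a consumer of the dataset may want to verify that a given short caption meets both constraints. The following is a minimal sketch of such a check. The function name and the returned dictionary layout are hypothetical.

```python
# Word limit requested in the shortening prompt.
MAX_WORDS = 200


def check_short_caption(caption: str) -> dict:
    """Report whether a shortened caption meets the prompt's constraints."""
    text = caption.strip()  # the example row carries a trailing space
    word_count = len(text.split())
    return {
        "word_count": word_count,
        "within_limit": word_count <= MAX_WORDS,
        "single_paragraph": "\n" not in text,
    }


example = (
    "Teruzuki, a girl with brown hair styled in twin tails, stands in front "
    "of a naval turret, wearing a school uniform and a propeller hair ornament. "
)
print(check_short_caption(example))
```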
CogVLM-derived Captions
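A "tag" caption was generated for each image with the ML-Danbooru multi-label classifier. These captions are provided in each row as mldanbooru_tag_caption. They do not include character or series tags.
The caption_cogvlm field was generated with the CogVLM weights using the following prompt: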
```py
query = lambda tags_caption: f'''
The following image is described by this list of visual tags:

{tags_caption}

Using these tags and the image above, please create a long and exact description of the image that is at most one paragraph. Avoid describing things that are not in the scene or which describe interpretations, such as "the atmosphere exudes confidence", but be sure to describe every element you see in detail and any objects, characters, or interactions you see.
'''
```
Cleaning Caption Prefixes
Captions often begin with repeated prefixes. These can be removed with the following approach:
py REPEATED_OPENINGS = [ ("The image showcases ", ""), ("The image portrays ", ""), ("The image appears to be ", ""), ("The image is ", ""), ("The image depicts ", ""), ("The image features ", ""), ("The image captures ", ""), ("The image shows ", ""), ("The image displays ", ""), ("The image presents ", ""), ("This image showcases ", ""), ("This image portrays ", ""), ("This image appears to be ", ""), ("This image is ", ""), ("This image depicts ", ""), ("This image features ", ""), ("This image captures ", ""), ("This image shows ", ""),




