tagger_vocabs
ModelScope community. Updated: 2025-12-04. Indexed: 2025-11-29.
Download link:
https://modelscope.cn/datasets/deepghs/tagger_vocabs
Resource description:
# Tagger Vocabularies Dataset
## Summary
This repository collects the **tag vocabularies** of several image tagging models: **DeepDanbooru**, **MLDanbooru**, and a number of **Waifu Diffusion** tagger variants. Each vocabulary file lists thousands of tags together with their aliases, category, usage count, and word breakdowns, so the tags can be used directly in **natural language processing** and **computer vision** work.
Every file is a JSON array of tag records with the same fields: name, aliases, category, occurrence count, and word groupings. This **vocabulary standardization** means the vocabularies of different tagger models can be compared or mapped onto one another, which helps when building tagging pipelines, interpreting model output, and moving between annotation frameworks.
Tag counts range from hundreds to millions of occurrences. The numeric category (0-9) groups tags by semantic domain, and the word breakdowns show how each tag is composed and how tags relate to one another. Both are useful for **multi-modal learning** applications that link visual content with textual annotation.
The tags cover character attributes, clothing, accessories, settings, and content ratings, so the dataset suits content moderation, image search, automated annotation, and AI-assisted creative tools. Including several model variants captures differences in tagging conventions and annotation granularity.
## Usage
For direct file access:
```python
import json

from huggingface_hub import hf_hub_download

# Download and load specific vocabulary file
file_path = hf_hub_download(
    repo_id="deepghs/tagger_vocabs",
    filename="deepdanbooru/tags.json",
    repo_type="dataset",
)
with open(file_path, 'r', encoding='utf-8') as f:
    tags_data = json.load(f)

# Process tags
for tag in tags_data[:10]:
    print(f"Tag: {tag['name']} (Category: {tag['category']})")
    print(f"Aliases: {', '.join(tag['aliases'])}")
    print(f"Occurrences: {tag['count']}")
```
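Once a vocabulary is loaded, a common next step is resolving free-form tag strings to primary tag names. The following is a minimal sketch, assuming each record carries the `name` and `aliases` fields shown in this README. The helper `build_alias_index` and the sample records are hypothetical, written for illustration only; they are not part of the dataset or of any library.

```python
from typing import Dict, List


def build_alias_index(tags: List[dict]) -> Dict[str, str]:
    """Map every alias and primary name to the primary tag name."""
    index: Dict[str, str] = {}
    # First pass: aliases. setdefault keeps the first owner of a shared alias.
    for tag in tags:
        for alias in tag.get("aliases", []):
            index.setdefault(alias, tag["name"])
    # Second pass: primary names always resolve to themselves.
    for tag in tags:
        index[tag["name"]] = tag["name"]
    return index


# Made-up records in the documented shape (not real dataset values).
sample_tags = [
    {"name": "long_hair", "aliases": ["longhair", "long_hair"], "count": 500},
    {"name": "hat", "aliases": ["headwear"], "count": 120},
]

alias_index = build_alias_index(sample_tags)
print(alias_index["longhair"])     # long_hair
print(alias_index.get("unknown"))  # None
```

With real data, `sample_tags` would be replaced by the `tags_data` list loaded above.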
## Available Vocabularies
The dataset includes vocabulary files for the following tagger models:
- **deepdanbooru/tags.json** (3.12 MB) - Original DeepDanbooru vocabulary
- **mldanbooru/tags.json** (4.00 MB) - MLDanbooru vocabulary with enhanced coverage
- **wd-v1-4-convnext-tagger/tags.json** (2.18 MB) - Waifu Diffusion ConvNeXT tagger
- **wd-v1-4-convnext-tagger-v2/tags.json** - Updated ConvNeXT tagger vocabulary
- **wd-v1-4-convnextv2-tagger-v2/tags.json** (3.08 MB) - ConvNeXTV2-based tagger
- **wd-v1-4-moat-tagger-v2/tags.json** (3.08 MB) - MOAT architecture tagger
- **wd-v1-4-swinv2-tagger-v2/tags.json** - SwinV2 transformer tagger
- **wd-v1-4-vit-tagger/tags.json** - Vision Transformer tagger
- **wd-v1-4-vit-tagger-v2/tags.json** - Updated ViT tagger vocabulary
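
Because the files listed above share one schema, two vocabularies can be compared by their tag names. The sketch below is one possible approach, not a method prescribed by the dataset: `compare_vocabs` is a hypothetical helper, and the two small lists stand in for vocabularies that would normally be read from two of the `tags.json` files.

```python
from typing import Dict, List, Set


def tag_names(tags: List[dict]) -> Set[str]:
    """Collect the primary names from a list of tag records."""
    return {tag["name"] for tag in tags}


def compare_vocabs(a: List[dict], b: List[dict]) -> Dict[str, object]:
    """Report shared and exclusive tag names plus the Jaccard similarity."""
    names_a, names_b = tag_names(a), tag_names(b)
    shared = names_a & names_b
    union = names_a | names_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(names_a - names_b),
        "only_b": sorted(names_b - names_a),
        # Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Made-up stand-ins for two loaded vocabularies.
vocab_a = [{"name": "1girl"}, {"name": "long_hair"}, {"name": "hat"}]
vocab_b = [{"name": "1girl"}, {"name": "long_hair"}, {"name": "smile"}]

report = compare_vocabs(vocab_a, vocab_b)
print(report["shared"])   # ['1girl', 'long_hair']
print(report["jaccard"])  # 0.5
```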
## Data Structure
Each vocabulary file follows the same JSON structure:
```json
[
  {
    "aliases": ["alternative_names"],
    "category": 0,
    "count": 12345,
    "id": 123456,
    "name": "primary_tag_name",
    "words": [
      ["word", "breakdown"],
      ["alternative", "phrasing"]
    ]
  }
]
```
**Field Descriptions:**
- `aliases`: Alternative names/synonyms for the tag
- `category`: Numerical category (0-9) for semantic grouping
- `count`: Number of occurrences in training data
- `id`: Unique identifier for the tag
- `name`: Primary tag name
- `words`: Semantic word breakdowns for NLP processing
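
The fields above lend themselves to simple filtering. As a hedged sketch, the code below ranks tags within one category by `count` and searches the `words` breakdowns for a given word. The function names `top_tags` and `tags_with_word` are hypothetical, and the sample records and their category numbers are invented for illustration; the README does not state what each category number means.

```python
from typing import List


def top_tags(tags: List[dict], category: int, n: int = 5) -> List[str]:
    """Return the n most frequent tag names within one category."""
    in_category = [t for t in tags if t["category"] == category]
    in_category.sort(key=lambda t: t["count"], reverse=True)
    return [t["name"] for t in in_category[:n]]


def tags_with_word(tags: List[dict], word: str) -> List[str]:
    """Return names of tags whose `words` breakdowns contain the word."""
    return [
        t["name"]
        for t in tags
        if any(word in breakdown for breakdown in t.get("words", []))
    ]


# Made-up records in the documented shape (not real dataset values).
sample = [
    {"name": "long_hair", "category": 0, "count": 500, "words": [["long", "hair"]]},
    {"name": "short_hair", "category": 0, "count": 300, "words": [["short", "hair"]]},
    {"name": "hat", "category": 0, "count": 120, "words": [["hat"]]},
    {"name": "example_tag", "category": 1, "count": 10, "words": [["example", "tag"]]},
]

print(top_tags(sample, category=0, n=2))  # ['long_hair', 'short_hair']
print(tags_with_word(sample, "hair"))     # ['long_hair', 'short_hair']
```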
## Original Content
Vocabulary data for tagger models; may be useful for basic NLP calculations.
## Citation
```bibtex
@misc{tagger_vocabs,
title = {Tagger Vocabularies Dataset},
author = {deepghs},
howpublished = {\url{https://huggingface.co/datasets/deepghs/tagger_vocabs}},
year = {2023},
note = {Comprehensive vocabulary datasets for image tagging models including DeepDanbooru, MLDanbooru, and Waifu Diffusion taggers},
abstract = {This repository provides a comprehensive collection of vocabulary datasets specifically designed for image tagging and annotation systems. The dataset contains structured tag vocabularies from multiple popular tagging models including DeepDanbooru, MLDanbooru, and various Waifu Diffusion tagger variants. Each vocabulary file contains detailed metadata for thousands of tags, organized with aliases, categories, usage counts, and word breakdowns to support robust natural language processing and computer vision applications. The dataset features meticulously structured JSON files that include essential tag information such as name, aliases, category classification, occurrence counts, and semantic word groupings.},
keywords = {vocabulary, tagging, image-annotation, nlp, danbooru}
}
```
Provider:
maas
Created:
2024-12-03



