deepghs/tagger_vocabs
Source: Hugging Face. Updated 2025-11-17. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/deepghs/tagger_vocabs
Description:
---
license: openrail
task_categories: [feature-extraction]
language: [en, multilingual]
size_categories: [100K<n<1M]
tags: [vocabulary, tagging, image-annotation, nlp, danbooru, waifu-diffusion]
---
# Tagger Vocabularies Dataset
## Summary
This repository collects **vocabulary datasets** for image tagging and annotation systems. It contains structured tag vocabularies from several tagging models, including **DeepDanbooru**, **MLDanbooru**, and several **Waifu Diffusion** tagger variants. Each vocabulary file holds metadata for thousands of tags (aliases, category, usage count, and word breakdowns), which supports **natural language processing** and **computer vision** applications.
Each JSON file records a tag's name, aliases, category, occurrence count, and semantic word groupings. Researchers and developers can use this information to build tagging systems, improve model interpretability, and map tags between models. Because all files share one schema, this **vocabulary standardization** makes it easier to compare taggers and to transfer learning between annotation frameworks.
The vocabularies belong to taggers trained on large-scale image datasets, and per-tag counts range from hundreds to millions of occurrences. The `category` field (0-9) groups tags by semantic domain, and the word breakdowns show how each tag is composed and how tags relate to one another. This makes the dataset useful for **multi-modal learning** applications that connect visual content understanding with textual annotation.
The tags cover character attributes, clothing, accessories, settings, and content ratings, so the dataset suits content moderation, image search, automated annotation, and AI-assisted creative tools. Including multiple model variants covers different tagging conventions and annotation granularities.
## Usage
For direct file access:
```python
import json

from huggingface_hub import hf_hub_download

# Download and load a specific vocabulary file
file_path = hf_hub_download(
    repo_id="deepghs/tagger_vocabs",
    filename="deepdanbooru/tags.json",
    repo_type="dataset",
)
with open(file_path, 'r', encoding='utf-8') as f:
    tags_data = json.load(f)

# Process the first ten tags
for tag in tags_data[:10]:
    print(f"Tag: {tag['name']} (Category: {tag['category']})")
    print(f"Aliases: {', '.join(tag['aliases'])}")
    print(f"Occurrences: {tag['count']}")
```
## Available Vocabularies
The dataset includes vocabulary files for the following tagger models:
- **deepdanbooru/tags.json** (3.12 MB) - Original DeepDanbooru vocabulary
- **mldanbooru/tags.json** (4.00 MB) - MLDanbooru vocabulary with enhanced coverage
- **wd-v1-4-convnext-tagger/tags.json** (2.18 MB) - Waifu Diffusion ConvNeXt tagger
- **wd-v1-4-convnext-tagger-v2/tags.json** - Updated ConvNeXt tagger vocabulary
- **wd-v1-4-convnextv2-tagger-v2/tags.json** (3.08 MB) - ConvNeXt V2-based tagger
- **wd-v1-4-moat-tagger-v2/tags.json** (3.08 MB) - MOAT architecture tagger
- **wd-v1-4-swinv2-tagger-v2/tags.json** - SwinV2 transformer tagger
- **wd-v1-4-vit-tagger/tags.json** - Vision Transformer tagger
- **wd-v1-4-vit-tagger-v2/tags.json** - Updated ViT tagger vocabulary
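Because every file exposes a `name` field, comparing two of these vocabularies can be reduced to set operations on tag names. The following is a minimal sketch rather than part of the dataset's tooling: the helper `tag_overlap` and the tiny in-memory sample records are hypothetical, and in practice the two lists would come from `json.load` on two of the files listed above.

```python
from typing import Dict, List


def tag_overlap(vocab_a: List[dict], vocab_b: List[dict]) -> Dict[str, object]:
    """Compare two vocabularies by their primary tag names."""
    names_a = {tag["name"] for tag in vocab_a}
    names_b = {tag["name"] for tag in vocab_b}
    shared = names_a & names_b
    union = names_a | names_b
    return {
        "shared": sorted(shared),
        "only_a": sorted(names_a - names_b),
        "only_b": sorted(names_b - names_a),
        # Jaccard similarity |A ∩ B| / |A ∪ B|, taken as 0.0 when both sets are empty
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical sample records standing in for two loaded tags.json files
sample_a = [{"name": "long_hair"}, {"name": "smile"}, {"name": "hat"}]
sample_b = [{"name": "long_hair"}, {"name": "smile"}, {"name": "outdoors"}]

result = tag_overlap(sample_a, sample_b)
print(result["shared"], result["jaccard"])
```

Matching on `name` alone ignores aliases, so two models that spell the same concept differently would count as disjoint here; folding in `aliases` is a natural refinement.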
## Data Structure
Each vocabulary file follows the same JSON structure:
```json
[
  {
    "aliases": ["alternative_names"],
    "category": 0,
    "count": 12345,
    "id": 123456,
    "name": "primary_tag_name",
    "words": [
      ["word", "breakdown"],
      ["alternative", "phrasing"]
    ]
  }
]
```
**Field Descriptions:**
- `aliases`: Alternative names/synonyms for the tag
- `category`: Numerical category (0-9) for semantic grouping
- `count`: Number of occurrences in training data
- `id`: Unique identifier for the tag
- `name`: Primary tag name
- `words`: Semantic word breakdowns for NLP use
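One common use of the `name` and `aliases` fields is normalizing free-form tag strings to their primary names. The sketch below assumes records shaped like the structure above; the helper `build_alias_index` and the sample records are hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def build_alias_index(vocab: List[dict]) -> Dict[str, str]:
    """Map every primary name and every alias to its primary tag name."""
    index: Dict[str, str] = {}
    for tag in vocab:
        primary = tag["name"]
        # A primary name always maps to itself, even if it also appears as an alias elsewhere
        index[primary] = primary
        for alias in tag.get("aliases", []):
            # Keep an existing mapping if the alias was already registered
            index.setdefault(alias, primary)
    return index


# Hypothetical sample records following the documented structure
sample_vocab = [
    {"aliases": ["longhair"], "category": 0, "count": 100, "id": 1,
     "name": "long_hair", "words": [["long", "hair"]]},
    {"aliases": [], "category": 0, "count": 50, "id": 2,
     "name": "smile", "words": [["smile"]]},
]

index = build_alias_index(sample_vocab)
print(index.get("longhair"))
```

Unknown strings are simply absent from the index, so callers can use `index.get(raw_tag)` and decide how to handle a `None` result.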
## Original Content
Vocabulary data for tagger models; it may be useful for some basic NLP calculations.
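As one example of such a basic calculation, the `words` field can be used to count how many tags contain a given word. This is a minimal sketch under the assumption that `words` is a list of token lists, as in the structure shown earlier; the function `word_tag_frequency` and the sample records are hypothetical.

```python
from collections import Counter
from typing import List


def word_tag_frequency(vocab: List[dict]) -> Counter:
    """Count, for each word, how many tags contain it in any breakdown."""
    freq: Counter = Counter()
    for tag in vocab:
        # Collect each word once per tag, across all alternative breakdowns
        words_in_tag = {
            word for breakdown in tag.get("words", []) for word in breakdown
        }
        freq.update(words_in_tag)
    return freq


# Hypothetical sample records; real ones would come from a loaded tags.json
sample_vocab = [
    {"name": "long_hair", "words": [["long", "hair"]]},
    {"name": "short_hair", "words": [["short", "hair"]]},
    {"name": "long_sleeves", "words": [["long", "sleeves"], ["long", "sleeve"]]},
]

freq = word_tag_frequency(sample_vocab)
print(freq["hair"], freq["long"])
```

Counting each word once per tag keeps alternative phrasings of the same tag from inflating the totals; weighting by the `count` field instead would give a usage-based frequency.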
## Citation
```bibtex
@misc{tagger_vocabs,
title = {Tagger Vocabularies Dataset},
author = {deepghs},
howpublished = {\url{https://huggingface.co/datasets/deepghs/tagger_vocabs}},
year = {2023},
note = {Comprehensive vocabulary datasets for image tagging models including DeepDanbooru, MLDanbooru, and Waifu Diffusion taggers},
abstract = {This repository provides a comprehensive collection of vocabulary datasets specifically designed for image tagging and annotation systems. The dataset contains structured tag vocabularies from multiple popular tagging models including DeepDanbooru, MLDanbooru, and various Waifu Diffusion tagger variants. Each vocabulary file contains detailed metadata for thousands of tags, organized with aliases, categories, usage counts, and word breakdowns to support robust natural language processing and computer vision applications. The dataset features meticulously structured JSON files that include essential tag information such as name, aliases, category classification, occurrence counts, and semantic word groupings.},
keywords = {vocabulary, tagging, image-annotation, nlp, danbooru}
}
```
Provider:
deepghs
Original information summary
Dataset license
- License type: OpenRAIL



