five

freedumb2000/anima-tagger-artifacts

收藏
Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/freedumb2000/anima-tagger-artifacts
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc0-1.0 task_categories: - text-retrieval - feature-extraction language: - en tags: - danbooru - anime - rag - tag-retrieval - anima size_categories: - 100K<n<1M --- # anima-tagger-artifacts Pre-built retrieval artefacts for the **Anima** tag format of the [`sd-webui-prompt-enhancer`](https://github.com/Gunther-Schulz/sd-webui-prompt-enhancer) Stable Diffusion WebUI extension. Lets the extension's Anima pipeline do real-time embedding-based tag validation and shortlist retrieval without users needing to rebuild a 270k+ entry FAISS index locally. ## Contents | File | Size | Description | |---|---|---| | `tags.sqlite` | ~30 MB | 273,025 Danbooru tags (name, category, post count, aliases, wiki). Post-count floor 10. | | `tags.faiss` | ~1.1 GB | FAISS FlatIP index of bge-m3 embeddings (1024-dim) for every tag. Artist and character embeddings include co-occurrence signatures (top-12 general tags from their actual Danbooru posts). | | `cooccurrence.sqlite` | ~10 MB | Pointwise-mutual-information table for character↔series, character↔artist, series↔character pairs. Enables automatic series-pairing (e.g. `hatsune_miku` → `vocaloid`) at query time. | | `VERSION` | <1 kB | JSON manifest with per-file sha256 + size + build date. | ## Usage Automatic — the extension's `install.py` downloads these on Forge startup, verifies sha256 against `VERSION`, and re-downloads when the upstream hash changes. Manual (e.g. for other projects): ```python from huggingface_hub import hf_hub_download for fname in ("tags.sqlite", "tags.faiss", "cooccurrence.sqlite", "VERSION"): hf_hub_download( repo_id="freedumb2000/anima-tagger-artifacts", filename=fname, repo_type="dataset", local_dir="./data", ) ``` ## How it was built 1. Tag metadata + wiki from [`NSFW-API/DanBooruTagsAndWikiDumpSept2025`](https://huggingface.co/datasets/NSFW-API/DanBooruTagsAndWikiDumpSept2025) (1.59M tags; filtered to post_count ≥ 10 → 273,025 tags). 2. Post-level tag sets from [`isek-ai/danbooru-tags-2024`](https://huggingface.co/datasets/isek-ai/danbooru-tags-2024) (streamed first 500k posts). 3. For each artist/character, compute top-12 co-occurring general tags across their posts → "style signature". 4. Format each tag as `"<name> (<category>) | aliases: ... | <wiki excerpt> | associated with: <top co-occurring tags>"` and embed with [`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3) (fp16 on GPU, normalized, 1024-dim). 5. FAISS FlatIP index over all 273k vectors. 6. PMI table over the same post sample for character↔series pairing. Full rebuild: ~10 minutes on a modern GPU. ## License The artefacts themselves are released under CC0-1.0. The upstream Danbooru tag data is public-domain by convention. Refer to the linked source datasets for their own attribution requirements. The underlying embedding model ([`BAAI/bge-m3`](https://huggingface.co/BAAI/bge-m3)) is MIT-licensed.
提供机构:
freedumb2000
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作