danbooru2023-metadata-database
Source: ModelScope community (updated 2025-12-26, indexed 2025-08-23)
Download link: https://modelscope.cn/datasets/KBlueLeaf/danbooru2023-metadata-database
# Metadata Database for Danbooru2023
Danbooru 2023 dataset: https://huggingface.co/datasets/nyanko7/danbooru2023
The latest entry in this database is id 7,866,491, which is newer than the latest entry in nyanko7's dataset.
This dataset contains a SQLite db file that holds the metadata for all tags and posts.<br>
The Peewee ORM config file is provided too; please check it for more information, especially on how posts and tags are linked together.
The original data is from the official dump of the posts info.<br>
Check this [link](https://console.cloud.google.com/storage/browser/danbooru_public/data) for more info.
## Format
This dataset provides 3 formats, all storing the same content:
* SQLite (.db)
  * Comes in 2 versions: with and without indexes.
* Parquet
  * Each Parquet file name matches the corresponding SQLite/DuckDB table name.
  * post.parquet is recommended when you need to export large amounts of content.
* DuckDB (.duckdb)
  * Comes in 2 versions: with and without indexes.

The `others` folder contains some pre-exported files, such as the tags for each post.
## Details
This section covers details you need to be aware of if you want to use another ORM or plain SQL queries with this database.
#### Custom Enum Fields
Some fields in Post/Tag use a custom enum field to store a type or category as an integer:
* Post.rating
  * 0: general
  * 1: sensitive
  * 2: questionable
  * 3: explicit
* Tag.type
  * 0: general
  * 1: artist
  * 2: character
  * 3: copyright
  * 4: meta
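
When reading the database with plain SQL or another ORM, the integer codes above have to be decoded by hand. The following is a minimal sketch of one way to do that with Python's standard `enum` module. The class and function names (`Rating`, `TagType`, `decode_rating`, `decode_tag_type`) are made up for illustration and are not taken from db.py; only the integer values come from the lists above.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Integer codes stored in Post.rating, as listed above."""
    GENERAL = 0
    SENSITIVE = 1
    QUESTIONABLE = 2
    EXPLICIT = 3


class TagType(IntEnum):
    """Integer codes stored in Tag.type, as listed above."""
    GENERAL = 0
    ARTIST = 1
    CHARACTER = 2
    COPYRIGHT = 3
    META = 4


def decode_rating(code: int) -> str:
    """Turn a stored rating code into its lowercase name."""
    return Rating(code).name.lower()


def decode_tag_type(code: int) -> str:
    """Turn a stored tag type code into its lowercase name."""
    return TagType(code).name.lower()
```

An unknown code raises `ValueError`, which makes unexpected values visible instead of silently passing them through.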
#### Tag List
I use Peewee's ManyToManyField to implement the tag list. It relies on a through model that stores every Tag-Post pair.<br>
Since querying posts by tag is a very common need, a many-to-many relation is the better fit.<br>
The downside of this design is that the database file is about 1.5x larger than before (there are about 0.25 billion post-tag pairs), but queries become 2-3x faster, so I think the trade-off is acceptable.
After some checking, I can confirm that every "categorical tag list" can be produced from the full tag list plus a filter, which is how it is implemented now. See db.py for more details.
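
For readers using plain SQL, the through-model idea and the "full list + filter" approach can be sketched with Python's built-in `sqlite3` module. This is only an illustration on a tiny in-memory database: the table and column names (`post`, `tag`, `post_tag`, `post_id`, `tag_id`) and the sample rows are assumptions, and the real names are defined in db.py and may differ.

```python
import sqlite3

# Hypothetical schema for illustration; check db.py for the real table names.
SCHEMA = """
CREATE TABLE post (id INTEGER PRIMARY KEY, rating INTEGER);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT, type INTEGER);
CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER);
CREATE INDEX post_tag_tag ON post_tag (tag_id);
CREATE INDEX post_tag_post ON post_tag (post_id);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO post VALUES (?, ?)", [(1, 0), (2, 1), (3, 0)])
    conn.executemany(
        "INSERT INTO tag VALUES (?, ?, ?)",
        [(10, "1girl", 0), (11, "some_artist", 1), (12, "some_character", 2)],
    )
    conn.executemany(
        "INSERT INTO post_tag VALUES (?, ?)",
        [(1, 10), (1, 11), (1, 12), (2, 10), (3, 12)],
    )
    return conn


def posts_with_tag(conn, tag_name):
    """Return ids of posts carrying the given tag, via the through table."""
    rows = conn.execute(
        """
        SELECT pt.post_id
        FROM post_tag AS pt
        JOIN tag AS t ON t.id = pt.tag_id
        WHERE t.name = ?
        ORDER BY pt.post_id
        """,
        (tag_name,),
    )
    return [row[0] for row in rows]


def tags_of_post(conn, post_id, tag_type=None):
    """Full tag list of a post, optionally filtered by Tag.type."""
    sql = """
        SELECT t.name
        FROM post_tag AS pt
        JOIN tag AS t ON t.id = pt.tag_id
        WHERE pt.post_id = ?
    """
    params = [post_id]
    if tag_type is not None:
        sql += " AND t.type = ?"
        params.append(tag_type)
    sql += " ORDER BY t.id"
    return [row[0] for row in conn.execute(sql, params)]
```

Calling `tags_of_post(conn, 1, tag_type=2)` on the demo data returns only the character tags of post 1, which is the "full list + filter" idea expressed as a single query.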
#### Utils
If the details above seem too complicated, just use db_utils.py and the Peewee API to work with this database.
I also provide write_csv.py for exporting the whole dataset to CSV for data analysis.
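
If you would rather not depend on Peewee, a table can also be dumped to CSV with the standard library alone. The sketch below is not the repository's write_csv.py; `export_table_to_csv` is a hypothetical helper, and the `post` table with its two columns is a made-up stand-in for the real schema.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out, batch_size=10_000):
    """Stream every row of `table` into `out` as CSV; return the row count."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    # First row: column names taken from the cursor metadata.
    writer.writerow([col[0] for col in cursor.description])
    count = 0
    while True:
        # Fetch in batches so large tables are never fully held in memory.
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        writer.writerows(rows)
        count += len(rows)
    return count


# Demo on an in-memory database with a hypothetical two-column post table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, rating INTEGER)")
conn.executemany("INSERT INTO post VALUES (?, ?)", [(1, 0), (2, 3)])
buffer = io.StringIO()
exported = export_table_to_csv(conn, "post", buffer)
```

For a real export, pass a file opened with `open(path, "w", newline="")` instead of the `io.StringIO` buffer.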
## License
The database files of this repo are licensed under the MIT License.<br>
The source code files of this repo are licensed under the Apache 2.0 License.
## Acknowledgement
Thanks to AngelBottomless for updating the database with new entries.
Provider: maas
Created: 2025-08-15



