
nyanko7/danbooru2023

Source: Hugging Face · Updated 2024-05-22 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/nyanko7/danbooru2023
Resource description:
---
license: mit
task_categories:
- image-classification
- image-to-image
- text-to-image
language:
- en
- ja
pretty_name: danbooru2023
size_categories:
- 1M<n<10M
viewer: false
---

<img src="https://huggingface.co/datasets/nyanko7/danbooru2023/resolve/main/cover.webp" alt="cover" width="750"/>

# Danbooru2023: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset

Danbooru2023 is a large-scale anime image dataset with over 5 million images contributed and annotated in detail by an enthusiast community. Image tags cover characters, scenes, copyrights, artists, and more, with an average of 30 tags per image.

Danbooru is a veteran anime image board with high-quality images and extensive tag metadata. The dataset can be used to train models for image classification, multi-label tagging, character detection, image generation, and other computer vision tasks.

- **Shared by:** Nyanko Devs
- **Language(s):** English, Japanese
- **License:** MIT

This dataset is built on top of [danbooru2021](https://gwern.net/danbooru2021). We expand the dataset to include images up to ID #6,857,737, adding over 1.8 million images; the total size is now approximately 8 terabytes (8,000 GB).

## Use

## Format

The goal of the dataset is to be as easy as possible to use immediately, avoiding obscure file formats, while allowing simultaneous research & seeding of the torrent, with easy updates.

Images are provided in their full original form (be that JPG, PNG, GIF or otherwise) for reference/archival purposes, and bucketed into 1000 subdirectories 0000–0999 (0-padded). The bucket is the Danbooru ID modulo 1000 (i.e. all images in `0999/` have an ID ending in '999'). IDs can be turned into paths by taking the ID modulo 1000 and zero-padding the result to four digits (e.g. in Bash, `BUCKET=$(printf "%04d" $(( ID % 1000 )) )`); the file is then at `{original,512px}/$BUCKET/$ID.$EXT`.
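The ID-to-path rule described above can be sketched in Python as follows. This is a minimal illustration, not part of the dataset's tooling: the function name `image_path` and its `variant` parameter are hypothetical, and the file extension has to be supplied by the caller (for example from the metadata), since the ID alone does not determine it.

```python
from pathlib import PurePosixPath


def image_path(image_id: int, ext: str, variant: str = "original") -> PurePosixPath:
    """Build the relative path of an image under the bucketed layout.

    `image_path` and `variant` are illustrative names, not part of the dataset.
    """
    if image_id < 0:
        raise ValueError("image_id must be non-negative")
    # Bucket = ID modulo 1000, zero-padded to four digits,
    # mirroring: printf "%04d" $(( ID % 1000 ))
    bucket = f"{image_id % 1000:04d}"
    # The ID itself is not zero-padded in the file name.
    return PurePosixPath(variant) / bucket / f"{image_id}.{ext.lstrip('.')}"


print(image_path(1525146, "jpg"))  # original/0146/1525146.jpg
```

The modulo is computed arithmetically rather than by slicing the last three digits of the ID string, so IDs below 1000 are handled without special cases.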
The reason for the bucketing is that a single directory would cause pathological filesystem performance, and modulo ID is a simple hash which spreads images evenly without requiring additional future directories to be made or a filesystem IO to check where the file is. The ID is not zero-padded and files end in the relevant extension, hence the file layout looks like this:

```bash
$ tree / | less

/
├── danbooru2023 -> /mnt/diffusionstorage/workspace/danbooru/
│   ├── metadata
│   ├── readme.md
│   ├── original
│   │   ├── 0000 -> data-0000.tar
│   │   ├── 0001 -> data-0001.tar
│   │   │   ├── 10001.jpg
│   │   │   ├── 210001.png
│   │   │   ├── 3120001.webp
│   │   │   ├── 6513001.jpg
│   │
│   ├── recent
│   │   ├── 0000 -> data-1000.tar
│   │   ├── 0001 -> data-1001.tar
│   │
│   ├── updates
│   │   ├── 20240319
│   │   │   ├── dataset-0.tar
│   │   │   ├── dataset-1.tar
│   │   │
│   │   ├── 2024xxxx
│   │   │   ├── dataset-0.tar
│   │   │   ├── dataset-1.tar
```

Here, `data-{1000..1999}.tar` refers to the recent update files (expected to be updated every few months) and `updates` refers to fast patches (expected to be updated every few days to few weeks).

Currently represented file extensions are: avi/bmp/gif/html/jpeg/jpg/mp3/mp4/mpg/pdf/png/rar/swf/webm/wmv/zip.

Raw original files are treacherous. Be careful when working with the original dataset. There are many odd files: truncated files, non-sRGB colorspaces, wrong file extensions (e.g. some PNGs have .jpg extensions, like `original/0146/1525146.jpg` or `original/0558/1422558.jpg`), etc.
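Because some files carry the wrong extension, one defensive option is to check a file's leading bytes rather than trusting its name. The sketch below is an assumption about how a consumer might do this, not something the dataset ships: `sniff_format` and `extension_mismatch` are hypothetical helper names, and only a handful of common image signatures are covered.

```python
from pathlib import Path
from typing import Optional

# Well-known leading byte signatures for a few image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"BM": "bmp",
}


def sniff_format(header: bytes) -> Optional[str]:
    """Guess the image format from the first bytes of a file, or return None."""
    for magic, fmt in SIGNATURES.items():
        if header.startswith(magic):
            return fmt
    # WebP is a RIFF container with the tag 'WEBP' at byte offset 8.
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None


def extension_mismatch(path: Path) -> bool:
    """Return True if the detected format disagrees with the file extension."""
    with path.open("rb") as fh:
        detected = sniff_format(fh.read(16))
    claimed = path.suffix.lower().lstrip(".")
    if claimed == "jpeg":
        claimed = "jpg"  # treat .jpeg and .jpg as the same format
    # Unknown formats (videos, archives, ...) are not flagged by this sketch.
    return detected is not None and detected != claimed
```

This only detects mislabeled extensions. Truncated files and unusual colorspaces need a real decoder to catch, for example by fully loading each image with an imaging library before use.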
Provider:
nyanko7
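Summary of original information

Danbooru2023 dataset overview

Dataset size

  • Number of images: over 5 million

Image characteristics

  • Tag detail: an average of 30 tags per image
  • Tag content: covers characters, scenes, copyrights, artists, and other aspects

Community contribution

  • Contributors: contributed and annotated in detail by an enthusiast community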