danbooru2023-florence2-caption
Source: ModelScope community. Updated 2026-01-07, indexed 2024-08-31.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/danbooru2023-florence2-caption
Description:
# Danbooru2023 - Florence2 Caption dataset
This dataset contains captions for danbooru2023 images, generated by microsoft/Florence-2-large. <br>
The captions come from the original model, prompted with the `<MORE_DETAILED_CAPTION>` task token.
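To make the captioning step concrete, here is a minimal sketch of how one image could be captioned with the same model and task token. It follows the usage pattern published on the microsoft/Florence-2-large model card; the function name `caption_image`, the beam size, and the token limit are assumptions, not the settings used to build this dataset, and the remote code may need adjusting for your `transformers` version.

```python
MODEL_ID = "microsoft/Florence-2-large"
TASK = "<MORE_DETAILED_CAPTION>"


def caption_image(image_path, task=TASK, model_id=MODEL_ID, max_new_tokens=1024):
    """Caption one image with Florence-2 (hypothetical helper, not the author's script)."""
    # Heavy dependencies are imported lazily so this module can be imported without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # Loading the model on every call is wasteful; a real pipeline would load it once.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,  # assumed limit
        num_beams=3,                    # assumed beam size
        do_sample=False,
    )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # post_process_generation returns a dict keyed by the task token.
    parsed = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]
```

Passing `task="<DETAILED_CAPTION>"` would request the shorter caption style instead.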
## Format
parquet:
* key: the danbooru id of the image
* parsed: the parsed Florence-2 output for the image
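The two columns above can be turned into an id-to-caption lookup. The sketch below is illustrative only: the helper names are hypothetical, the sample rows are made up, and it assumes `parsed` holds either a plain string or a one-entry mapping keyed by the task token, since the card does not say which.

```python
from typing import Any, Dict, Iterable, List, Mapping


def caption_text(parsed: Any) -> str:
    """Return the caption string from a `parsed` value.

    Assumption: `parsed` is a plain string or a one-entry mapping such as
    {"<MORE_DETAILED_CAPTION>": "..."}.
    """
    if isinstance(parsed, Mapping):
        # Take the first (presumably only) value of the mapping.
        return str(next(iter(parsed.values())))
    return str(parsed)


def build_lookup(rows: Iterable[Mapping[str, Any]]) -> Dict[int, str]:
    """Map each danbooru id (`key`) to its caption text."""
    return {int(row["key"]): caption_text(row["parsed"]) for row in rows}


def load_rows(path: str) -> List[dict]:
    """Read one parquet file into row dicts (needs pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    return pd.read_parquet(path, columns=["key", "parsed"]).to_dict("records")


# Stand-in rows with the documented columns; ids and captions are invented.
sample_rows = [
    {"key": 1, "parsed": "The image shows a girl with long hair."},
    {"key": 2, "parsed": {"<MORE_DETAILED_CAPTION>": "The image is an illustration of a cat."}},
]
lookup = build_lookup(sample_rows)
print(lookup[2])
```

With a downloaded file, `build_lookup(load_rows(path))` would produce the same kind of dictionary, where `path` is whichever parquet file you fetched.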
## Statistics
### MORE_DETAILED_CAPTION
* Entries: 7,438,449
* Output Tokens (Min/Max/Mean/Median):
  * Flan T5 Tokenizer: 19/736/120/114
  * DFN CLIP Tokenizer: 19/826/108.7/103
  * Qwen2 Tokenizer: 17/883/106.8/101
* Output Format:
  * "The image shows ...": 690,027
  * "The image is ... of ...": 6,665,897
  * others: 82,525
* Time Cost: about 7 to 10 days on 4x NVIDIA RTX 3090
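The Min/Max/Mean/Median figures above are per-caption token counts. As a rough sketch of how such a summary could be computed, the code below uses a whitespace split as a placeholder tokenizer; that is an assumption, so its numbers will not match the table. Any callable that maps a string to a list of tokens (for example a Flan T5 or Qwen2 tokenizer) could be passed in its place.

```python
import statistics
from typing import Callable, Iterable, List, Tuple


def whitespace_tokenize(text: str) -> List[str]:
    """Placeholder tokenizer (an assumption); swap in a real one to compare with the table."""
    return text.split()


def token_stats(
    captions: Iterable[str],
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Tuple[int, int, float, float]:
    """Return (min, max, mean, median) of per-caption token counts."""
    counts = [len(tokenize(caption)) for caption in captions]
    return min(counts), max(counts), statistics.mean(counts), statistics.median(counts)


# Made-up captions, only to exercise the function.
captions = [
    "The image shows a girl.",
    "The image is an illustration of a cat sitting on a chair.",
    "This is an anime-style drawing.",
]
lo, hi, mean, median = token_stats(captions)
print(f"{lo}/{hi}/{mean:.2f}/{median}")
```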
### DETAILED_CAPTION
* Entries: 7,439,002
* Output Tokens (Min/Max/Mean/Median):
  * Flan T5 Tokenizer: 10/649/56.67/55
  * DFN CLIP Tokenizer: 10/742/51.06/49
  * Qwen2 Tokenizer: 8/871/49.47/48
* Output Format:
  * "The image shows ...": 5,739,496
  * "This is an ...": 1,634,386
  * others: 65,120
* Time Cost: about 4 to 5 days on 4x NVIDIA RTX 3090
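The "Output Format" counts in both sections group captions by how they open. One way to reproduce that kind of tally is with a few anchored regular expressions, as sketched below. The patterns are a guess at the grouping rule, not the author's actual criteria, and the sample captions are invented.

```python
import re
from collections import Counter
from typing import Iterable

# Ordered (label, pattern) pairs; the first match wins. These rules are assumptions.
PATTERNS = [
    ("The image shows ...", re.compile(r"^The image shows\b")),
    ("The image is ... of ...", re.compile(r"^The image is\b.*?\bof\b")),
    ("This is an ...", re.compile(r"^This is an\b")),
]


def classify(caption: str) -> str:
    """Return the opening-pattern label for a caption, or "others"."""
    text = caption.strip()
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "others"


def format_counts(captions: Iterable[str]) -> Counter:
    """Count how many captions fall under each opening pattern."""
    return Counter(classify(caption) for caption in captions)


samples = [
    "The image shows a girl with blue hair.",
    "The image is an illustration of a girl in a school uniform.",
    "This is an animated image of two characters.",
    "A girl stands in a field.",
    "The image is dark.",
]
counts = format_counts(samples)
print(counts)
```

The last sample falls under "others" because it has no "of" after "The image is", which shows how sensitive the tally is to the exact rule chosen.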
### Graphs
Distribution of token counts:

## License
This dataset and the provided source code are licensed under the Apache License 2.0.
Provider: maas
Created: 2024-07-10



