panopstor/nvflickritw-cogvlm-captions

Name: panopstor/nvflickritw-cogvlm-captions
Creator: panopstor
Published: 2024-02-29 20:29:52
License: cc0-1.0 (captions only; see the description below)

Hugging Face. Updated 2024-02-29; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/panopstor/nvflickritw-cogvlm-captions

Description:

---
license: cc0-1.0
---

This dataset provides captions only for 45k images from the Nvidia Flickr "In the wild" dataset (https://github.com/NVlabs/ffhq-dataset).

The captions are provided under the CC0 license, as I believe model outputs for all captioning models used do not fall under the models' licenses. Check the Nvidia Flickr dataset URL for information on use restrictions and copyright for the images in the dataset itself.

Captions are in .txt files with the same basename as the associated image.

Created using the CogVLM chat model (https://huggingface.co/THUDM/cogvl). CogVLM captions were run on an RTX 6000 Ada and took a few days, as each caption takes 5-8 seconds.

Script to run: `https://github.com/victorchall/EveryDream2trainer/blob/main/caption_cog.py`

Command used:

```
python caption_cog.py --image_dir /mnt/q/mldata/nvidia-flickr-itw --num_beams 3 --top_k 45 --top_p 0.9 --temp 0.95 --prompt "Write a concise, accurate, blunt, and detailed description. Avoid euphemisms, vague wording, or ambiguous expressions. Do not exceed 21 words."
```

Captions from blip1 beam, blip1 nucleus, and blip2 6.7b (default) are also provided. See https://github.com/salesforce/LAVIS for information on BLIP and BLIP2. The BLIP 1/2 captions were run quite a while ago, and to be honest I don't recall the full details.

Raw .txt files are provided in zip files chunked by 1000 images each, for use with img/txt pair file-based dataloaders or for shoving into a webdataset tar. These correspond to the original dataset, which is provided as images only, named `[00000..44999].png`. The Parquet file should be obvious from there, and you can integrate or transform it as needed.

Provided by:

panopstor

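Summary of the original information

Dataset overview

Dataset contents

The dataset contains captions for 45,000 images from the Nvidia Flickr "In the wild" dataset.
Captions are provided as .txt files, each with the same basename as its associated image.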
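
Because each caption shares its basename with its image, pairing the two comes down to a filename match. The following is a minimal sketch, assuming the images and captions have already been extracted into two local directories; the `pair_images_with_captions` helper and the directory layout are hypothetical and not part of the dataset.

```python
from pathlib import Path


def pair_images_with_captions(image_dir, caption_dir):
    """Return (image_path, caption_text) pairs matched by file basename.

    Hypothetical helper: assumes images are .png files and captions are
    UTF-8 .txt files with the same basename, e.g. 00000.png / 00000.txt.
    """
    image_dir, caption_dir = Path(image_dir), Path(caption_dir)
    pairs = []
    for image_path in sorted(image_dir.glob("*.png")):
        caption_path = caption_dir / (image_path.stem + ".txt")
        if caption_path.is_file():  # skip images that have no caption file
            text = caption_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path, text))
    return pairs
```

Images without a matching .txt file are skipped rather than raising an error, which is one reasonable choice when only part of the dataset has been downloaded.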

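Dataset generation

The captions were generated with the CogVLM chat model.
Generation ran on an RTX 6000 Ada GPU; each caption took 5-8 seconds, and the whole run took a few days.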

Generation script and command

The generation script is at: https://github.com/victorchall/EveryDream2trainer/blob/main/caption_cog.py
Command used: `python caption_cog.py --image_dir /mnt/q/mldata/nvidia-flickr-itw --num_beams 3 --top_k 45 --top_p 0.9 --temp 0.95 --prompt "Write a concise, accurate, blunt, and detailed description. Avoid euphemisms, vague wording, or ambiguous expressions. Do not exceed 21 words."`

Other captions

Captions from blip1 beam, blip1 nucleus, and blip2 6.7b (default) are also provided.
Information on BLIP and BLIP2 is available at https://github.com/salesforce/LAVIS.

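File format

The raw .txt files are provided as zip archives, chunked by 1000 images each, for use with img/txt pair file-based dataloaders or for packing into webdataset tar files.
The corresponding original dataset contains images only, named [00000..44999].png.
The Parquet file can be integrated or transformed as needed.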
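
The zip chunks can also be read without extracting them to disk. The following is a rough sketch using only the Python standard library. It assumes the archives sit together in one local directory and that each member .txt file is UTF-8 text named after its image; the `load_captions_from_zips` helper is hypothetical, and the archive names are not assumed (every *.zip in the directory is read).

```python
import zipfile
from pathlib import Path


def load_captions_from_zips(zip_dir):
    """Map image basename (e.g. "00000") to caption text.

    Assumption: zip_dir holds the downloaded caption archives, each
    containing .txt files named after the corresponding image.
    """
    captions = {}
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                if member.endswith(".txt"):
                    stem = Path(member).stem  # drops folders and extension
                    captions[stem] = archive.read(member).decode("utf-8").strip()
    return captions
```

The resulting dictionary can then be joined against the `[00000..44999].png` image names, or written out in whatever format a given dataloader expects.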
