pixelprose
ModelScope community, updated 2026-05-13, indexed 2024-11-23
Download link:
https://modelscope.cn/datasets/swift/pixelprose
Resource overview:
# From Pixels to Prose: A Large Dataset of Dense Image Captions
[[ **arXiv paper** ](https://arxiv.org/abs/2406.10328)] | [[ 🌮 **image tars** ](https://huggingface.co/datasets/pixelprose/pixelprose-shards)]
**PixelProse** is a dataset of over **16M (million)** synthetically generated captions,
produced with a vision-language model ([Gemini 1.0 Pro Vision](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-pro-vision)) to give detailed and accurate image descriptions.
## 1. Details
Total number of image-caption pairs: 16,896,214 (16.9M)
- 6,538,898 (6.5M) pairs in the split of [CommonPool](https://www.datacomp.ai)
- 9,066,455 (9.1M) pairs in the split of [CC12M](https://github.com/google-research-datasets/conceptual-12m)
- 1,290,861 (1.3M) pairs in the split of [RedCaps](https://redcaps.xyz)
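As a quick sanity check, the three per-split counts above add up to the stated total. A minimal sketch (the name `SPLIT_COUNTS` is only illustrative, not part of the dataset):

```python
# Pair counts per split, copied from the list above.
SPLIT_COUNTS = {
    "commonpool": 6_538_898,
    "cc12m": 9_066_455,
    "redcaps": 1_290_861,
}

total = sum(SPLIT_COUNTS.values())
print(f"total pairs: {total:,}")  # total pairs: 16,896,214

# Share of each split, as a percentage of the total.
for name, count in SPLIT_COUNTS.items():
    print(f"{name:>10}: {count:>10,} ({100 * count / total:.1f}%)")
```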
## 2. Download Parquet Files
The first step is to download the parquet files, which contain image URLs, captions, and other variables (see the Dataset Viewer in this repo).
There are three ways to download the parquet files:
#### via Git LFS
```bash
# make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
# w/ HTTPS
git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose
# w/ SSH
git clone git@hf.co:datasets/tomg-group-umd/pixelprose
```
#### via Hugging Face API
```python
from datasets import load_dataset
# download the whole dataset
ds = load_dataset("tomg-group-umd/pixelprose")
# download a specific split
ds_common_pool = load_dataset("tomg-group-umd/pixelprose", split="commonpool")
ds_cc12m = load_dataset("tomg-group-umd/pixelprose", split="cc12m")
ds_redcaps = load_dataset("tomg-group-umd/pixelprose", split="redcaps")
```
The Parquet files are stored in the Hugging Face cache directory, which is located by default at `~/.cache/huggingface/datasets`.
More information can be found in the [cache management](https://huggingface.co/docs/datasets/en/cache) documentation.
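To see where the files would land on a given machine, the cache location can be resolved from the environment. This is a simplified sketch, not the library's own logic: it assumes the precedence `HF_DATASETS_CACHE`, then `HF_HOME/datasets`, then the default path mentioned above, and the helper name `resolve_datasets_cache` is hypothetical.

```python
import os
from pathlib import Path


def resolve_datasets_cache(env=None):
    """Return the directory where `datasets` is expected to cache files.

    Assumed precedence: HF_DATASETS_CACHE, then HF_HOME/datasets,
    then the default ~/.cache/huggingface/datasets.
    """
    env = os.environ if env is None else env
    if env.get("HF_DATASETS_CACHE"):
        return Path(env["HF_DATASETS_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "datasets"
    return Path.home() / ".cache" / "huggingface" / "datasets"


print(resolve_datasets_cache())
```

If the default disk is too small for the dataset, `load_dataset` also accepts a `cache_dir` argument to place the files elsewhere.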
#### via Direct Link
Please navigate to the [data](https://huggingface.co/datasets/tomg-group-umd/pixelprose/tree/main/data) directory and click the required parquet file to download.
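The same files can likely be fetched programmatically instead of clicking through the web page. The sketch below assumes the `huggingface_hub` package is installed and that the parquet files sit under `data/` as the link above suggests; the helper names `select_parquet_files` and `download_parquet_files` are hypothetical.

```python
def select_parquet_files(repo_files, prefix="data/"):
    """Keep only the Parquet files stored under `prefix`."""
    return sorted(
        f for f in repo_files if f.startswith(prefix) and f.endswith(".parquet")
    )


def download_parquet_files(repo_id="tomg-group-umd/pixelprose", limit=1):
    """Download up to `limit` Parquet files and return their local paths."""
    # Imported lazily so the helper above works without the dependency.
    from huggingface_hub import hf_hub_download, list_repo_files

    files = select_parquet_files(list_repo_files(repo_id, repo_type="dataset"))
    return [
        hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset")
        for name in files[:limit]
    ]
```

Calling `download_parquet_files(limit=1)` fetches a single file, which is a cheap way to inspect the schema before committing to the full download.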
## 3. Download Images
The second step is to download images using the parquet files. An optional tool for this is [img2dataset](https://github.com/rom1504/img2dataset/tree/main).
Alternatively, you can get the images directly from [[ 🌮 **image tars** ](https://huggingface.co/datasets/pixelprose/pixelprose-shards)].
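For the img2dataset route, a call might look like the sketch below. The argument names follow img2dataset's documented `download` function, but the values (process and thread counts, image size, extra columns) are illustrative assumptions, and the helper names are hypothetical. Note that resizing re-encodes the images.

```python
def build_img2dataset_kwargs(parquet_path, output_folder):
    """Assemble keyword arguments for `img2dataset.download`."""
    return {
        "url_list": parquet_path,        # a Parquet file or a folder of them
        "input_format": "parquet",
        "url_col": "url",                # PixelProse column with the image URL
        "caption_col": "vlm_caption",    # PixelProse dense caption
        "save_additional_columns": ["uid", "original_caption"],
        "output_format": "webdataset",   # writes .tar shards
        "output_folder": output_folder,
        "processes_count": 4,            # illustrative; tune for your machine
        "thread_count": 32,              # illustrative
        "image_size": 512,               # illustrative resize target
    }


def download_images(parquet_path, output_folder):
    """Download the images referenced by one Parquet file."""
    # Imported lazily; requires `pip install img2dataset`.
    from img2dataset import download

    download(**build_img2dataset_kwargs(parquet_path, output_folder))
```

Since image URLs on the web expire over time, some downloads will fail; the prebuilt image tars avoid that problem.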
## 4. Variables
PixelProse provides the following columns:
- `uid`: unique identifier for the image
- `url`: URL of the image
- `key`: key associated with the image
- `status`: status returned from the `vlm_model`
- `original_caption`: caption inherited from the source
- `vlm_model`: model used for captioning the image
- `vlm_caption`: PixelProse's dense caption
- `toxicity`: score for general toxic behavior or language
- `severe_toxicity`: score for extremely harmful and abusive language
- `obscene`: score for use of obscene or inappropriate language
- `identity_attack`: score for language targeting individuals or groups based on identity
- `insult`: score for language intended to insult or demean
- `threat`: score for language conveying threats of harm
- `sexual_explicit`: score for language with sexually explicit content
- `watermark_class_id`: watermark classification (`0` = image with watermark, `1` = image without watermark, `2` = image without watermark but with text)
- `watermark_class_score`: prediction score for each watermark class, in the range `[0, 1]`
- `aesthetic_score`: aesthetic score, in the range `[0, 10]`
- `error_message`: error message returned from the `vlm_model`
- `width`, `height`: size of the image downloaded and used for running the `vlm_model`
- `original_width`, `original_height`: original size of the image
- `exif`: EXIF information of the image file
- `sha256`: SHA256 hash of the image file
- `image_id`, `author`, `subreddit`, `score`: attributes inherited from RedCaps, unavailable in CC12M and CommonPool
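These columns make it straightforward to filter the dataset before training. The sketch below is one possible filter, not a recommendation from the authors: the thresholds `max_toxicity=0.2` and `min_aesthetic=5.0` and the helper `keep_row` are assumptions chosen for illustration, and missing values are treated as `0.0`.

```python
# Meaning of `watermark_class_id`, as listed above.
WATERMARK_LABELS = {
    0: "watermark",
    1: "no watermark",
    2: "no watermark, contains text",
}

TOXICITY_COLUMNS = (
    "toxicity", "severe_toxicity", "obscene", "identity_attack",
    "insult", "threat", "sexual_explicit",
)


def keep_row(row, max_toxicity=0.2, min_aesthetic=5.0):
    """Return True if a row passes the illustrative quality filters."""
    if not row.get("vlm_caption"):
        return False  # no dense caption available
    if row.get("watermark_class_id") == 0:
        return False  # class 0 = image with watermark
    if any((row.get(col) or 0.0) > max_toxicity for col in TOXICITY_COLUMNS):
        return False  # at least one toxicity score is too high
    return (row.get("aesthetic_score") or 0.0) >= min_aesthetic
```

With the `datasets` library, a loaded split can then be filtered with `ds.filter(keep_row)`, since `filter` passes each example to the function as a dictionary.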
## 5. Contact
If you have any questions about PixelProse, please open a discussion.
Contributions via pull requests are also welcome.
Provider:
maas
Created:
2024-06-27