Project-Ground-Zero/pixelvision-670k-caption
Source: Hugging Face. Updated 2026-02-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Project-Ground-Zero/pixelvision-670k-caption
Description:
---
license: apache-2.0
task_categories:
- text-to-image
- image-text-to-text
language:
- en
size_categories:
- 1M<n<10M
---
# Pixilart Captioned Parquet Dataset
This folder contains the Pixilart caption dataset.
It includes structured metadata and annotation outputs, but does not include image binaries.
Captions were generated with Gemini 2.5 Flash Lite and Gemini 3 Flash.
Special thanks to @xiaoqianWX for providing API keys and credits!
## Files
- `pixilart_full_publish.parquet`
- `pixilart_top10k_publish.parquet`
- `manifest.json`
Generated at (UTC): `2026-02-23T20:28:39+00:00`
## Dataset Stats
- Full split (`pixilart_full_publish.parquet`)
  - Rows: `564,819`
  - `has_error=true`: `734`
  - `is_rejected=true`: `734`
  - `metadata_missing=true`: `0`
- Top10k split (`pixilart_top10k_publish.parquet`)
  - Rows: `10,000`
  - `has_error=true`: `650`
  - `is_rejected=true`: `1`
  - `metadata_missing=true`: `0`
  - Note: build input had 20,000 rows and was deduplicated by `source_tar + source_stem`.
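The counts above can be re-derived from the Parquet files. The following is a minimal sketch, assuming the rows have been materialized as Python dicts (for example with pyarrow's `Table.to_pylist()`). The helper names and the sample rows are illustrative, and keeping the first occurrence per key is an assumption, since this card does not say which duplicate the build kept.

```python
from collections import Counter

FLAG_COLUMNS = ("has_error", "is_rejected", "metadata_missing")


def dedupe_rows(rows):
    """Keep the first row seen for each (source_tar, source_stem) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["source_tar"], row["source_stem"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def flag_counts(rows):
    """Count the rows and how many of them have each boolean flag set."""
    counts = Counter()
    for row in rows:
        for flag in FLAG_COLUMNS:
            if row.get(flag):
                counts[flag] += 1
    return {"rows": len(rows), **{flag: counts[flag] for flag in FLAG_COLUMNS}}


# Made-up rows for illustration only; real rows come from the Parquet files.
sample_rows = [
    {"source_tar": "a.tar", "source_stem": "0001",
     "has_error": False, "is_rejected": False, "metadata_missing": False},
    {"source_tar": "a.tar", "source_stem": "0001",
     "has_error": False, "is_rejected": False, "metadata_missing": False},
    {"source_tar": "a.tar", "source_stem": "0002",
     "has_error": True, "is_rejected": True, "metadata_missing": False},
]

unique_rows = dedupe_rows(sample_rows)
stats = flag_counts(unique_rows)
print(stats)
```

Running `flag_counts` on each published split should reproduce the figures listed here if the files match this card.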
## Field Definitions
### Annotation and Tracking Fields
- `id`: Original id (nullable)
- `source_tar`: Source tar relative path
- `source_stem`: Sample stem key (join key to source metadata in tar)
- `image_file`: Image filename
- `caption`: VLM annotation text
- `error`: Failure/error message
- `model`: Annotation model name
- `annotated_at`: Annotation timestamp (ISO-8601 string)
- `has_error`: Whether `error` is non-empty
- `is_rejected`: Whether this sample is classified as rejected content
- `rejection_reason`: Rejection category (`content_policy` or null)
- `metadata_missing`: Whether source metadata join failed
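The field definitions above imply a few invariants that can be checked per row. This is a minimal sketch, assuming rows are plain Python dicts; `check_row` and the sample values (including the error message) are hypothetical and only illustrate the documented rules.

```python
from datetime import datetime

# Documented values for rejection_reason: `content_policy` or null.
ALLOWED_REJECTION_REASONS = {None, "content_policy"}


def check_row(row):
    """Return a list of human-readable problems found in one row."""
    problems = []

    # has_error is documented as "whether error is non-empty".
    if bool(row.get("has_error")) != bool(row.get("error")):
        problems.append("has_error does not match error")

    if row.get("rejection_reason") not in ALLOWED_REJECTION_REASONS:
        problems.append("unexpected rejection_reason")

    # annotated_at is documented as an ISO-8601 string. Note that
    # datetime.fromisoformat accepts only a subset of ISO-8601 before Python 3.11.
    annotated_at = row.get("annotated_at")
    if annotated_at is not None:
        try:
            datetime.fromisoformat(annotated_at)
        except ValueError:
            problems.append("annotated_at is not ISO-8601")

    return problems


# Made-up rows for illustration only.
good_row = {"error": "", "has_error": False, "rejection_reason": None,
            "annotated_at": "2026-02-23T20:28:39+00:00"}
bad_row = {"error": "timeout", "has_error": False, "rejection_reason": "spam",
           "annotated_at": "yesterday"}

print(check_row(good_row))
print(check_row(bad_row))
```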
### Compatibility Fields (Names Kept As-Is)
For compatibility with downstream parsers, these two column names are kept unchanged. In the Pixilart pipeline they carry the following meanings:
- `tag_string_general`: Description hint text (from the metadata description and related fields)
- `tag_string_character`: Original filename hint (prefers the original name and falls back to the current filename)
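If the legacy names are confusing in your own code, you can alias them after loading. The sketch below assumes rows are plain Python dicts; the alias names `description_hint` and `original_filename_hint` are hypothetical and are not part of the dataset.

```python
# Hypothetical, more descriptive aliases; these names are not dataset columns.
COMPAT_ALIASES = {
    "tag_string_general": "description_hint",
    "tag_string_character": "original_filename_hint",
}


def with_aliases(row):
    """Return a copy of the row with the two compatibility columns renamed."""
    return {COMPAT_ALIASES.get(key, key): value for key, value in row.items()}


# Made-up row for illustration only.
sample = {
    "caption": "A small pixel-art cat on a blue background.",
    "tag_string_general": "my first cat drawing",
    "tag_string_character": "cat_final.png",
}
renamed = with_aliases(sample)
print(sorted(renamed))
```

Renaming only at the edge of your own code keeps the Parquet files readable by parsers that expect the original column names.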
### Source Metadata Fields
- `metadata_json`: Full raw metadata JSON string
- Metadata is also expanded into `meta_*` columns for direct SQL/DataFrame usage:
  - `meta_subset`, `meta_sequence`, `meta_subset_sequence_element`
  - `meta_title`, `meta_description`
  - `meta_views`, `meta_filename`, `meta_pixel_size`
  - `meta_has_watermark`, `meta_image_hash`
  - `meta_image_url`, `meta_full_image_url`
  - `meta_likes_count`, `meta_comments_count`
  - `meta_width`, `meta_height`, `meta_date_created`
  - `meta_content_warning`, `meta_warning`, `meta_liked`
  - `meta_source_type`, `meta_source_id`, `meta_art_id`, `meta_unqid`
  - `meta_created_at`, `meta_updated_at`
  - `meta_user_id`, `meta_username`, `meta_is_gif`
  - `meta_image_filename`, `meta_image_path`
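To illustrate how `metadata_json` relates to the `meta_*` columns, here is a minimal sketch. It assumes the raw metadata is a JSON object and that each `meta_*` column mirrors a top-level key with the `meta_` prefix and the same value type; neither detail is confirmed by this card, and the helper names and sample values are hypothetical.

```python
import json


def expand_metadata(metadata_json, prefix="meta_"):
    """Flatten the top-level keys of a metadata JSON string into prefixed columns."""
    if not metadata_json:
        return {}
    metadata = json.loads(metadata_json)
    return {f"{prefix}{key}": value for key, value in metadata.items()}


def mismatched_columns(row):
    """List meta_* columns whose value differs from the raw JSON (assumed mapping)."""
    expanded = expand_metadata(row.get("metadata_json"))
    return sorted(
        column
        for column, value in expanded.items()
        if column in row and row[column] != value
    )


# Made-up row for illustration only; meta_height deliberately disagrees.
sample = {
    "metadata_json": '{"title": "Sunset", "width": 64, "height": 64, "is_gif": false}',
    "meta_title": "Sunset",
    "meta_width": 64,
    "meta_height": 32,
}

print(expand_metadata(sample["metadata_json"]))
print(mismatched_columns(sample))
```

In most cases the expanded columns are the more convenient interface; parsing `metadata_json` is mainly useful for keys that were not expanded.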
## Image Retrieval
This release package does not include image binaries (size and licensing constraints).
To fetch images yourself, use:
- `meta_image_url` or `meta_full_image_url`
- plus source identifiers such as `meta_source_id` and `meta_art_id` if needed
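The retrieval steps above can be sketched as follows, using only the standard library. Preferring `meta_full_image_url` over `meta_image_url`, naming files by `meta_art_id`, and defaulting to a `.png` extension are all assumptions made for illustration; the function names and the sample row are hypothetical.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def pick_image_url(row):
    """Prefer the full-size URL and fall back to the regular one (assumed preference)."""
    return row.get("meta_full_image_url") or row.get("meta_image_url") or None


def target_path(row, out_dir="images"):
    """Build a local path from meta_art_id and the file extension found in the URL."""
    url = pick_image_url(row)
    if url is None:
        return None
    suffix = Path(urlparse(url).path).suffix or ".png"  # assumed default extension
    return Path(out_dir) / f"{row['meta_art_id']}{suffix}"


def fetch_image(row, out_dir="images", timeout=30):
    """Download one image; return the written path, or None if the row has no URL."""
    url = pick_image_url(row)
    dest = target_path(row, out_dir)
    if url is None or dest is None:
        return None
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        dest.write_bytes(response.read())
    return dest


# Made-up row for illustration only; fetch_image is not called here
# because the URL is a placeholder.
sample = {
    "meta_art_id": 12345,
    "meta_image_url": "https://example.com/img/12345_small.png",
    "meta_full_image_url": None,
}
print(pick_image_url(sample))
print(target_path(sample))
```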
## Minimal Usage Example
```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table("release/pixilart-parquet/pixilart_full_publish.parquet")

# Keep only successful, non-rejected rows
ok = pc.and_(
    pc.invert(table["has_error"]),
    pc.invert(table["is_rejected"]),
)
clean = table.filter(ok)

print("all rows:", table.num_rows)
print("clean rows:", clean.num_rows)
```
Provider:
Project-Ground-Zero



