emanuelevivoli/comix-v0_1-pages-tiny
Source: Hugging Face · Updated: 2025-11-19 · Indexed: 2025-12-20
Download link: https://hf-mirror.com/datasets/emanuelevivoli/comix-v0_1-pages-tiny
Description:
---
license: cc0-1.0
task_categories:
- image-to-text
- object-detection
- image-segmentation
tags:
- comics
- computer-vision
- panel-detection
- dcm-22k
size_categories:
- 1K<n<10K
---
# Comic Books Dataset v0.1 - Pages
**Full dataset of comic book pages from Digital Comic Museum.**
This is the PRODUCTION dataset. For testing, use `comix_v0_tiny_pages`.
## Dataset Description
- **Total Pages**: 3521
- **Source**: Digital Comic Museum (DCM-22K)
- **Format**: WebDataset (tar files)
- **License**: Public Domain (CC0-1.0)
- **Version**: 0.1 (Group-based processing)
## What's Included
Each page has:
- `{page_id}.jpg` - Page image
- `{page_id}.json` - Metadata (detections, captions, page class)
- `{page_id}.seg.npz` - Segmentation masks (SAMv2) [when available]
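The naming scheme above follows the usual WebDataset convention, where everything before the first dot of a file name is the sample key and the rest is the field name. As a minimal sketch of how the files of one page can be regrouped from a shard, the helper below uses only the standard library. The function name `group_members_by_page` is hypothetical, and the sketch assumes the shard is a plain, uncompressed tar file.

```python
import io
import tarfile
from collections import defaultdict


def group_members_by_page(fileobj):
    """Group tar members into {page_id: {field_name: raw_bytes}}.

    Assumes an uncompressed tar whose members are named
    `{page_id}.jpg`, `{page_id}.json`, and optionally `{page_id}.seg.npz`.
    """
    pages = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Drop any directory prefix, then split at the FIRST dot so that
            # "c00004_p006.seg.npz" becomes ("c00004_p006", "seg.npz").
            basename = member.name.rsplit("/", 1)[-1]
            page_id, _, field_name = basename.partition(".")
            pages[page_id][field_name] = tar.extractfile(member).read()
    return dict(pages)
```

Called on an open shard (for example `group_members_by_page(open(path, "rb"))`), this returns one dictionary per page, so a page without masks simply lacks the `seg.npz` field.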
## Quick Start
```python
import numpy as np
from datasets import load_dataset

# Load the pages dataset in streaming mode.
# For the tiny subset this page describes, the repo ID is
# "emanuelevivoli/comix-v0_1-pages-tiny".
pages = load_dataset(
    "emanuelevivoli/comix-v0_1-pages",
    split="train",
    streaming=True,
)

# Iterate through pages
for page in pages:
    # Metadata
    metadata = page["json"]
    book_id = metadata["book_id"]
    page_number = metadata["page_number"]
    page_class = metadata["page_class"]  # Story, Cover, Ads

    # Image
    image = page["jpg"]  # PIL Image

    # Detections
    if "detections" in metadata:
        fasterrcnn = metadata["detections"].get("fasterrcnn", {})
        panels = fasterrcnn.get("panels", [])
        characters = fasterrcnn.get("characters", [])
        faces = fasterrcnn.get("faces", [])
        textboxes = fasterrcnn.get("textboxes", [])

    # Segmentation masks (if available)
    if "seg.npz" in page and metadata.get("has_segmentation"):
        seg_data = np.load(page["seg.npz"])
```
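Building on the loop above, a quick sanity check is to count how many pages fall into each `page_class` and how many detections each category contains. The sketch below is one possible way to do this. The helper name `summarize_pages` and the `limit` parameter are hypothetical, and the code only relies on the metadata keys shown in the Quick Start.

```python
from collections import Counter
from itertools import islice

DETECTION_KEYS = ("panels", "characters", "faces", "textboxes")


def summarize_pages(samples, limit=None):
    """Count pages per class and FasterRCNN detections per category.

    `samples` is any iterable of dicts with a "json" metadata entry;
    `limit` caps how many samples are read (None reads all of them).
    """
    pages_per_class = Counter()
    detections_per_category = Counter()
    for sample in islice(samples, limit):
        metadata = sample["json"]
        pages_per_class[metadata.get("page_class", "unknown")] += 1
        fasterrcnn = metadata.get("detections", {}).get("fasterrcnn", {})
        for key in DETECTION_KEYS:
            # Only list lengths are used, so the format of a single
            # detection entry does not matter here.
            detections_per_category[key] += len(fasterrcnn.get(key, []))
    return pages_per_class, detections_per_category
```

With a streaming dataset, passing a small `limit` (for example `summarize_pages(pages, limit=100)`) avoids reading the whole split.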
## Dataset Structure
### Page JSON Schema
```json
{
  "page_id": "c00004_p006",
  "book_id": "c00004",
  "page_number": 6,
  "page_class": "Story",
  "split": "train",
  "image": {
    "file": "c00004_p006.jpg",
    "width": 1280,
    "height": 1920
  },
  "detections": {
    "fasterrcnn": {
      "panels": [...],
      "characters": [...],
      "faces": [...],
      "textboxes": [...]
    }
  },
  "has_captions": true,
  "has_features": true,
  "has_masks": true,
  "has_segmentation": true,
  "segmentation_info": {
    "available": true,
    "model": "SAMv2",
    "type": "mask",
    "file": "c00004_p006.seg.npz"
  }
}
```
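For downstream code it can help to turn the metadata dictionary into a typed record. The following is a minimal sketch based only on the fields shown in the schema above. The `PageRecord` class and `parse_page` function are hypothetical names, and treating the segmentation file as absent unless both `has_segmentation` and `segmentation_info.available` are true is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageRecord:
    page_id: str
    book_id: str
    page_number: int
    page_class: str
    split: str
    width: int
    height: int
    segmentation_file: Optional[str]


def parse_page(metadata: dict) -> PageRecord:
    """Build a PageRecord from one page's JSON metadata."""
    image = metadata["image"]
    seg_info = metadata.get("segmentation_info") or {}
    # Assumption: masks are usable only when both flags agree.
    has_masks = bool(metadata.get("has_segmentation")) and bool(seg_info.get("available"))
    return PageRecord(
        page_id=metadata["page_id"],
        book_id=metadata["book_id"],
        page_number=int(metadata["page_number"]),
        page_class=metadata["page_class"],
        split=metadata["split"],
        width=int(image["width"]),
        height=int(image["height"]),
        segmentation_file=seg_info.get("file") if has_masks else None,
    )
```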
## Data Splits
| Split | Pages |
|-------|-------|
| Train | 3285 |
| Validation | 88 |
| Test | 148 |
| **Total** | **3521** |
**Split Strategy**: Books are assigned to splits based on MD5 hash matching with C100 and DCM benchmark datasets.
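The card does not spell out how the hash matching is done, so the following is only an illustrative sketch of the general idea: hash each book, look the digest up in a table of benchmark hashes, and fall back to a default split otherwise. Which bytes are hashed, the `benchmark_hashes` mapping, the `"train"` fallback, and both function names are assumptions made for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of some bytes."""
    return hashlib.md5(data).hexdigest()


def assign_split(book_bytes: bytes, benchmark_hashes: dict, default: str = "train") -> str:
    """Pick a split for a book by MD5 lookup.

    `benchmark_hashes` maps an MD5 hex digest to a split name
    (hypothetical table built from the benchmark datasets).
    Books with no match fall back to `default` (an assumption).
    """
    return benchmark_hashes.get(md5_hex(book_bytes), default)
```

Because the split depends only on the content hash, every page of a book lands in the same split, which keeps benchmark books out of the training data.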
## Use Cases
✅ **Panel Detection**: Train models to detect comic panels
✅ **Character Recognition**: Identify and track characters
✅ **Text Extraction**: Detect and extract textboxes and speech bubbles
✅ **Page Classification**: Classify pages as Story, Cover, or Ads
✅ **Segmentation**: Use SAMv2 masks for panel and character segmentation
✅ **Captioning**: Generate captions for panels and pages
## Companion Dataset
**comix-v0_1-books**: Book-level metadata for this dataset
## Known Issues
Version 0.1 of this dataset has a few known issues:
- Reading tar file `00580` fails with an "unexpected end of file" error
- Some panels have no captions, and existing captions (generated with `Molmo-72B int4`) may be of poor quality
- Detections may be inaccurate (the `fasterrcnn` and `magiv1` detectors are not very reliable)
- Segmentation masks come from SAMv2 prompted with `fasterrcnn` detections, so they can be poor when the detections are poor
We plan to address these issues in future versions.
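Until the damaged shard is fixed, one possible workaround is to read it defensively and keep whatever members come before the break. The sketch below uses only the standard library. The function name `read_shard_tolerantly` is hypothetical, and the code assumes a plain, uncompressed tar file.

```python
import io
import tarfile


def read_shard_tolerantly(fileobj):
    """Read as many members as possible from a possibly truncated tar.

    Returns (members, truncated), where `members` maps member names to
    raw bytes and `truncated` is True if reading stopped on an error.
    """
    members = {}
    truncated = False
    try:
        with tarfile.open(fileobj=fileobj, mode="r:") as tar:
            for member in tar:
                if member.isfile():
                    members[member.name] = tar.extractfile(member).read()
    except (tarfile.ReadError, EOFError):
        # The archive ended in the middle of a member: keep what was read.
        truncated = True
    return members, truncated
```

A member that is cut off is dropped rather than returned partially, so a page may be missing some of its files. Checking that each page still has both its `.jpg` and `.json` before use is advisable.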
## Processing Pipeline
1. **Detection**: FasterRCNN for panels, characters, faces, and textboxes
2. **Segmentation**: SAMv2 with FasterRCNN prompts
3. **Captioning**: Molmo-72B int4 for panel captions
4. **Features**: Visual features extracted for each panel
## Citation
```bibtex
@dataset{comix_v0_1_pages_2025,
title={Comic Books Dataset v0.1 - Pages},
author={Emanuele Vivoli},
year={2025},
publisher={Hugging Face},
note={Production dataset - DCM-22K source},
url={https://huggingface.co/datasets/emanuelevivoli/comix-v0_1-pages}
}
```
## License
Public Domain (CC0-1.0) - Digital Comic Museum
## Updates
- **v0.1 (2025-11-19)**: Initial release
  - 3521 pages from DCM-22K
  - Group-based processing (15 groups)
  - Split-organized tar files
  - SAMv2 segmentation masks
Provider: emanuelevivoli



