
emanuelevivoli/comix-v0_1-pages-tiny

Hugging Face · Updated 2025-11-19 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/emanuelevivoli/comix-v0_1-pages-tiny
Resource description:
---
license: cc0-1.0
task_categories:
- image-to-text
- object-detection
- image-segmentation
tags:
- comics
- computer-vision
- panel-detection
- dcm-22k
size_categories:
- 100K<n<1M
---

# Comic Books Dataset v0.1 - Pages

**Full dataset of comic book pages from Digital Comic Museum.**

This is the PRODUCTION dataset. For testing, use `comix_v0_tiny_pages`.

## Dataset Description

- **Total Pages**: 3521
- **Source**: Digital Comic Museum (DCM-22K)
- **Format**: WebDataset (tar files)
- **License**: Public Domain (CC0-1.0)
- **Version**: 0.1 (Group-based processing)

## What's Included

Each page has:

- `{page_id}.jpg` - Page image
- `{page_id}.json` - Metadata (detections, captions, page class)
- `{page_id}.seg.npz` - Segmentation masks (SAMv2) [when available]

## Quick Start

```python
import numpy as np
from datasets import load_dataset

# Load pages dataset
pages = load_dataset(
    "emanuelevivoli/comix-v0_1-pages",
    split="train",
    streaming=True
)

# Iterate through pages
for page in pages:
    # Metadata
    metadata = page["json"]
    book_id = metadata["book_id"]
    page_number = metadata["page_number"]
    page_class = metadata["page_class"]  # Story, Cover, Ads

    # Image
    image = page["jpg"]  # PIL Image

    # Detections
    if "detections" in metadata:
        panels = metadata["detections"].get("fasterrcnn", {}).get("panels", [])
        characters = metadata["detections"].get("fasterrcnn", {}).get("characters", [])
        faces = metadata["detections"].get("fasterrcnn", {}).get("faces", [])
        textboxes = metadata["detections"].get("fasterrcnn", {}).get("textboxes", [])

    # Segmentation masks (if available)
    if "seg.npz" in page and metadata.get("has_segmentation"):
        seg_data = np.load(page["seg.npz"])
```

## Dataset Structure

### Page JSON Schema

```json
{
  "page_id": "c00004_p006",
  "book_id": "c00004",
  "page_number": 6,
  "page_class": "Story",
  "split": "train",
  "image": {
    "file": "c00004_p006.jpg",
    "width": 1280,
    "height": 1920
  },
  "detections": {
    "fasterrcnn": {
      "panels": [...],
      "characters": [...],
      "faces": [...],
      "textboxes": [...]
    }
  },
  "has_captions": true,
  "has_features": true,
  "has_masks": true,
  "has_segmentation": true,
  "segmentation_info": {
    "available": true,
    "model": "SAMv2",
    "type": "mask",
    "file": "c00004_p006.seg.npz"
  }
}
```

## Data Splits

| Split | Pages |
|-------|-------|
| Train | 3285 |
| Validation | 88 |
| Test | 148 |
| **Total** | **3521** |

**Split Strategy**: Books are assigned to splits based on MD5 hash matching with C100 and DCM benchmark datasets.

## Use Cases

- ✅ **Panel Detection**: Train models to detect comic panels
- ✅ **Character Recognition**: Identify and track characters
- ✅ **Text Extraction**: Detect and extract textboxes and speech bubbles
- ✅ **Page Classification**: Classify pages as Story, Cover, or Ads
- ✅ **Segmentation**: Use SAMv2 masks for panel and character segmentation
- ✅ **Captioning**: Generate captions for panels and pages

## Companion Dataset

**comix-v0_1-books**: Book-level metadata for this dataset

## Known Issues

This `dataset v0.1` has a few known issues:

- Tar file `00580` has an "unexpected end of file" problem
- Some panels have no captions, and existing captions (generated with `Molmo-72B int4`) may be of poor quality
- Detections may be poor (`fasterrcnn` and `magiv1` are not very good)
- Segmentation masks are produced by SAMv2 prompted with `fasterrcnn` detections, so they can be poor as well

We will solve these issues in future versions.

## Processing Pipeline

1. **Detection**: FasterRCNN for panels, characters, faces, and textboxes
2. **Segmentation**: SAMv2 with FasterRCNN prompts
3. **Captioning**: Molmo-72B int4 for panel captions
4. **Features**: Visual features extracted for each panel

## Citation

```bibtex
@dataset{comix_v0_1_pages_2025,
  title={Comic Books Dataset v0.1 - Pages},
  author={Emanuele Vivoli},
  year={2025},
  publisher={Hugging Face},
  note={Production dataset - DCM-22K source},
  url={https://huggingface.co/datasets/emanuelevivoli/comix-v0_1-pages}
}
```

## License

Public Domain (CC0-1.0) - Digital Comic Museum

## Updates

- **v0.1 (2025-11-19)**: Initial release
  - 3521 pages from DCM-22K
  - Group-based processing (15 groups)
  - Split-organized tar files
  - SAMv2 segmentation masks
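
## Example: Summarizing Page Metadata

The page JSON schema described above nests detections under a detector name and encodes the book and page in `page_id`. The sketch below shows one way to work with such records defensively. It is not part of the dataset's tooling: the helper names `split_page_id` and `count_detections` are hypothetical, and the assumption that every `page_id` follows the `c00004_p006` pattern rests only on the single schema example.

```python
import re

# Detection categories listed in the page JSON schema.
DETECTION_KEYS = ("panels", "characters", "faces", "textboxes")

# Assumed ID pattern, inferred from the one example "c00004_p006".
PAGE_ID_PATTERN = re.compile(r"^(?P<book>c\d+)_p(?P<page>\d+)$")


def split_page_id(page_id: str) -> tuple[str, int]:
    """Split a page ID into (book_id, page_number); raise if it does not match."""
    match = PAGE_ID_PATTERN.match(page_id)
    if match is None:
        raise ValueError(f"unexpected page_id format: {page_id!r}")
    return match.group("book"), int(match.group("page"))


def count_detections(metadata: dict, detector: str = "fasterrcnn") -> dict:
    """Count detections per category, treating missing keys as zero."""
    per_detector = metadata.get("detections", {}).get(detector, {})
    return {key: len(per_detector.get(key, [])) for key in DETECTION_KEYS}


if __name__ == "__main__":
    # Hypothetical record shaped like the schema; detection entries are
    # placeholders because the schema elides their contents.
    record = {
        "page_id": "c00004_p006",
        "detections": {"fasterrcnn": {"panels": [{}, {}], "faces": [{}]}},
    }
    print(split_page_id(record["page_id"]))
    print(count_detections(record))
```

Counting with `.get(...)` defaults mirrors the Quick Start code, so pages without a `detections` entry yield zeros instead of raising `KeyError`.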
Provided by:
emanuelevivoli