KMasaki/cc12m-sam-parse-tree

Name: KMasaki/cc12m-sam-parse-tree
Creator: KMasaki
Published: 2026-03-14 08:11:01
License: cc-by-4.0 (from the dataset card metadata)

Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/KMasaki/cc12m-sam-parse-tree

Description:

---
dataset_info:
  features:
  - name: jpg
    dtype: image
  - name: txt
    dtype: string
  - name: njson
    dtype: string
  - name: samlens.npy
    dtype: binary
  - name: samcat.npy
    dtype: binary
  splits:
  - name: train
    num_examples: 10968539
configs:
- config_name: default
  data_files:
  - split: train
    path: "cc12m-train-*.tar"
license: cc-by-4.0
task_categories:
- zero-shot-image-classification
- image-to-text
- text-to-image
tags:
- clip
- webdataset
- sam
- region-phrase-alignment
size_categories:
- 10M<n<100M
---

# CC12M with SAM Regions and Parse-Tree Phrases

Pre-processed [CC12M](https://github.com/google-research-datasets/conceptual-12m) dataset for training [PowerCLIP](https://github.com/KMasaki/PowerCLIP).

Each sample contains the original image and caption plus two precomputed annotations:

- **Parse-tree phrases** (`.njson`): NP/PP/VP/S constituent phrases extracted via spaCy, with token indices aligned to OpenCLIP's `SimpleTokenizer` (CSR format).
- **SAM regions** (`.samlens.npy` + `.samcat.npy`): Segment Anything Model (SAM ViT-H) region bounding boxes converted to ViT patch-grid token indices (CSR format, patch size 16, image size 224).

## Format

WebDataset tar archives (2176 shards). Each sample contains:

```
{key}.jpg           # Image
{key}.txt           # Caption
{key}.json          # Metadata (original CC12M fields)
{key}.njson         # Parse-tree phrase indices (CSR: lengths + token IDs)
{key}.samlens.npy   # SAM region lengths array
{key}.samcat.npy    # SAM region token indices (concatenated)
```

## Usage

```python
import webdataset as wds

# Without a .decode() stage, WebDataset yields every field as raw bytes.
dataset = wds.WebDataset("cc12m-train-{0000..2175}.tar")
for sample in dataset:
    image = sample["jpg"]                    # raw JPEG bytes
    caption = sample["txt"].decode("utf-8")  # caption string
    # SAM regions and parse-tree phrases are loaded automatically
    # by PowerCLIP's data pipeline
```

Or use with PowerCLIP directly:

```bash
torchrun --nproc_per_node 8 -m training.main \
  --train-data "cc12m-train-{0000..2175}.tar" \
  ...
```

## Source

- Images & captions: [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m) (CC-BY-4.0)
- SAM regions: [Segment Anything (ViT-H)](https://github.com/facebookresearch/segment-anything)
- Parse-tree phrases: [spaCy](https://spacy.io/) `en_core_web_sm`
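
## Example: decoding the CSR region arrays

The card describes the SAM regions as a CSR-style pair: a lengths array (`.samlens.npy`) and a concatenated index array (`.samcat.npy`). The sketch below shows one plausible way to read such a pair outside PowerCLIP's pipeline. It is not taken from the PowerCLIP code base, and it rests on several assumptions: that each payload is a standard `.npy` file readable with `np.load`, that the lengths sum to the size of the concatenated array, and that patch indices are row-major on the 14×14 grid implied by image size 224 and patch size 16, with no offset for a class token. The helper names `load_npy_bytes`, `split_csr`, and `box_to_patch_indices` are hypothetical.

```python
import io

import numpy as np


def load_npy_bytes(raw: bytes) -> np.ndarray:
    """Deserialize a .npy payload that was read from the tar as raw bytes."""
    return np.load(io.BytesIO(raw), allow_pickle=False)


def split_csr(lengths: np.ndarray, concat: np.ndarray) -> list[np.ndarray]:
    """Split a concatenated index array into one array per region."""
    lengths = np.asarray(lengths, dtype=np.int64)
    if lengths.sum() != len(concat):
        raise ValueError("lengths do not sum to the size of the concatenated array")
    if len(lengths) == 0:
        return []
    # Split points are the cumulative lengths, excluding the final total.
    offsets = np.cumsum(lengths)[:-1]
    return np.split(concat, offsets)


def box_to_patch_indices(x0, y0, x1, y1, image_size=224, patch_size=16):
    """Row-major patch indices overlapped by a pixel box (x1, y1 exclusive).

    Assumption: no class-token offset is applied to the indices.
    """
    grid = image_size // patch_size  # 14 patches per side for 224 / 16
    rows = np.arange(y0 // patch_size, (y1 - 1) // patch_size + 1)
    cols = np.arange(x0 // patch_size, (x1 - 1) // patch_size + 1)
    return (rows[:, None] * grid + cols[None, :]).ravel()


def to_npy_bytes(arr: np.ndarray) -> bytes:
    """Serialize an array to .npy bytes (used here only to build toy data)."""
    buf = io.BytesIO()
    np.save(buf, arr)
    return buf.getvalue()


# Toy data standing in for sample["samlens.npy"] and sample["samcat.npy"].
lens_raw = to_npy_bytes(np.array([2, 3]))
cat_raw = to_npy_bytes(np.array([0, 1, 14, 15, 16]))

regions = split_csr(load_npy_bytes(lens_raw), load_npy_bytes(cat_raw))
print([r.tolist() for r in regions])  # [[0, 1], [14, 15, 16]]
```

With a real shard, the two byte strings would come from the WebDataset sample; the keys are expected to be `samlens.npy` and `samcat.npy`, since WebDataset treats everything after the first dot in a file name as the field name. The actual dtype and any index offset should be checked against the data before relying on this.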

Provider:

KMasaki
