data-archetype/cc12_imagenet21k_recap_hq_bucketed
Hugging Face | Updated 2026-01-11 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/data-archetype/cc12_imagenet21k_recap_hq_bucketed
Resource description:
---
pretty_name: "cc12_imagenet21k_recap_hq_bucketed"
# Edit the following fields before uploading, if desired.
license: other
task_categories:
- text-to-image
language:
- en
tags:
- webdataset
- images
- captions
---
# cc12_imagenet21k_recap_hq_bucketed
- Title: cc12_imagenet21k_recap_hq_bucketed
- Description: This dataset of ~18M rows is a re-upload of https://huggingface.co/datasets/gmongaras/CC12M_and_Imagenet21K_Recap_Highqual in which the images have been pre-bucketed into SDXL-style aspect-ratio buckets for training at ~512^2 and ~256^2 pixels, and in which about 7M rows were recaptioned with either Gemini or Ministral.
To avoid re-encoding, the images have been left untouched, so cropping and resizing must be done at load time.
---
## Technical details
This repository contains a **bucketed-shards** export (uncompressed TAR shards).
## Format
- **Format**: `bucketed_shards_v2`
- **Created**: `2026-01-10T15:53:34.486914+00:00`
- **Export ID**: `export-2026-01-10T15:53:34.486914+00:00`
- **Manifest**: `manifest.json`
- **Image mode**: `passthrough_jpeg`
Directory layout:
- `manifest.json` (global metadata + per-bucket shard listing)
- `buckets/<bucket_id>/shard-*.tar`
Each TAR shard contains 3 files per sample:
- `<key>.jpg` (JPEG bytes; either re-encoded RGB JPEG or source JPEG passthrough depending on `image_mode`)
- `<key>.txt` (caption text, UTF-8, newline-terminated)
- `<key>.json` (per-sample metadata: `w`, `h`, `jpeg`, `image_mode`, `caption_variant`, `caption_selector_index`, `caption_source_id`)
## Image preprocessing
Unlike other datasets available in this repo, the images have been left unprocessed and are only pre-bucketed into target aspect-ratio buckets.
Intended resize/crop procedure at load time:
- Cover scale is `scale = max(target_w / src_w, target_h / src_h)`; if `scale > 1`, the sample is skipped.
- After resize, a crop box is chosen deterministically from the sample key (sha256 of `image_id`).
- Corner strategy chooses a corner from `allowed_corners` where `0=TL, 1=TR, 2=BL, 3=BR` (optional small jitter for `corner_jitter`).
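The steps above can be sketched in plain Python. This is a minimal illustration, not the exporter's actual code: the function name `cover_resize_and_crop`, the rounding of the resized dimensions, and the way the sha256 digest is mapped to an entry of `allowed_corners` are all assumptions, and `corner_jitter` is left out.

```python
import hashlib


def cover_resize_and_crop(src_w, src_h, target_w, target_h, image_id,
                          allowed_corners=(0, 1, 2, 3)):
    """Return ((resized_w, resized_h), crop_box), or None if the sample is too small.

    crop_box is (left, top, right, bottom) in resized-image coordinates.
    """
    # Cover scale: the resized image must fully cover the target box.
    scale = max(target_w / src_w, target_h / src_h)
    if scale > 1:
        return None  # would require upscaling, so the sample is skipped

    # Rounding is an assumption; max() guards against float error
    # leaving the resized image one pixel short of the target.
    new_w = max(target_w, round(src_w * scale))
    new_h = max(target_h, round(src_h * scale))

    # ASSUMPTION: derive a deterministic index from sha256(image_id).
    # The real mapping from digest to corner is not documented here.
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    corner = allowed_corners[int.from_bytes(digest[:8], "big") % len(allowed_corners)]

    # 0=TL, 1=TR, 2=BL, 3=BR
    left = 0 if corner in (0, 2) else new_w - target_w
    top = 0 if corner in (0, 1) else new_h - target_h
    return (new_w, new_h), (left, top, left + target_w, top + target_h)
```

With Pillow, the result could be applied as `img.resize(size).crop(box)`. Because the corner depends only on `image_id`, the same sample gets the same crop in every epoch.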
## Buckets / resolutions
- Buckets follow **SDXL-style proto buckets** defined at a 1024×1024 base.
- Base resolution(s): `[512, 256]`
- In single-res exports, `bucket_id` is the **proto** (1024-base) bucket, e.g. `p1024x1024`.
- In multi-res exports, buckets are namespaced by base resolution: `r<base>_<proto>`, e.g. `r512_p1024x1024`.
- The **actual target resolution** for each bucket (scaled by the per-bucket base resolution and `divisible=32`) is stored in:
  - `manifest.json` → `buckets[<bucket_id>].scaled.w/h` (and `base_resolution`)
  - each sample’s `<key>.json` → `w/h`
Bucket IDs (preview): `r256_p1024x1024`, `r256_p1088x896`, `r256_p1152x896`, `r256_p1216x832`, `r256_p1344x704`, `r256_p1344x768`, `r256_p1472x704`, `r256_p1600x640`, `r256_p1728x576`, `r256_p1856x512`, `r256_p1984x512`, `r256_p2048x512`, `r256_p512x1920`, `r256_p512x2048`, `r256_p576x1664`, `r256_p576x1792`, `r256_p640x1536`, `r256_p704x1408`, `r256_p768x1280`, `r256_p832x1152`, … (+42 more)
Bucket distribution:
| bucket_id | target_w×h | aspect | count |
| --- | --- | --- | --- |
| r256_p1152x896 | 288×224 | 1.286 | 2,917,754 |
| r256_p1216x832 | 288×192 | 1.500 | 2,194,582 |
| r512_p1216x832 | 608×416 | 1.462 | 2,103,325 |
| r512_p1152x832 | 576×416 | 1.385 | 1,670,669 |
| r512_p1024x1024 | 512×512 | 1.000 | 1,371,409 |
| r256_p1024x1024 | 256×256 | 1.000 | 1,076,075 |
| r256_p896x1152 | 224×288 | 0.778 | 973,834 |
| r256_p832x1152 | 192×288 | 0.667 | 852,564 |
| r512_p832x1216 | 416×608 | 0.684 | 776,575 |
| r512_p832x1152 | 416×576 | 0.722 | 600,705 |
| r512_p1344x768 | 672×384 | 1.750 | 598,503 |
| r512_p1152x896 | 576×448 | 1.286 | 347,970 |
| r256_p1088x896 | 256×224 | 1.143 | 333,188 |
| r512_p1280x768 | 640×384 | 1.667 | 327,133 |
| r512_p896x1152 | 448×576 | 0.778 | 310,259 |
| r256_p960x1024 | 224×256 | 0.875 | 237,259 |
| r256_p1344x768 | 320×192 | 1.667 | 210,921 |
| r512_p1088x896 | 544×448 | 1.214 | 158,051 |
| r512_p896x1088 | 448×544 | 0.824 | 151,487 |
| r512_p960x1024 | 480×512 | 0.938 | 151,345 |
| r512_p768x1280 | 384×640 | 0.600 | 110,424 |
| r512_p1344x704 | 672×352 | 1.909 | 107,761 |
| r512_p1088x960 | 544×480 | 1.133 | 103,963 |
| r512_p1024x960 | 512×480 | 1.067 | 101,368 |
| r512_p960x1088 | 480×544 | 0.882 | 93,788 |
| r256_p768x1280 | 192×320 | 0.600 | 88,633 |
| r256_p1344x704 | 320×160 | 2.000 | 84,077 |
| r512_p768x1344 | 384×672 | 0.571 | 71,153 |
| r512_p1408x704 | 704×352 | 2.000 | 67,854 |
| r512_p1472x704 | 736×352 | 2.091 | 41,942 |
| r512_p1536x640 | 768×320 | 2.400 | 32,080 |
| r256_p704x1408 | 160×352 | 0.455 | 29,786 |
| r256_p1600x640 | 384×160 | 2.400 | 28,130 |
| r512_p704x1408 | 352×704 | 0.500 | 26,929 |
| r256_p1472x704 | 352×160 | 2.200 | 24,696 |
| r256_p1728x576 | 416×128 | 3.250 | 14,366 |
| r512_p704x1472 | 352×736 | 0.478 | 13,622 |
| r256_p640x1536 | 160×384 | 0.417 | 11,144 |
| r512_p640x1536 | 320×768 | 0.417 | 8,459 |
| r512_p1600x640 | 800×320 | 2.500 | 7,885 |
| r256_p576x1664 | 128×416 | 0.308 | 4,034 |
| r256_p2048x512 | 512×128 | 4.000 | 3,650 |
| r512_p640x1600 | 320×800 | 0.400 | 3,129 |
| r256_p1856x512 | 448×128 | 3.500 | 2,886 |
| r256_p1984x512 | 480×128 | 3.750 | 2,369 |
| r512_p1664x576 | 832×288 | 2.889 | 1,293 |
| r512_p576x1664 | 288×832 | 0.346 | 913 |
| r512_p1792x576 | 896×288 | 3.111 | 732 |
| r256_p512x2048 | 128×512 | 0.250 | 730 |
| r256_p576x1792 | 128×448 | 0.286 | 729 |
| r256_p512x1920 | 128×480 | 0.267 | 565 |
| r512_p1856x512 | 928×256 | 3.625 | 522 |
| r512_p576x1792 | 288×896 | 0.321 | 518 |
| r512_p512x1856 | 256×928 | 0.276 | 476 |
| r512_p1728x576 | 864×288 | 3.000 | 450 |
| r512_p576x1728 | 288×864 | 0.333 | 313 |
| r512_p1920x512 | 960×256 | 3.750 | 136 |
| r512_p512x1920 | 256×960 | 0.267 | 114 |
| r512_p1984x512 | 992×256 | 3.875 | 88 |
| r512_p512x1984 | 256×992 | 0.258 | 74 |
| r512_p512x2048 | 256×1024 | 0.250 | 59 |
| r512_p2048x512 | 1024×256 | 4.000 | 56 |
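The target sizes in the table above are consistent with scaling each proto dimension by `base / 1024` and rounding down to a multiple of 32. The sketch below encodes that inferred rule; it is an assumption derived from the table rows, not a documented formula, and the helper names `parse_bucket_id` and `scaled_size` are hypothetical. The values stored in `manifest.json` remain the authoritative source.

```python
import re

PROTO_BASE = 1024  # proto buckets are defined at a 1024x1024 base

# Matches both "p1024x1024" (single-res) and "r512_p1024x1024" (multi-res).
_BUCKET_RE = re.compile(r"^(?:r(?P<base>\d+)_)?p(?P<w>\d+)x(?P<h>\d+)$")


def parse_bucket_id(bucket_id):
    """Split a bucket id into (base_resolution or None, proto_w, proto_h)."""
    m = _BUCKET_RE.match(bucket_id)
    if m is None:
        raise ValueError(f"unrecognized bucket id: {bucket_id!r}")
    base = int(m["base"]) if m["base"] else None
    return base, int(m["w"]), int(m["h"])


def scaled_size(bucket_id, divisible=32, default_base=PROTO_BASE):
    """Estimate the target (w, h) for a bucket.

    ASSUMPTION: scale by base/1024, then floor to a multiple of `divisible`.
    For ids without an r<base>_ prefix, `default_base` is used.
    """
    base, proto_w, proto_h = parse_bucket_id(bucket_id)
    base = base or default_base

    def snap(v):
        return (v * base // PROTO_BASE) // divisible * divisible

    return snap(proto_w), snap(proto_h)
```

For example, `scaled_size("r256_p1216x832")` gives `(288, 192)` and `scaled_size("r512_p1216x832")` gives `(608, 416)`, matching the corresponding table rows.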
## Caption selection
Available caption variants:
| selected | variant | images_with_ok_caption |
| --- | --- | --- |
| ✓ | caption_original | 18,655,051 |
| ✓ | caption_ministral_14b_2512 | 4,000,718 |
| ✓ | caption_gemini | 3,299,328 |
Missing caption policy: `drop`
## Export summary
- images_seen: 18,655,051
- images_exported: 18,455,504
- skipped_no_caption: 0
- skipped_too_small: 199,547
- decode_errors: 0
- encode_errors: 0
## Efficient loading
### Recommended
Treat this as a **webdataset-style** collection of tar shards:
- Prefer **sequential reads** of tar files for throughput.
- Shuffle at the **shard level** (and optionally within-shard) for good randomness without expensive random I/O.
- Use `manifest.json` to list buckets and shards.
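As a rough sketch of shard-level shuffling, the helpers below group shards by bucket and produce a reproducible per-epoch order. The exact layout of the shard listing inside `manifest.json` is not described here, so this version discovers shards from the documented `buckets/<bucket_id>/shard-*.tar` directory layout instead. The function names and the seed-mixing scheme are assumptions for illustration.

```python
import random
from collections import defaultdict
from pathlib import Path


def list_shards(root):
    """Map bucket_id -> sorted list of shard paths under <root>/buckets/."""
    shards = defaultdict(list)
    for path in sorted(Path(root).glob("buckets/*/shard-*.tar")):
        shards[path.parent.name].append(path)
    return dict(shards)


def epoch_order(shards_by_bucket, seed, epoch):
    """Return a shuffled list of (bucket_id, shard_path) for one epoch.

    The order depends only on (seed, epoch), so every worker that calls
    this with the same arguments sees the same sequence.
    """
    order = [
        (bucket_id, shard)
        for bucket_id, bucket_shards in sorted(shards_by_bucket.items())
        for shard in bucket_shards
    ]
    # ASSUMPTION: simple integer mixing of seed and epoch.
    rng = random.Random(seed * 1_000_003 + epoch)
    rng.shuffle(order)
    return order
```

Each shard holds samples from a single bucket, so reading one shard at a time and batching within it keeps every batch at a single target resolution. Each shard is still read sequentially.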
#### Python (`webdataset`)
```python
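import glob

import webdataset as wds

# WebDataset does not expand "*" wildcards itself, so expand the glob first.
# Narrow the pattern (e.g. "buckets/r512_p1024x1024/shard-*.tar") for a single bucket.
urls = sorted(glob.glob("buckets/*/shard-*.tar"))

ds = (
    wds.WebDataset(urls)
    .decode("pil")  # decodes .jpg to PIL.Image
    .to_tuple("jpg", "txt", "json")
)

for jpg, caption, meta in ds:
    ...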
```
#### Python (`tarfile`, no extra deps)
```python
import json
import tarfile
from pathlib import Path

# Pick the first shard found under buckets/.
tar_path = next(Path("buckets").rglob("shard-*.tar"))
with tarfile.open(tar_path, "r") as tf:
    members = tf.getmembers()
    for m in members:
        # Use each caption file as the anchor for its sample.
        if not m.name.endswith(".txt"):
            continue
        key = m.name[:-4]  # strip ".txt"
        caption = tf.extractfile(m).read().decode("utf-8").strip()
        meta = json.loads(tf.extractfile(tf.getmember(key + ".json")).read().decode("utf-8"))
        jpg_bytes = tf.extractfile(tf.getmember(key + ".jpg")).read()
        ...
```
Provided by:
data-archetype



