data-archetype/LAION_Aesthetics_512_bucketed_512
Source: Hugging Face (updated 2026-04-20, indexed 2026-04-26)
Download link:
https://hf-mirror.com/datasets/data-archetype/LAION_Aesthetics_512_bucketed_512
Resource description:
---
pretty_name: "LAION Aesthetics 512 Bucketed 512 Captioned"
license: other
task_categories:
- text-to-image
language:
- en
tags:
- webdataset
- images
- captions
- laion
- bucketed-shards
---
# LAION Aesthetics 512 Bucketed 512 Captioned
This is a captioned bucketed-shards export of images from `limingcv/LAION_Aesthetics_512`.
Images were filtered and resized/cropped into SDXL-style aspect-ratio buckets at a 512 base resolution, without upsampling. The export contains `1,999,908` images across `1,976` uncompressed WebDataset-style tar shards.
The `.txt` files contain model-generated captions, not the original LAION web-scrape alt text or surrounding page text. Captions were generated from the bucketed images using OpenRouter models in this priority order:
1. `google/gemini-2.0-flash-lite-001`
2. `mistralai/mistral-medium-3.1`
Each sample's `.json` metadata records `caption_variant`, `caption_selector_index`, and `caption_source_id`. `manifest.json` records the caption source table and prompt hashes.
Placeholder, unavailable-image, and blank-image samples caught during caption QA were removed before this export was finalized.
## Technical details
This repository contains a **bucketed-shards** export (uncompressed TAR shards).
## Format
- **Format**: `bucketed_shards_v1`
- **Created**: `2026-02-27T11:42:10.753994+00:00`
- **Export ID**: `ede6a33a8b304ed1`
- **Manifest**: `manifest.json`
- **Image mode**: `reencode_jpeg`
Directory layout:
- `manifest.json` (global metadata + per-bucket shard listing)
- `buckets/<bucket_id>/shard-*.tar`
Each TAR shard contains 3 files per sample:
- `<key>.jpg` (JPEG bytes; either re-encoded RGB JPEG or source JPEG passthrough depending on `image_mode`)
- `<key>.txt` (caption text, UTF-8, newline-terminated)
- `<key>.json` (per-sample metadata including `target_w`, `target_h`, `bucket_id`, `caption_variant`, `caption_selector_index`, and `caption_source_id`)
## Image preprocessing
If `image_mode=reencode_jpeg`, images are processed deterministically per-sample:
- EXIF transpose, convert to RGB
- **Cover-resize** using **torch CPU** bicubic interpolation with antialiasing (`mode=bicubic`, `antialias=True`)
- **Never upsample**: samples that would require upscaling are skipped (`too_small_policy=drop`)
- Crop to the bucket target size (`crop_strategy=corner`, allowed corners `[2, 3]`)
Resize/crop details:
- Cover scale is `scale = max(target_w / src_w, target_h / src_h)`; if `scale > 1`, the sample is skipped.
- After resize, a crop box is chosen deterministically from the sample key (sha256 of `image_id`).
- Corner strategy chooses a corner from `allowed_corners` where `0=TL, 1=TR, 2=BL, 3=BR` (optional small jitter for `corner_jitter`).
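The cover-scale rule and corner crop described above can be sketched in plain Python. This is a minimal illustration, not the exporter's code: the function names are hypothetical, rounding the resized dimensions to the nearest pixel is an assumption, and the way the exporter maps the sha256 of `image_id` to a corner is not documented here, so `pick_corner` uses one plausible mapping (first digest byte modulo the number of allowed corners). Jitter is not modelled.

```python
import hashlib


def cover_scale(src_w, src_h, target_w, target_h):
    """Scale that makes the source cover the target; None if upsampling would be needed."""
    scale = max(target_w / src_w, target_h / src_h)
    return None if scale > 1 else scale


def resized_size(src_w, src_h, scale, target_w, target_h):
    # Assumption: round to the nearest pixel, but never fall below the target size.
    return max(target_w, round(src_w * scale)), max(target_h, round(src_h * scale))


def pick_corner(image_id, allowed_corners=(2, 3)):
    # Hypothetical mapping: the first byte of sha256(image_id) selects a corner.
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    return allowed_corners[digest[0] % len(allowed_corners)]


def corner_crop_box(resized_w, resized_h, target_w, target_h, corner):
    """Return (left, top, right, bottom) for corner 0=TL, 1=TR, 2=BL, 3=BR."""
    left = resized_w - target_w if corner in (1, 3) else 0
    top = resized_h - target_h if corner in (2, 3) else 0
    return left, top, left + target_w, top + target_h


# Example: a 1000x700 source going into the 608x416 bucket.
scale = cover_scale(1000, 700, 608, 416)
if scale is not None:  # None means the sample would be dropped
    w, h = resized_size(1000, 700, scale, 608, 416)
    box = corner_crop_box(w, h, 608, 416, pick_corner("example-id"))
```

Because the corner depends only on the key, re-running the sketch on the same `image_id` always yields the same crop box.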
JPEG encoding:
- quality `95`
- subsampling policy `adaptive_scale` (adaptive threshold `0.85`)
If `image_mode=passthrough_jpeg`, the exporter stores the source file bytes as-is (no EXIF transpose / resize / crop / re-encode).
Bucket target metadata still refers to the planned target size for that bucket (not necessarily the encoded JPEG dimensions).
Loaders should decode the JPEG bytes, apply EXIF orientation if desired, then do resize/crop at load time.
## Buckets / resolutions
- Buckets follow **SDXL-style proto buckets** defined at a 1024×1024 base.
- Base resolution(s): `[512]`
- In single-res exports, `bucket_id` is the **proto** (1024-base) bucket, e.g. `p1024x1024`.
- In multi-res exports, buckets are namespaced by base resolution: `r<base>_<proto>`, e.g. `r512_p1024x1024`.
- The **actual target resolution** for each bucket (scaled by the per-bucket base resolution and `divisible=32`) is stored in:
  - `manifest.json` → `buckets[<bucket_id>].scaled.w/h` (and `base_resolution`)
  - each sample’s `<key>.json` → `target_w` / `target_h`
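For reference, the relationship between a proto bucket ID and its target size can be sketched as follows. This is an illustration only: `scaled_target` is a hypothetical helper, and rounding to the nearest multiple of `divisible` is an assumption (at base 512 every proto dimension halves exactly to a multiple of 32, so the rounding rule does not affect this export). The values in `manifest.json` and in each sample's `.json` remain authoritative.

```python
import re

# Accepts single-res IDs ("p1216x832") and multi-res IDs ("r512_p1216x832").
_BUCKET_RE = re.compile(r"^(?:r(?P<base>\d+)_)?p(?P<w>\d+)x(?P<h>\d+)$")


def scaled_target(bucket_id, base_resolution=512, divisible=32, proto_base=1024):
    """Map a bucket ID to its (target_w, target_h) at the given base resolution."""
    m = _BUCKET_RE.match(bucket_id)
    if m is None:
        raise ValueError(f"unrecognised bucket id: {bucket_id!r}")
    # A base encoded in the ID takes precedence over the argument.
    base = int(m["base"]) if m["base"] else base_resolution
    factor = base / proto_base

    def snap(value):
        # Assumed rounding: nearest multiple of `divisible`, never below one multiple.
        return max(divisible, round(value * factor / divisible) * divisible)

    return snap(int(m["w"])), snap(int(m["h"]))


print(scaled_target("p1216x832"))        # (608, 416)
print(scaled_target("r512_p1024x1024"))  # (512, 512)
```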
Bucket IDs (preview): `p1024x1024`, `p1024x960`, `p1088x896`, `p1088x960`, `p1152x832`, `p1152x896`, `p1216x832`, `p1280x768`, `p1344x704`, `p1344x768`, `p1408x704`, `p1472x704`, `p1536x640`, `p1600x640`, `p1664x576`, `p1728x576`, `p1792x576`, `p1856x512`, `p1920x512`, `p1984x512`, … (+20 more)
Bucket distribution:
| bucket_id | target_w×h | aspect | count |
| --- | --- | --- | --- |
| p1216x832 | 608×416 | 1.462 | 535,075 |
| p832x1216 | 416×608 | 0.684 | 226,268 |
| p1024x1024 | 512×512 | 1.000 | 218,348 |
| p1152x832 | 576×416 | 1.385 | 167,817 |
| p1344x768 | 672×384 | 1.750 | 138,497 |
| p832x1152 | 416×576 | 0.722 | 126,255 |
| p896x1152 | 448×576 | 0.778 | 122,382 |
| p1280x768 | 640×384 | 1.667 | 90,974 |
| p1152x896 | 576×448 | 1.286 | 90,688 |
| p896x1088 | 448×544 | 0.824 | 62,608 |
| p1088x896 | 544×448 | 1.214 | 44,030 |
| p960x1088 | 480×544 | 0.882 | 26,534 |
| p1088x960 | 544×480 | 1.133 | 23,278 |
| p1344x704 | 672×352 | 1.909 | 20,917 |
| p960x1024 | 480×512 | 0.938 | 18,569 |
| p768x1280 | 384×640 | 0.600 | 18,330 |
| p1024x960 | 512×480 | 1.067 | 17,845 |
| p1408x704 | 704×352 | 2.000 | 15,004 |
| p768x1344 | 384×672 | 0.571 | 10,823 |
| p1472x704 | 736×352 | 2.091 | 7,064 |
| p1536x640 | 768×320 | 2.400 | 6,159 |
| p704x1408 | 352×704 | 0.500 | 3,139 |
| p1600x640 | 800×320 | 2.500 | 2,867 |
| p704x1472 | 352×736 | 0.478 | 1,625 |
| p1664x576 | 832×288 | 2.889 | 1,236 |
| p1728x576 | 864×288 | 3.000 | 861 |
| p1792x576 | 896×288 | 3.111 | 626 |
| p640x1536 | 320×768 | 0.417 | 507 |
| p640x1600 | 320×800 | 0.400 | 343 |
| p1856x512 | 928×256 | 3.625 | 282 |
| p512x2048 | 256×1024 | 0.250 | 251 |
| p576x1664 | 288×832 | 0.346 | 214 |
| p576x1792 | 288×896 | 0.321 | 120 |
| p576x1728 | 288×864 | 0.333 | 100 |
| p2048x512 | 1024×256 | 4.000 | 86 |
| p512x1856 | 256×928 | 0.276 | 82 |
| p1920x512 | 960×256 | 3.750 | 39 |
| p1984x512 | 992×256 | 3.875 | 27 |
| p512x1984 | 256×992 | 0.258 | 26 |
| p512x1920 | 256×960 | 0.267 | 12 |
## Caption selection (waterfall)
Captions are selected from `dataset.sqlite` using the first matching selector (highest priority wins).
Within the same selector, the newest caption source is preferred.
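The waterfall rule can be illustrated with a small sketch. The in-memory record layout (dicts with `variant`, `caption_source_id`, `created_at`, and `text`) and the function name `select_caption` are hypothetical; the actual schema of `dataset.sqlite` is not described in this card.

```python
def select_caption(candidates, selectors):
    """Pick a caption by walking selectors in priority order.

    candidates: list of dicts with 'variant', 'caption_source_id', 'created_at', 'text'
    selectors:  variant names, highest priority first
    """
    for selector_index, variant in enumerate(selectors):
        matches = [c for c in candidates if c["variant"] == variant and c["text"]]
        if matches:
            # Within one selector, prefer the newest caption source.
            best = max(matches, key=lambda c: c["created_at"])
            return {
                "caption": best["text"],
                "caption_variant": variant,
                "caption_selector_index": selector_index,
                "caption_source_id": best["caption_source_id"],
            }
    return None  # no usable caption; the caller applies its missing-caption policy


selectors = ["caption_gemini_2_flash_lite", "caption_mistral_medium_3_1"]
candidates = [
    {"variant": "caption_mistral_medium_3_1", "caption_source_id": 2,
     "created_at": 1776684944, "text": "A red bicycle leaning on a wall."},
]
print(select_caption(candidates, selectors)["caption_selector_index"])  # 1
```

In the example only the lower-priority variant has a caption, so the second selector (index 1) wins, which corresponds to the small number of images in this export that fell through to the second model.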
Caption provenance:
- Per-sample `<key>.json` includes `caption_source_id` (int, from `dataset.sqlite`).
- `manifest.json` includes a `caption_sources` table mapping `caption_source_id` → backend/model/created_at plus prompt hashes (not prompt text).
Caption sources used:
| caption_source_id | backend | model | created_at | system_prompt_sha256 | user_prompt_sha256 |
| --- | --- | --- | --- | --- | --- |
| 1 | openrouter | google/gemini-2.0-flash-lite-001 | 1776589694 | 503ff8c1ba9c… | 6b4b2b1dc90b… |
| 2 | openrouter | mistralai/mistral-medium-3.1 | 1776684944 | 503ff8c1ba9c… | 6b4b2b1dc90b… |
Caption priority (waterfall) + planned usage:
| selector_index | variant | backend | model | planned_images |
| --- | --- | --- | --- | --- |
| 0 | caption_gemini_2_flash_lite | openrouter | google/gemini-2.0-flash-lite-001 | 1,999,846 |
| 1 | caption_mistral_medium_3_1 | openrouter | mistralai/mistral-medium-3.1 | 62 |
Available caption variants (top 30):
| selected | variant | images_with_ok_caption |
| --- | --- | --- |
| ✓ | caption_gemini_2_flash_lite | 1,999,846 |
| ✓ | caption_mistral_medium_3_1 | 62 |
Missing caption policy: `empty`
## Export summary
- images_seen: 1,999,908
- images_exported: 1,999,908
- skipped_no_caption: 0
- skipped_too_small: 0
- decode_errors: 0
- encode_errors: 0
## Efficient loading
### Recommended
Treat this as a **webdataset-style** collection of tar shards:
- Prefer **sequential reads** of tar files for throughput.
- Shuffle at the **shard level** (and optionally within-shard) for good randomness without expensive random I/O.
- Use `manifest.json` to list buckets and shards.
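A minimal sketch of shard-level shuffling, assuming the `buckets/<bucket_id>/shard-*.tar` layout has been downloaded to a local directory. It globs the filesystem rather than parsing `manifest.json`, whose exact schema is not reproduced in this card; `shards_by_bucket` and `shuffled_shards` are hypothetical helper names.

```python
import random
from collections import defaultdict
from pathlib import Path


def shards_by_bucket(root="."):
    """Group shard paths by bucket directory name."""
    groups = defaultdict(list)
    for path in sorted(Path(root).glob("buckets/*/shard-*.tar")):
        groups[path.parent.name].append(path)
    return dict(groups)


def shuffled_shards(groups, seed=0):
    """Return a new mapping with each bucket's shard order shuffled reproducibly."""
    rng = random.Random(seed)
    out = {}
    for bucket_id in sorted(groups):
        shards = list(groups[bucket_id])
        rng.shuffle(shards)
        out[bucket_id] = shards
    return out
```

Keeping shards grouped by bucket lets a loader draw each batch from a single bucket, so every image in the batch has the same target resolution.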
#### Python (`webdataset`)
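```python
import glob

import webdataset as wds

# WebDataset expands brace patterns, not globs, so resolve the shard list first.
# Narrow the pattern (e.g. "buckets/p1024x1024/shard-*.tar") to load a single bucket.
urls = sorted(glob.glob("buckets/*/shard-*.tar"))

ds = (
    wds.WebDataset(urls)
    .decode("pil")  # decodes .jpg to PIL.Image, .txt to str, .json to dict
    .to_tuple("jpg", "txt", "json")
)

for jpg, caption, meta in ds:
    ...
```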
#### Python (`tarfile`, no extra deps)
```python
import json
import tarfile
from pathlib import Path

# Read the first shard found under buckets/.
tar_path = next(Path("buckets").rglob("shard-*.tar"))
with tarfile.open(tar_path, "r") as tf:
    for m in tf.getmembers():
        if not m.name.endswith(".txt"):
            continue
        key = m.name[:-4]  # strip ".txt" to get the sample key
        caption = tf.extractfile(m).read().decode("utf-8").strip()
        meta = json.loads(tf.extractfile(tf.getmember(key + ".json")).read().decode("utf-8"))
        jpg_bytes = tf.extractfile(tf.getmember(key + ".jpg")).read()
        ...
```
Provider: data-archetype



