
data-archetype/LAION_Aesthetics_512_bucketed_512

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/data-archetype/LAION_Aesthetics_512_bucketed_512
Description:
---
pretty_name: "LAION Aesthetics 512 Bucketed 512 Captioned"
license: other
task_categories:
- text-to-image
language:
- en
tags:
- webdataset
- images
- captions
- laion
- bucketed-shards
---

# LAION Aesthetics 512 Bucketed 512 Captioned

This is a captioned bucketed-shards export of images from `limingcv/LAION_Aesthetics_512`. Images were filtered and resized/cropped into SDXL-style aspect-ratio buckets at a 512 base resolution, without upsampling.

The export contains `1,999,908` images across `1,976` uncompressed WebDataset-style tar shards.

The `.txt` files contain model-generated captions, not the original LAION web-scrape alt text or surrounding page text. Captions were generated from the bucketed images using OpenRouter models in this priority order:

1. `google/gemini-2.0-flash-lite-001`
2. `mistralai/mistral-medium-3.1`

Each sample's `.json` metadata records `caption_variant`, `caption_selector_index`, and `caption_source_id`. `manifest.json` records the caption source table and prompt hashes.

Placeholder, unavailable-image, and blank-image samples caught during caption QA were removed before this export was finalized.

## Technical details

This repository contains a **bucketed-shards** export (uncompressed TAR shards).

## Format

- **Format**: `bucketed_shards_v1`
- **Created**: `2026-02-27T11:42:10.753994+00:00`
- **Export ID**: `ede6a33a8b304ed1`
- **Manifest**: `manifest.json`
- **Image mode**: `reencode_jpeg`

Directory layout:

- `manifest.json` (global metadata + per-bucket shard listing)
- `buckets/<bucket_id>/shard-*.tar`

Each TAR shard contains 3 files per sample:

- `<key>.jpg` (JPEG bytes; either re-encoded RGB JPEG or source JPEG passthrough depending on `image_mode`)
- `<key>.txt` (caption text, UTF-8, newline-terminated)
- `<key>.json` (per-sample metadata including `target_w`, `target_h`, `bucket_id`, `caption_variant`, `caption_selector_index`, and `caption_source_id`)

## Image preprocessing

If `image_mode=reencode_jpeg`, images are processed deterministically per-sample:

- EXIF transpose, convert to RGB
- **Cover-resize** using **torch CPU** bicubic interpolation with antialiasing (`mode=bicubic`, `antialias=True`)
- **Never upsample**: samples that would require upscaling are skipped (`too_small_policy=drop`)
- Crop to the bucket target size (`crop_strategy=corner`, allowed corners `[2, 3]`)

Resize/crop details:

- Cover scale is `scale = max(target_w / src_w, target_h / src_h)`; if `scale > 1`, the sample is skipped.
- After resize, a crop box is chosen deterministically from the sample key (sha256 of `image_id`).
- Corner strategy chooses a corner from `allowed_corners` where `0=TL, 1=TR, 2=BL, 3=BR` (optional small jitter for `corner_jitter`).

JPEG encoding:

- quality `95`
- subsampling policy `adaptive_scale` (adaptive threshold `0.85`)

If `image_mode=passthrough_jpeg`, the exporter stores the source file bytes as-is (no EXIF transpose / resize / crop / re-encode). Bucket target metadata still refers to the planned target size for that bucket (not necessarily the encoded JPEG dimensions). Loaders should decode the JPEG bytes, apply EXIF orientation if desired, then do resize/crop at load time.
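
The cover-resize and corner-crop rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the exporter's code: `plan_resize_crop` is a hypothetical helper, the resized side is rounded up (the card does not state the rounding rule), the corner is picked from the first byte of the sha256 digest (the card only says the choice is derived from the hash), and `corner_jitter` is ignored.

```python
import hashlib


def plan_resize_crop(image_id, src_w, src_h, target_w, target_h, allowed_corners=(2, 3)):
    """Return the resize/crop plan for one sample, or None if it would need upscaling."""
    # Cover scale from the card: max(target_w / src_w, target_h / src_h).
    scale = max(target_w / src_w, target_h / src_h)
    if scale > 1:
        return None  # too_small_policy=drop: never upsample

    # Integer arithmetic avoids float rounding surprises.
    # Assumption: the non-binding side is rounded up so it still covers the target.
    if target_w * src_h >= target_h * src_w:  # width is the binding dimension
        new_w = target_w
        new_h = -(-src_h * target_w // src_w)  # ceiling division
    else:  # height is the binding dimension
        new_h = target_h
        new_w = -(-src_w * target_h // src_h)

    # Assumption (hypothetical mapping): first digest byte selects the corner.
    digest = hashlib.sha256(image_id.encode("utf-8")).digest()
    corner = allowed_corners[digest[0] % len(allowed_corners)]

    # Corner codes from the card: 0=TL, 1=TR, 2=BL, 3=BR.
    left = 0 if corner in (0, 2) else new_w - target_w
    top = 0 if corner in (0, 1) else new_h - target_h
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        "corner": corner,
        "box": (left, top, left + target_w, top + target_h),
    }
```

For a 1000×1000 source and the 608×416 target of bucket `p1216x832`, the width binds, the image is resized to 608×608, and either allowed corner (BL or BR) yields a crop taken from the bottom of the image.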

## Buckets / resolutions

- Buckets follow **SDXL-style proto buckets** defined at a 1024×1024 base.
- Base resolution(s): `[512]`
- In single-res exports, `bucket_id` is the **proto** (1024-base) bucket, e.g. `p1024x1024`.
- In multi-res exports, buckets are namespaced by base resolution: `r<base>_<proto>`, e.g. `r512_p1024x1024`.
- The **actual target resolution** for each bucket (scaled by the per-bucket base resolution and `divisible=32`) is stored in:
  - `manifest.json` → `buckets[<bucket_id>].scaled.w/h` (and `base_resolution`)
  - each sample’s `<key>.json` → `target_w` / `target_h`

Bucket IDs (preview): `p1024x1024`, `p1024x960`, `p1088x896`, `p1088x960`, `p1152x832`, `p1152x896`, `p1216x832`, `p1280x768`, `p1344x704`, `p1344x768`, `p1408x704`, `p1472x704`, `p1536x640`, `p1600x640`, `p1664x576`, `p1728x576`, `p1792x576`, `p1856x512`, `p1920x512`, `p1984x512`, … (+20 more)

Bucket distribution:

| bucket_id | target_w×h | aspect | count |
| --- | --- | --- | --- |
| p1216x832 | 608×416 | 1.462 | 535,075 |
| p832x1216 | 416×608 | 0.684 | 226,268 |
| p1024x1024 | 512×512 | 1.000 | 218,348 |
| p1152x832 | 576×416 | 1.385 | 167,817 |
| p1344x768 | 672×384 | 1.750 | 138,497 |
| p832x1152 | 416×576 | 0.722 | 126,255 |
| p896x1152 | 448×576 | 0.778 | 122,382 |
| p1280x768 | 640×384 | 1.667 | 90,974 |
| p1152x896 | 576×448 | 1.286 | 90,688 |
| p896x1088 | 448×544 | 0.824 | 62,608 |
| p1088x896 | 544×448 | 1.214 | 44,030 |
| p960x1088 | 480×544 | 0.882 | 26,534 |
| p1088x960 | 544×480 | 1.133 | 23,278 |
| p1344x704 | 672×352 | 1.909 | 20,917 |
| p960x1024 | 480×512 | 0.938 | 18,569 |
| p768x1280 | 384×640 | 0.600 | 18,330 |
| p1024x960 | 512×480 | 1.067 | 17,845 |
| p1408x704 | 704×352 | 2.000 | 15,004 |
| p768x1344 | 384×672 | 0.571 | 10,823 |
| p1472x704 | 736×352 | 2.091 | 7,064 |
| p1536x640 | 768×320 | 2.400 | 6,159 |
| p704x1408 | 352×704 | 0.500 | 3,139 |
| p1600x640 | 800×320 | 2.500 | 2,867 |
| p704x1472 | 352×736 | 0.478 | 1,625 |
| p1664x576 | 832×288 | 2.889 | 1,236 |
| p1728x576 | 864×288 | 3.000 | 861 |
| p1792x576 | 896×288 | 3.111 | 626 |
| p640x1536 | 320×768 | 0.417 | 507 |
| p640x1600 | 320×800 | 0.400 | 343 |
| p1856x512 | 928×256 | 3.625 | 282 |
| p512x2048 | 256×1024 | 0.250 | 251 |
| p576x1664 | 288×832 | 0.346 | 214 |
| p576x1792 | 288×896 | 0.321 | 120 |
| p576x1728 | 288×864 | 0.333 | 100 |
| p2048x512 | 1024×256 | 4.000 | 86 |
| p512x1856 | 256×928 | 0.276 | 82 |
| p1920x512 | 960×256 | 3.750 | 39 |
| p1984x512 | 992×256 | 3.875 | 27 |
| p512x1984 | 256×992 | 0.258 | 26 |
| p512x1920 | 256×960 | 0.267 | 12 |

## Caption selection (waterfall)

Captions are selected from `dataset.sqlite` using the first matching selector (highest priority wins). Within the same selector, the newest caption source is preferred.

Caption provenance:

- Per-sample `<key>.json` includes `caption_source_id` (int, from `dataset.sqlite`).
- `manifest.json` includes a `caption_sources` table mapping `caption_source_id` → backend/model/created_at plus prompt hashes (not prompt text).
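
The waterfall rule (first matching selector wins; within a selector, the newest source is preferred) can be illustrated with in-memory data. This is a minimal sketch, not the exporter's query against `dataset.sqlite`: the `Caption` record and `select_caption` function are hypothetical, and treating blank text as "no match" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Caption:
    """Hypothetical record for one candidate caption of a single image."""
    variant: str
    source_id: int
    created_at: int  # Unix seconds, as in the caption source table
    text: str


def select_caption(
    candidates: Sequence[Caption], priority: Sequence[str]
) -> Optional[Tuple[int, Caption]]:
    """Return (selector_index, caption) for the highest-priority variant available."""
    for selector_index, variant in enumerate(priority):
        # Assumption: a caption with only whitespace does not count as a match.
        matches = [c for c in candidates if c.variant == variant and c.text.strip()]
        if matches:
            # Within one selector, prefer the most recently created source.
            return selector_index, max(matches, key=lambda c: c.created_at)
    return None  # no selector matched
```

With the priority list `["caption_gemini_2_flash_lite", "caption_mistral_medium_3_1"]`, an image that only has a Mistral caption resolves to selector index 1, which corresponds to the `caption_selector_index` stored in that sample's `.json` file.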

Caption sources used:

| caption_source_id | backend | model | created_at | system_prompt_sha256 | user_prompt_sha256 |
| --- | --- | --- | --- | --- | --- |
| 1 | openrouter | google/gemini-2.0-flash-lite-001 | 1776589694 | 503ff8c1ba9c… | 6b4b2b1dc90b… |
| 2 | openrouter | mistralai/mistral-medium-3.1 | 1776684944 | 503ff8c1ba9c… | 6b4b2b1dc90b… |

Caption priority (waterfall) + planned usage:

| selector_index | variant | backend | model | planned_images |
| --- | --- | --- | --- | --- |
| 0 | caption_gemini_2_flash_lite | openrouter | google/gemini-2.0-flash-lite-001 | 1,999,846 |
| 1 | caption_mistral_medium_3_1 | openrouter | mistralai/mistral-medium-3.1 | 62 |

Available caption variants (top 30):

| selected | variant | images_with_ok_caption |
| --- | --- | --- |
| ✓ | caption_gemini_2_flash_lite | 1,999,846 |
| ✓ | caption_mistral_medium_3_1 | 62 |

Missing caption policy: `empty`

## Export summary

- images_seen: 1,999,908
- images_exported: 1,999,908
- skipped_no_caption: 0
- skipped_too_small: 0
- decode_errors: 0
- encode_errors: 0

## Efficient loading

### Recommended

Treat this as a **webdataset-style** collection of tar shards:

- Prefer **sequential reads** of tar files for throughput.
- Shuffle at the **shard level** (and optionally within-shard) for good randomness without expensive random I/O.
- Use `manifest.json` to list buckets and shards.

#### Python (`webdataset`)

```python
import glob

import webdataset as wds

# WebDataset expands brace patterns but not shell globs, so resolve the shard list first.
# Narrow the pattern (e.g. "buckets/p1024x1024/shard-*.tar") to load a single bucket.
urls = sorted(glob.glob("buckets/*/shard-*.tar"))

ds = (
    wds.WebDataset(urls)
    .decode("pil")  # decodes .jpg to PIL.Image
    .to_tuple("jpg", "txt", "json")
)

for jpg, caption, meta in ds:
    ...
```

#### Python (`tarfile`, no extra deps)

```python
import json
import tarfile
from pathlib import Path

# Open the first shard found under buckets/.
tar_path = next(Path("buckets").rglob("shard-*.tar"))
with tarfile.open(tar_path, "r") as tf:
    for m in tf.getmembers():
        if not m.name.endswith(".txt"):
            continue
        key = m.name[:-4]  # strip ".txt" to get the sample key
        caption = tf.extractfile(m).read().decode("utf-8").strip()
        meta = json.loads(tf.extractfile(tf.getmember(key + ".json")).read().decode("utf-8"))
        jpg_bytes = tf.extractfile(tf.getmember(key + ".jpg")).read()
        ...
```
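
Shard-level shuffling, as recommended above, can be done with the standard library alone. The sketch below relies only on the documented `buckets/<bucket_id>/shard-*.tar` layout rather than on `manifest.json`, whose shard-listing keys are not spelled out in this card; `list_shards_by_bucket` and `shuffled_shards` are hypothetical helper names.

```python
import random
from collections import defaultdict
from pathlib import Path


def list_shards_by_bucket(root):
    """Map each bucket_id to its sorted shard paths under root/buckets/."""
    shards = defaultdict(list)
    for path in sorted(Path(root).glob("buckets/*/shard-*.tar")):
        shards[path.parent.name].append(path)  # parent directory name is the bucket_id
    return dict(shards)


def shuffled_shards(shards_by_bucket, seed):
    """Return all shards in a seeded random order; each shard is still read sequentially."""
    rng = random.Random(seed)
    all_shards = [p for bucket in sorted(shards_by_bucket) for p in shards_by_bucket[bucket]]
    rng.shuffle(all_shards)
    return all_shards
```

Changing the seed each epoch gives a new shard order while keeping I/O sequential within a shard. Because every shard belongs to exactly one bucket, all samples read from one shard share the same target size, which keeps batches shape-consistent if they are formed per shard.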
Provider: data-archetype