data-archetype/imagenet_22k_512_bucketable
Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/data-archetype/imagenet_22k_512_bucketable
Resource description:
---
pretty_name: "ImageNet-22k 512-Bucketable Captioned Subset"
license: other
license_name: imagenet
license_link: https://www.image-net.org/download.php
task_categories:
- text-to-image
language:
- en
tags:
- imagenet
- webdataset
- images
- captions
- bucketed-shards
---
# ImageNet-22k 512-Bucketable Captioned Subset
This dataset is a pre-bucketed, captioned subset of [`timm/imagenet-22k-wds`](https://huggingface.co/datasets/timm/imagenet-22k-wds).
It is intended for text-to-image training and similar workflows that want images already grouped into aspect-ratio buckets near a 512-base training resolution. Images were kept only if they could fit one of the target buckets without upsampling after deterministic resize and crop.
## Summary
- Source: `timm/imagenet-22k-wds` (`fall11` ImageNet-22k WebDataset copy)
- Source coverage scanned: `train + validation`
- Source size scanned: `14,146,391` samples across `4,608` source tar archives
- Final export: `1,175,382` samples across `1,170` uncompressed tar shards
- Base resolution: `512`
- Bucket family: SDXL-style 1024-base proto buckets scaled to 512 with `divisible=32`
- Captions: complete coverage
  - `1,174,216` from `google/gemini-2.5-flash-lite`
  - `1,166` from `mistralai/ministral-14b-2512`
## What This Dataset Is
This is not a raw ImageNet mirror. It is a filtered export designed for training pipelines that want:
- aspect-ratio-bucketed images with a pixel area of roughly `512^2`
- no runtime upsampling
- one caption per sample already embedded in the shard
- WebDataset-style tar shards plus per-sample metadata
The export keeps images that survive the target bucket policy and drops images that would need upsampling to reach the bucket target.
## Filtering And Processing
Each retained sample was processed deterministically:
1. EXIF transpose
2. Convert to RGB
3. Bicubic cover-resize with antialiasing
4. Drop if the sample would require upsampling
5. Corner crop to the bucket target size
6. Re-encode as JPEG
Export settings:
- JPEG quality: `95`
- Subsampling policy: `adaptive_scale`
- Adaptive threshold: `0.85`
- Crop strategy: `corner`
- Allowed corners: bottom-left / bottom-right (`[2, 3]`)
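The geometry behind steps 3 to 5 can be sketched without any image library. This is a minimal illustration, not the exporter's actual code: the function name `plan_resize_and_crop`, the ceiling rounding of the non-binding side, and passing the corner in as an argument are all assumptions. The card only states that corner `2` is bottom-left and `3` is bottom-right, not how a corner is chosen per sample.

```python
from typing import Optional, Tuple

Size = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def plan_resize_and_crop(
    src_size: Size, target_size: Size, corner: int
) -> Optional[Tuple[Size, Box]]:
    """Return (resized_size, crop_box), or None if the sample must be dropped."""
    src_w, src_h = src_size
    tgt_w, tgt_h = target_size

    # A cover-resize uses scale = max(tgt_w / src_w, tgt_h / src_h).
    # That scale exceeds 1 exactly when either target side exceeds the source.
    if tgt_w > src_w or tgt_h > src_h:
        return None  # would require upsampling

    if tgt_w * src_h >= tgt_h * src_w:
        # Width is the binding side; height is rounded up so it still covers.
        new_w = tgt_w
        new_h = -(-src_h * tgt_w // src_w)  # ceiling division (assumed rounding)
    else:
        new_h = tgt_h
        new_w = -(-src_w * tgt_h // src_h)

    if corner == 2:  # bottom-left
        left = 0
    elif corner == 3:  # bottom-right
        left = new_w - tgt_w
    else:
        raise ValueError("only corners 2 and 3 are allowed by the export settings")
    top = new_h - tgt_h  # both allowed corners are anchored to the bottom edge

    return (new_w, new_h), (left, top, left + tgt_w, top + tgt_h)
```

With Pillow, the returned values would plug into `Image.resize(resized_size, Image.Resampling.BICUBIC)` followed by `Image.crop(crop_box)`, after `ImageOps.exif_transpose` and `convert("RGB")`.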
Additional cleanup applied after export:
- exact duplicate source-byte images were deduplicated by SHA-256, keeping the first occurrence
  - `120,179` duplicate samples were removed
- a small number of obvious `"image not available"` / heavy-overlay placeholder images were removed manually
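The SHA-256 deduplication described above can be approximated as follows. This is a sketch under assumptions: samples are modeled as hypothetical `(key, source_bytes)` pairs, and "first occurrence" is taken to mean first in iteration order; the card does not document the actual scan order.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedupe_by_sha256(
    samples: Iterable[Tuple[str, bytes]]
) -> Iterator[Tuple[str, bytes]]:
    """Yield samples whose source bytes have not been seen before."""
    seen = set()
    for key, data in samples:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # exact byte-level duplicate of an earlier sample
        seen.add(digest)
        yield key, data
```

Because the hash is taken over the source bytes, this only catches exact duplicates; re-encoded or resized copies of the same picture would pass through.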
## Buckets
Buckets follow the SDXL-style proto bucket set at a 1024 base, scaled to a 512 base resolution.
Examples:
- `p1024x1024` -> `512x512`
- `p1152x832` -> `576x416`
- `p1216x832` -> `608x416`
- `p832x1152` -> `416x576`
- `p1280x768` -> `640x384`
- `p2048x512` -> `1024x256`
The full bucket list and exact per-bucket counts are in [`manifest.json`](./manifest.json).
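The mapping from a proto bucket id to its 512-base target size can be reproduced with a small helper. This is a sketch: `scale_bucket` is a hypothetical name, and rounding to the nearest multiple of 32 is an assumption about how `divisible=32` is applied. For the examples listed above, plain halving already lands on multiples of 32, so the rounding rule makes no difference there.

```python
import re
from typing import Tuple


def scale_bucket(
    bucket_id: str, base: int = 512, proto_base: int = 1024, divisible: int = 32
) -> Tuple[int, int]:
    """Map a proto bucket id such as 'p1152x832' to its target (width, height)."""
    match = re.fullmatch(r"p(\d+)x(\d+)", bucket_id)
    if match is None:
        raise ValueError(f"unrecognized bucket id: {bucket_id!r}")
    proto_w, proto_h = int(match.group(1)), int(match.group(2))

    def scale(side: int) -> int:
        # Assumed rule: scale by base / proto_base, then snap to `divisible`.
        scaled = side * base / proto_base
        return max(divisible, round(scaled / divisible) * divisible)

    return scale(proto_w), scale(proto_h)
```

For authoritative target sizes, prefer the values recorded in `manifest.json` over recomputing them.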
Largest buckets:
| bucket_id | target_w×h | count |
| --- | --- | ---: |
| `p1152x832` | `576x416` | 454,063 |
| `p1216x832` | `608x416` | 170,875 |
| `p832x1152` | `416x576` | 114,052 |
| `p1152x896` | `576x448` | 74,243 |
| `p832x1216` | `416x608` | 60,694 |
| `p1024x1024` | `512x512` | 47,420 |
## Captions
Captions were written after import into a sister SQLite workspace, then applied back into the shards with the following priority:
1. `caption_gemini_2_5_flash_lite`
2. `caption_ministral_14b_2512`
Every exported sample has a selected caption.
Per-sample metadata stores:
- `caption_variant`
- `caption_selector_index`
- `caption_source_id`
[`manifest.json`](./manifest.json) includes the `caption_sources` table for caption provenance.
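The priority rule amounts to "take the first non-empty caption in a fixed order". A minimal sketch, assuming each workspace row is available as a mapping keyed by the two caption column names above; the function `select_caption` and the returned index are illustrative and are not confirmed to match how `caption_selector_index` is actually computed.

```python
from typing import Mapping, Optional, Tuple

# Priority order as listed in this card.
CAPTION_PRIORITY = (
    "caption_gemini_2_5_flash_lite",
    "caption_ministral_14b_2512",
)


def select_caption(
    row: Mapping[str, Optional[str]]
) -> Optional[Tuple[int, str, str]]:
    """Return (priority_index, variant_name, caption) for the first usable caption."""
    for index, variant in enumerate(CAPTION_PRIORITY):
        text = (row.get(variant) or "").strip()
        if text:
            return index, variant, text
    return None  # no caption available for this row
```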
## Format
This repository uses the `bucketed_shards_v1` format.
Layout:
- `manifest.json`
- `buckets/<bucket_id>/shard-*.tar`
Each tar shard contains three files per sample:
- `<key>.jpg`
- `<key>.txt`
- `<key>.json`
Per-sample JSON includes bucket/export fields plus source metadata such as:
- target size and bucket id
- source split / archive / member name
- ImageNet class metadata (`class_id`, `label`, `label_12k`, `class_name`)
- caption provenance fields
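For pipelines that do not use the `webdataset` package, a shard can be read with the standard library alone. The sketch below assumes the three files of a sample are stored consecutively and share everything before the first dot of the file name as their key, which is the usual WebDataset convention; `iter_samples` is a hypothetical helper name.

```python
import json
import tarfile
from typing import Dict, Iterator


def iter_samples(tar: tarfile.TarFile) -> Iterator[Dict[str, object]]:
    """Group consecutive tar members by key and yield one dict per sample."""
    current_key = None
    sample: Dict[str, object] = {}
    for member in tar:
        if not member.isfile():
            continue
        directory, _, basename = member.name.rpartition("/")
        stem, _, ext = basename.partition(".")
        key = f"{directory}/{stem}" if directory else stem
        if key != current_key:
            if sample:
                yield sample  # previous sample is complete
            current_key, sample = key, {"__key__": key}
        data = tar.extractfile(member).read()
        if ext == "txt":
            sample["txt"] = data.decode("utf-8")  # caption text
        elif ext == "json":
            sample["json"] = json.loads(data)  # per-sample metadata
        else:
            sample[ext] = data  # e.g. raw JPEG bytes under "jpg"
    if sample:
        yield sample
```

Usage would look like `with tarfile.open(path) as tar: for sample in iter_samples(tar): ...`, where `path` points at one `shard-*.tar` file.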
## Loading
Recommended usage is sequential tar reading or WebDataset-style loading.
```python
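import glob

import webdataset as wds

# WebDataset does not expand "*" globs itself, so list the shards first.
shards = sorted(glob.glob("buckets/*/shard-*.tar"))

ds = (
    wds.WebDataset(shards)
    .decode("pil")
    .to_tuple("jpg", "txt", "json")
)

for image, caption, meta in ds:
    ...  # image: PIL image, caption: str, meta: dict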
```
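Since every bucket has its own image size, training loops usually draw each batch from a single bucket. One way to prepare for that is to group shard paths by their bucket directory, following the `buckets/<bucket_id>/shard-*.tar` layout described above. This is a sketch with a hypothetical helper name, not part of the dataset's tooling.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def group_shards_by_bucket(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Map each bucket id to the sorted list of its shard paths."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(paths):
        # The bucket id is the name of the directory that holds the shard.
        bucket_id = PurePosixPath(path).parent.name
        groups[bucket_id].append(path)
    return dict(groups)
```

Each resulting list can then be passed to its own `wds.WebDataset(...)` so that all images in a batch share one target size.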
## Source And License
This export is derived from:
- source dataset: [`timm/imagenet-22k-wds`](https://huggingface.co/datasets/timm/imagenet-22k-wds)
- upstream homepage: <https://www.image-net.org/>
This dataset inherits the original ImageNet access terms. The upstream dataset card lists the license as `imagenet` and links to the ImageNet download / terms page:
- <https://www.image-net.org/download.php>
In practice, this means the data is generally restricted to non-commercial research and educational use under the ImageNet terms. Review the upstream terms yourself before uploading, sharing, or using this dataset.
## Export Metadata
- Created: `2026-04-12T20:26:43.583182+00:00`
- Export ID: `7fc009d81fee48be`
- Format: `bucketed_shards_v1`
- Image mode: `reencode_jpeg`
For exact machine-readable details, use [`manifest.json`](./manifest.json).
Provider:
data-archetype



