
data-archetype/imagenet_22k_512_bucketable

Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/data-archetype/imagenet_22k_512_bucketable
Description:
---
pretty_name: "ImageNet-22k 512-Bucketable Captioned Subset"
license: other
license_name: imagenet
license_link: https://www.image-net.org/download.php
task_categories:
- text-to-image
language:
- en
tags:
- imagenet
- webdataset
- images
- captions
- bucketed-shards
---

# ImageNet-22k 512-Bucketable Captioned Subset

This dataset is a pre-bucketed, captioned subset of [`timm/imagenet-22k-wds`](https://huggingface.co/datasets/timm/imagenet-22k-wds).

It is intended for text-to-image training and similar workflows that want images already grouped into aspect-ratio buckets near a 512-base training resolution. Images were kept only if they could fit one of the target buckets without upsampling after deterministic resize and crop.

## Summary

- Source: `timm/imagenet-22k-wds` (`fall11` ImageNet-22k WebDataset copy)
- Source coverage scanned: `train + validation`
- Source size scanned: `14,146,391` samples across `4,608` source tar archives
- Final export: `1,175,382` samples across `1,170` uncompressed tar shards
- Base resolution: `512`
- Bucket family: SDXL-style 1024-base proto buckets scaled to 512 with `divisible=32`
- Captions: complete coverage
  - `1,174,216` from `google/gemini-2.5-flash-lite`
  - `1,166` from `mistralai/ministral-14b-2512`

## What This Dataset Is

This is not a raw ImageNet mirror. It is a filtered export designed for training pipelines that want:

- aspect-ratio bucketed images at roughly `512^2` scale
- no runtime upsampling
- one caption per sample already embedded in the shard
- WebDataset-style tar shards plus per-sample metadata

The export keeps images that survive the target bucket policy and drops images that would need upsampling to reach the bucket target.

## Filtering And Processing

Each retained sample was processed deterministically:

1. EXIF transpose
2. Convert to RGB
3. Bicubic cover-resize with antialiasing
4. Drop if the sample would require upsampling
5. Corner crop to the bucket target size
6. Re-encode as JPEG

Export settings:

- JPEG quality: `95`
- Subsampling policy: `adaptive_scale`
- Adaptive threshold: `0.85`
- Crop strategy: `corner`
- Allowed corners: bottom-left / bottom-right (`[2, 3]`)

Additional cleanup applied after export:

- exact duplicate source-byte images were deduplicated by SHA-256, keeping the first occurrence
- `120,179` duplicate samples were removed
- a small number of obvious `"image not available"` / heavy-overlay placeholder images were removed manually

## Buckets

Buckets follow the SDXL-style proto bucket set at a 1024 base, scaled to a 512 base resolution.

Examples:

- `p1024x1024` -> `512x512`
- `p1152x832` -> `576x416`
- `p1216x832` -> `608x416`
- `p832x1152` -> `416x576`
- `p1280x768` -> `640x384`
- `p2048x512` -> `1024x256`

The full bucket list and exact per-bucket counts are in [`manifest.json`](./manifest.json).

Largest buckets:

| bucket_id | target_w×h | count |
| --- | --- | ---: |
| `p1152x832` | `576x416` | 454,063 |
| `p1216x832` | `608x416` | 170,875 |
| `p832x1152` | `416x576` | 114,052 |
| `p1152x896` | `576x448` | 74,243 |
| `p832x1216` | `416x608` | 60,694 |
| `p1024x1024` | `512x512` | 47,420 |

## Captions

Captions were written after import into a sister SQLite workspace, then applied back into the shards with the following priority:

1. `caption_gemini_2_5_flash_lite`
2. `caption_ministral_14b_2512`

Every exported sample has a selected caption. Per-sample metadata stores:

- `caption_variant`
- `caption_selector_index`
- `caption_source_id`

[`manifest.json`](./manifest.json) includes the `caption_sources` table for caption provenance.

## Format

This repository uses the `bucketed_shards_v1` format.

Layout:

- `manifest.json`
- `buckets/<bucket_id>/shard-*.tar`

Each tar shard contains three files per sample:

- `<key>.jpg`
- `<key>.txt`
- `<key>.json`

Per-sample JSON includes bucket/export fields plus source metadata such as:

- target size and bucket id
- source split / archive / member name
- ImageNet class metadata (`class_id`, `label`, `label_12k`, `class_name`)
- caption provenance fields

## Loading

Recommended usage is sequential tar reading or WebDataset-style loading.

```python
import glob

import webdataset as wds

# WebDataset expands brace patterns rather than shell globs,
# so resolve the shard paths explicitly before passing them in.
shards = sorted(glob.glob("buckets/*/shard-*.tar"))

ds = (
    wds.WebDataset(shards)
    .decode("pil")                   # decode .jpg members into PIL images
    .to_tuple("jpg", "txt", "json")  # (image, caption, per-sample metadata)
)

for image, caption, meta in ds:
    ...
```

## Source And License

This export is derived from:

- source dataset: [`timm/imagenet-22k-wds`](https://huggingface.co/datasets/timm/imagenet-22k-wds)
- upstream homepage: <https://www.image-net.org/>

This dataset inherits the original ImageNet access terms. The upstream dataset card lists the license as `imagenet` and links to the ImageNet download / terms page:

- <https://www.image-net.org/download.php>

In practice, this means the data is generally restricted to non-commercial research and educational use under the ImageNet terms. Review the upstream terms yourself before uploading, sharing, or using this dataset.

## Export Metadata

- Created: `2026-04-12T20:26:43.583182+00:00`
- Export ID: `7fc009d81fee48be`
- Format: `bucketed_shards_v1`
- Image mode: `reencode_jpeg`

For exact machine-readable details, use [`manifest.json`](./manifest.json).
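
The bucket scaling and the no-upsampling rule described in the card can be sketched in a few lines. This is a minimal illustration, not the exporter's actual code: the function names are hypothetical, the bucket list contains only the example buckets quoted in the card (the full list is in `manifest.json`), and the nearest-aspect-ratio assignment policy is an assumption, since the card does not state how a bucket is chosen for each image.

```python
import math
from typing import List, Optional, Tuple

BASE_SCALE = 512 / 1024  # proto buckets are defined at a 1024 base
DIVISIBLE = 32           # target sides are multiples of 32

# Example proto buckets quoted in the card (not the full list).
PROTO_BUCKETS = [(1024, 1024), (1152, 832), (1216, 832),
                 (832, 1152), (1280, 768), (2048, 512)]


def scale_bucket(proto_w: int, proto_h: int) -> Tuple[int, int]:
    """Scale a 1024-base proto bucket to the 512 base, snapped to multiples of 32."""
    def snap(v: int) -> int:
        return int(round(v * BASE_SCALE / DIVISIBLE)) * DIVISIBLE
    return snap(proto_w), snap(proto_h)


def cover_scale(src_w: int, src_h: int, tgt_w: int, tgt_h: int) -> float:
    """Resize factor at which the source just covers the target on both axes."""
    return max(tgt_w / src_w, tgt_h / src_h)


def assign_bucket(src_w: int, src_h: int,
                  buckets: List[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Assumed policy: nearest aspect ratio among buckets that need no upsampling.

    Returns None when every bucket would require a scale factor above 1,
    which corresponds to the sample being dropped.
    """
    fitting = [b for b in buckets if cover_scale(src_w, src_h, *b) <= 1.0]
    if not fitting:
        return None
    src_log = math.log(src_w / src_h)
    return min(fitting, key=lambda b: abs(math.log(b[0] / b[1]) - src_log))


BUCKETS_512 = [scale_bucket(w, h) for w, h in PROTO_BUCKETS]
```

Under these assumptions, an 800x600 source lands in the `576x416` bucket, while a 500x375 source is dropped because covering any of the listed targets would require upsampling.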
Provider:
data-archetype