Azily/Macro-Dataset

Name: Azily/Macro-Dataset
Creator: Azily
Published: 2026-03-26 13:47:32
License: CC BY 4.0

Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Azily/Macro-Dataset


Description:

---
license: cc-by-4.0
task_categories:
- image-to-image
- text-to-image
language:
- en
tags:
- multi-reference
- image-generation
- customization
- illustration
- spatial
- temporal
- benchmark
pretty_name: "MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data"
size_categories:
- 100K<n<1M
---

# MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

**MACRO** is a large-scale benchmark and training dataset for multi-reference image generation. It covers **four task categories** and **four image-count brackets**, providing both training splits and a curated evaluation benchmark.

## Dataset Summary

| Task | Train samples (per category) | Eval samples (per category) |
|------|------------------------------|-----------------------------|
| **Customization** | 1-3: 20,000 / 4-5: 20,000 / 6-7: 30,000 / ≥8: 30,000 | 250 each |
| **Illustration** | 25,000 each | 250 each |
| **Spatial** | 25,000 each | 250 each |
| **Temporal** | 25,000 each | 250 each |

**Total:** ~400,000 training samples · 4,000 evaluation samples

### Task Categories

| Category | Description |
|----------|-------------|
| **Customization** | Generate images preserving specific subjects (objects, persons, styles) from reference images |
| **Illustration** | Generate illustrations conditioned on multiple reference images |
| **Spatial** | Generate images respecting spatial relationships between objects in references |
| **Temporal** | Generate images reflecting temporal or sequential changes across references |

### Image-Count Brackets

Each task is further split by the number of reference images required:

| Bracket | Reference images |
|---------|-----------------|
| `1-3` | 1 to 3 |
| `4-5` | 4 to 5 |
| `6-7` | 6 to 7 |
| `>=8` | 8 or more |

---

## Repository Contents

This dataset is distributed as a collection of `.tar.gz` archives for efficient download. Each archive can be extracted independently.

### Metadata & Index

| Archive | Contents |
|---------|----------|
| `filter.tar.gz` | `data/filter/`: all JSON index files for train/eval samples (~510 MB uncompressed) |
| `raw_t2i_example.tar.gz` | `data/raw/t2i_example/`: placeholder T2I JSONL + sample images |
| `extract_data.sh` | Shell script to extract all archives back to the original `data/` layout |

### Raw Source Images (`data/raw/customization/`)

Original source images used during data construction, split by subcategory:

| Archive | Contents |
|---------|----------|
| `raw_customization_cloth.tar.gz` | `data/raw/customization/cloth/` + `cloth_train.jsonl` + `cloth_eval.jsonl` |
| `raw_customization_human.tar.gz` | `data/raw/customization/human/` + `human_train.jsonl` + `human_eval.jsonl` |
| `raw_customization_object.tar.gz` | `data/raw/customization/object/` + `object_train.jsonl` + `object_eval.jsonl` |
| `raw_customization_scene.tar.gz` | `data/raw/customization/scene/` + `scene_train.jsonl` + `scene_eval.jsonl` |
| `raw_customization_style.tar.gz` | `data/raw/customization/style/` + `style_train.jsonl` + `style_eval.jsonl` |

### Image Data (`data/final/`)

Each `data/final/{task}/{split}/{category}/` slice is split into chunks of **5,000 sample subdirectories**. Archives follow this naming pattern:

```
final_{task}_{split}_{category}_{start}_{end}.tar.gz
```

where `{start}` and `{end}` are zero-padded 5-digit indices (e.g. `00000_04999`). Each chunk contains both the `data/<subdir>/` image directories **and** the corresponding `json/<subdir>.json` metadata files for that chunk, so every archive is self-contained.

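The naming pattern above can be turned into a small helper that lists the chunk archives to expect for one slice. This is only a sketch: `chunk_archive_names` is a hypothetical function, not part of the dataset tooling, and it assumes that `{start}` and `{end}` are positions within the slice (0 to N-1) and that the last chunk is simply shorter when the slice size is not a multiple of 5,000. The `>=` to `_ge` substitution follows this README's note on filenames. Spatial archives, which carry an extra scene segment, are not handled here.

```bash
#!/usr/bin/env bash
# Sketch: print the chunk archive names expected for one non-spatial slice.
# chunk_archive_names is a hypothetical helper, not shipped with the dataset.
# Usage: chunk_archive_names <task> <split> <category> <num_samples>
chunk_archive_names() {
  local task="$1" split="$2" category="$3" num_samples="$4"
  local chunk_size=5000 encoded start=0 end

  # ">=" is written "_ge" in archive names, so ">=8" becomes "_ge8".
  case "$category" in
    '>='*) encoded="_ge${category#??}" ;;
    *)     encoded="$category" ;;
  esac

  while [ "$start" -lt "$num_samples" ]; do
    end=$(( start + chunk_size - 1 ))
    # Assumption: the final chunk ends at the last sample position.
    if [ "$end" -ge "$num_samples" ]; then
      end=$(( num_samples - 1 ))
    fi
    printf 'final_%s_%s_%s_%05d_%05d.tar.gz\n' \
      "$task" "$split" "$encoded" "$start" "$end"
    start=$(( start + chunk_size ))
  done
}

# Example: list the six expected archives for customization / train / >=8.
chunk_archive_names customization train '>=8' 30000
```

The printed names can be compared against the files actually present in `data_tar/` to spot an incomplete download before extracting.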
For the **spatial** task (which has an extra scene layer: `indoor`, `object`, `outdoor`):

```
final_spatial_{split}_{scene}_{category}_{start}_{end}.tar.gz
```

Examples:

| Archive | Contents |
|---------|----------|
| `final_customization_train_1-3_00000_04999.tar.gz` | First 5,000 samples of `data/final/customization/train/1-3/data/` + `json/` |
| `final_customization_train_1-3_05000_09999.tar.gz` | Next 5,000 samples |
| `final_customization_train__ge8_00000_04999.tar.gz` | First 5,000 samples of `data/final/customization/train/>=8/data/` + `json/` |
| `final_spatial_train_indoor_1-3_00000_04999.tar.gz` | First 5,000 samples of `data/final/spatial/train/indoor/1-3/` |
| `final_temporal_eval_1-3_00000_00499.tar.gz` | All 500 eval samples of `data/final/temporal/eval/1-3/` |

> **Note on `>=8` in filenames:** the `>=` is encoded as `_ge` in archive names, so `>=8` becomes `_ge8`.

---

## Directory Structure (after extraction)

```
data/
├── filter/                          # JSON index files (used for training & eval)
│   ├── customization/
│   │   ├── train/
│   │   │   ├── 1-3/  *.json         # 20,000 training samples
│   │   │   ├── 4-5/  *.json         # 20,000 training samples
│   │   │   ├── 6-7/  *.json         # 30,000 training samples
│   │   │   └── >=8/  *.json         # 30,000 training samples
│   │   └── eval/
│   │       ├── 1-3/  *.json         # 250 eval samples
│   │       ├── 4-5/  *.json         # 250 eval samples
│   │       ├── 6-7/  *.json         # 250 eval samples
│   │       └── >=8/  *.json         # 250 eval samples
│   ├── illustration/  (same layout as customization)
│   ├── spatial/       (same layout as customization)
│   └── temporal/      (same layout as customization)
├── final/                           # Actual image data
│   ├── customization/               # layout: {split}/{cat}/data/ + json/
│   │   ├── train/
│   │   │   ├── 1-3/
│   │   │   │   ├── data/
│   │   │   │   │   ├── 00000000/
│   │   │   │   │   │   ├── image_1.jpg
│   │   │   │   │   │   ├── image_2.jpg  (etc.)
│   │   │   │   │   │   └── image_output.jpg
│   │   │   │   │   └── ...
│   │   │   │   └── json/  *.json    (per-sample generation metadata)
│   │   │   ├── 4-5/  ...
│   │   │   ├── 6-7/  ...
│   │   │   └── >=8/  ...
│   │   └── eval/  ...
│   ├── illustration/  ...           (same layout as customization)
│   ├── spatial/                     # extra scene layer: {split}/{scene}/{cat}/
│   │   ├── train/
│   │   │   ├── indoor/
│   │   │   │   ├── 1-3/  data/ + json/
│   │   │   │   ├── 4-5/  ...
│   │   │   │   ├── 6-7/  ...
│   │   │   │   └── >=8/  ...
│   │   │   ├── object/   ...
│   │   │   └── outdoor/  ...
│   │   └── eval/  ...
│   └── temporal/  ...               (same layout as customization)
└── raw/
    ├── t2i_example/
    │   ├── t2i_example.jsonl        # Placeholder T2I prompts (for training format reference)
    │   └── images/                  # Placeholder images
    └── customization/               # Original source images (customization)
        ├── cloth/  *.jpg
        ├── human/  *.jpg
        ├── object/ *.jpg
        ├── scene/  *.jpg
        ├── style/  *.jpg
        └── *_train.jsonl / *_eval.jsonl
```

---

## JSON Sample Format

Each file in `data/filter/` contains a single JSON object:

```json
{
  "task": "customization",
  "idx": 1,
  "prompt": "Create an image of the modern glass and metal interior from <image 2>, applying the classical oil painting style from <image 1> globally across the entire scene.",
  "input_images": [
    "data/final/customization/train/1-3/data/00022018/image_1.jpg",
    "data/final/customization/train/1-3/data/00022018/image_2.jpg"
  ],
  "output_image": "data/final/customization/train/1-3/data/00022018/image_output.jpg"
}
```

All image paths in the JSON files are **relative to the root of the extracted data directory** (i.e., relative to the parent of `data/`).

---

## Download & Setup

### Download all archives

```bash
huggingface-cli download Azily/Macro-Dataset --repo-type dataset --local-dir data_tar/
```

### Extract

`extract_data.sh` is included in the downloaded `data_tar/` folder. Run it from the project root:

```bash
bash data_tar/extract_data.sh ./data_tar .
# This restores: ./data/filter/, ./data/final/, ./data/raw/
```

Or extract manually:

```bash
for f in data_tar/*.tar.gz; do tar -xzf "$f" -C .; done
```

---

## Selective Download

If you only need the evaluation benchmark (no images), download just `filter.tar.gz`:

```bash
huggingface-cli download Azily/Macro-Dataset \
  --repo-type dataset \
  --include "filter.tar.gz" \
  --local-dir data_tar/
tar -xzf data_tar/filter.tar.gz -C .
```

To download a specific task/split/category (e.g., all chunks of customization train 1-3):

```bash
huggingface-cli download Azily/Macro-Dataset \
  --repo-type dataset \
  --include "final_customization_train_1-3_*.tar.gz" \
  --local-dir data_tar/
for f in data_tar/final_customization_train_1-3_*.tar.gz; do tar -xzf "$f" -C .; done
```

---

## License

This dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.
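
After a selective download and extraction, it can be useful to confirm that the image paths listed in a `data/filter/` JSON file actually exist on disk, since those paths are relative to the parent of `data/`. The following is a minimal sketch under stated assumptions: `check_sample_paths` is a hypothetical helper, not part of the dataset tooling, and it finds paths with a plain text match on quoted strings beginning with `data/final/` rather than with a real JSON parser, so it only suits files shaped like the sample shown in this README.

```bash
#!/usr/bin/env bash
# Sketch: report image paths from one JSON index file that are missing on disk.
# check_sample_paths is a hypothetical helper, not shipped with the dataset.
# Usage: check_sample_paths <index.json> [root]
#   root is the directory that contains data/ (defaults to the current directory).
check_sample_paths() {
  local json_file="$1" root="${2:-.}"
  local missing=0 path paths

  # Crude extraction: every quoted string that starts with "data/final/".
  # Assumes paths contain no escaped quotes; a JSON parser would be more robust.
  paths=$(grep -o '"data/final/[^"]*"' "$json_file" | tr -d '"')

  while IFS= read -r path; do
    [ -n "$path" ] || continue
    if [ ! -f "$root/$path" ]; then
      echo "missing: $path"
      missing=$(( missing + 1 ))
    fi
  done <<EOF
$paths
EOF

  echo "$missing missing file(s)"
  # Succeed only when every referenced file was found.
  [ "$missing" -eq 0 ]
}
```

Called from the project root as `check_sample_paths <some-index-file>.json .`, the function prints one line per missing image and returns a non-zero status if any are absent, which usually means the matching `final_*` archive has not been extracted yet.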

Provided by:

Azily
