vcbe123/MultiID-2M
Source: Hugging Face. Updated 2026-01-20; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/vcbe123/MultiID-2M
---
license: other
task_categories:
- text-to-image
license_name: multiid-2m
license_link: LICENSE.md
language:
- en
size_categories:
- 1M<n<10M
tags:
- face-generation
- identity-preserving
- diffusion
- controllable-generation
- multi-person
---
# MultiID-2M
[Paper](https://arxiv.org/abs/2510.14975) | [Project Page](https://doby-xu.github.io/WithAnyone/) | [Model](https://huggingface.co/WithAnyone/WithAnyone) | [MultiID-2M](https://huggingface.co/datasets/WithAnyone/MultiID-2M) | [MultiID-Bench](https://huggingface.co/datasets/WithAnyone/MultiID-Bench) | [GitHub](https://github.com/Doby-Xu/WithAnyone)
<p align="center">
<img src="https://github.com/Doby-Xu/WithAnyone/blob/main/assets/withanyone.gif?raw=true" alt="WithAnyone in action" width="800"/>
</p>
This repository contains the **MultiID-2M** dataset, a large-scale paired dataset specifically constructed for multi-person scenarios in identity-consistent image generation. It provides diverse references for each identity, enabling the development of advanced diffusion-based models like WithAnyone, which aim to mitigate "copy-paste" artifacts and improve controllability over pose and expression in generated images.
- **Paper:** [WithAnyone: Towards Controllable and ID Consistent Image Generation](https://huggingface.co/papers/2510.14975)
- **Code:** [https://github.com/Doby-Xu/WithAnyone](https://github.com/Doby-Xu/WithAnyone)
- **Project Page:** [https://doby-xu.github.io/WithAnyone/](https://doby-xu.github.io/WithAnyone/)
## Paper Abstract
The paper's abstract is reproduced below:
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
| <img src="assets/stat1.jpg" width="100%"> | <img src="assets/stat2.jpg" width="83%"> |
|:--:|:--:|
## Download
Currently, 1M images and their metadata are available for download.
[HuggingFace Dataset](https://huggingface.co/datasets/WithAnyone/MultiID-2M)
## File Structure
```
MultiID-2M/
├── ref/
│   ├── cluster_centers.tar
│   └── tars/               # reference tars
│       ├── ...
│
├── train_rec/              # reconstruction training data
│   ├── re_000000.tar
│   ├── re_000001.tar
│   └── ...
│
└── train_cp/               # identifiable paired data
    ├── re_000000.tar
    ├── re_000001.tar
    └── ...
```
- `ref/cluster_centers.tar`: Contains the cluster centers of all identifiable identities in the dataset.
- `ref/tars`: Contains the reference images for each identifiable identity.
- `train_cp`: Contains training images of identifiable identities only.
- `train_rec`: Contains training images of both identifiable and unidentifiable identities.
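Because the folders above are independent, it can be convenient to fetch only some of them. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that the repository layout matches the tree above; the helper names `subset_patterns` and `download_subsets` are hypothetical and not part of the project.

```python
from typing import List

# Top-level folders taken from the file structure listed in this card.
KNOWN_SUBSETS = ("ref", "train_rec", "train_cp")


def subset_patterns(subsets: List[str]) -> List[str]:
    """Turn folder names into glob patterns that match everything under them."""
    unknown = [s for s in subsets if s not in KNOWN_SUBSETS]
    if unknown:
        raise ValueError(f"unknown subset(s): {unknown}")
    return [f"{s}/*" for s in subsets]


def download_subsets(subsets: List[str], local_dir: str) -> str:
    """Download only the chosen folders and return the local snapshot path."""
    # Imported lazily so subset_patterns works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="WithAnyone/MultiID-2M",
        repo_type="dataset",
        allow_patterns=subset_patterns(subsets),
        local_dir=local_dir,
    )
```

For example, `download_subsets(["ref", "train_cp"], "./MultiID-2M")` would skip `train_rec` entirely, which may help when only the identifiable paired data is needed.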
## Labels
The dataset provides dense labels for each image, including:
- `url`: The source URL of the original image.
- `ram_score`: Scores from the Recognize Anything Model (RAM).
- `bboxes`: Bounding boxes of detected faces.
- `aesthetics_score`: Aesthetic score of the image.
- `caption_en`: English caption generated by VLMs.
- `name`: ID number of the identifiable identity (`none` if the identity is not identifiable).
- `embeddings` (or `embedding`): Face embeddings extracted with the ArcFace antelopev2 model. The embeddings correspond to the entries in `bboxes`.
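The labels above can be read with the Python standard library alone. The sketch below rests on an assumption that this card does not confirm: that each tar shard stores one `.json` file per image holding these fields. The function names are hypothetical, and the storage format should be checked against a downloaded shard before relying on this.

```python
import json
import tarfile
from typing import Dict, Iterator, List, Tuple


def iter_labels(tar_path: str) -> Iterator[Dict]:
    """Yield the parsed label record of every `.json` member in a shard.

    ASSUMPTION: labels are stored as one JSON file per image inside the tar.
    """
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".json"):
                handle = tar.extractfile(member)
                if handle is not None:
                    yield json.load(handle)


def is_identifiable(record: Dict) -> bool:
    """True when `name` carries an identity ID rather than `none`."""
    name = record.get("name")
    return name is not None and str(name).lower() != "none"


def faces(record: Dict) -> List[Tuple[list, list]]:
    """Pair each face bbox with its embedding, in the order they are stored."""
    # The card lists the key as `embeddings` (or `embedding`), so try both.
    embeddings = record.get("embeddings", record.get("embedding", []))
    return list(zip(record.get("bboxes", []), embeddings))
```

A typical loop would be `for rec in iter_labels("train_cp/re_000000.tar"): ...`, keeping only records where `is_identifiable(rec)` is true.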
## Sample Usage
This section provides instructions for quickly getting started with the `WithAnyone` model, which can be trained using this dataset.
### Requirements
Use `pip install -r requirements.txt` to install the necessary packages.
### Gradio Demo
The Gradio GUI demo is a good starting point to experiment with WithAnyone. Run it with:
```bash
python gradio_app.py --flux_path <path to flux1-dev directory> --ipa_path <path to withanyone directory> \
--clip_path <path to clip-vit-large-patch14> \
--t5_path <path to xflux_text_encoders> \
--siglip_path <path to siglip-base-patch16-256-i18n> \
--model_type "flux-dev" # or "flux-kontext" for WithAnyone.K
```
❗ WithAnyone requires face bounding boxes (bboxes) that indicate where the faces should appear. You can provide them in one of the following ways:
1. Upload an example image with desired face locations in `Mask Configuration (Option 1: Automatic)`. The face bboxes will be extracted automatically, and faces will be generated in the same locations. Do not worry if the given image has a different resolution or aspect ratio; the face bboxes will be resized accordingly.
2. Input face bboxes directly in `Mask Configuration (Option 2: Manual)`. The format is `x1,y1,x2,y2` for each face, one per line.
3. <span style="color: #999;">(NOT recommended) leave both options empty, and the face bboxes will be randomly chosen from a pre-defined set. </span>
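To illustrate the manual format and the resizing mentioned in option 1, the sketch below scales boxes from an example image's resolution to a target resolution and prints them as `x1,y1,x2,y2`, one per line. It assumes independent scaling along each axis, which may differ from what the demo does internally; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def rescale_bboxes(
    boxes: Sequence[Sequence[float]],
    src_size: Tuple[int, int],
    dst_size: Tuple[int, int],
) -> List[Box]:
    """Scale (x1, y1, x2, y2) boxes from a source (width, height) to a target one."""
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    return [
        (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))
        for x1, y1, x2, y2 in boxes
    ]


def format_bboxes(boxes: Sequence[Sequence[int]]) -> str:
    """Render boxes as `x1,y1,x2,y2`, one face per line."""
    return "\n".join(",".join(str(v) for v in box) for box in boxes)
```

For instance, `format_bboxes(rescale_bboxes([(200, 100, 400, 300)], (1024, 1024), (512, 512)))` yields `100,50,200,150`, which can be pasted into the manual field.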
⭕ WithAnyone works well with LoRA. If you have any stylized LoRA checkpoints, use `--additional_lora_ckpt <path to lora checkpoint>` when launching the demo. The LoRA will be merged into the diffusion model.
```bash
python gradio_app.py --flux_path <path to flux1-dev directory> --ipa_path <path to withanyone directory> \
--additional_lora_ckpt <path to lora checkpoint> \
--lora_scale 0.8 # adjust the weight as needed
```
### Batch Inference
You can use `infer_withanyone.py` for batch inference. The script supports generating multiple images with MultiID-Bench.
First, download MultiID-Bench:
```bash
huggingface-cli download WithAnyone/MultiID-Bench --repo-type dataset --local-dir <path to MultiID-Bench directory>
```
Then convert the parquet file into a folder of images and a JSON file using `MultiID_Bench/parquet2bench.py`:
```bash
python MultiID_Bench/parquet2bench.py --parquet <path to parquet file> --output_dir <path to output directory>
```
You will get a folder with the following structure:
```
<output_dir>/
├── p1/untar
├── p2/untar
├── p3/
├── p1.json
├── p2.json
└── p3.json
```
Then run batch inference with:
```bash
# --use_matting:   set to True when siglip_weight > 0.0
# --siglip_weight: Resemblance in Spirit vs Resemblance in Form; higher means more similar to the reference
# --id_weight:     usually set to 1 - siglip_weight; higher means more controllable
python infer_withanyone.py \
    --eval_json_path <path to MultiID-Bench subset json> \
    --data_root <path to MultiID-Bench subset images> \
    --save_path <path to save results> \
    --use_matting True \
    --siglip_weight 0.0 \
    --id_weight 1.0 \
    --t5_path <path to xflux_text_encoders> \
    --clip_path <path to clip-vit-large-patch14> \
    --ipa_path <path to withanyone> \
    --flux_path <path to flux1-dev>
```
Here, `data_root` should be `p1/untar`, `p2/untar`, or `p3/`, depending on which subset you want to evaluate, and `eval_json_path` should be the matching JSON file produced from the parquet file.
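The pairing of subset, image folder, and JSON file can be sketched as a small helper that assembles the argument list. It uses only flags shown in the command above, and it assumes that the weight guidance means `id_weight = 1 - siglip_weight` with matting enabled whenever `siglip_weight > 0.0`. The helper name is hypothetical, and the model path flags (`--t5_path`, `--clip_path`, `--ipa_path`, `--flux_path`) still have to be appended by the caller.

```python
from pathlib import Path
from typing import List

# Image folder of each MultiID-Bench subset, relative to the converted output directory.
SUBSET_IMAGE_DIRS = {"p1": "p1/untar", "p2": "p2/untar", "p3": "p3"}


def build_infer_args(
    bench_dir: str, subset: str, save_path: str, siglip_weight: float = 0.0
) -> List[str]:
    """Assemble the subset-dependent arguments for infer_withanyone.py."""
    if subset not in SUBSET_IMAGE_DIRS:
        raise ValueError(f"subset must be one of {sorted(SUBSET_IMAGE_DIRS)}")
    if not 0.0 <= siglip_weight <= 1.0:
        raise ValueError("siglip_weight is expected to lie in [0, 1]")
    root = Path(bench_dir)
    return [
        "python", "infer_withanyone.py",
        "--eval_json_path", (root / f"{subset}.json").as_posix(),
        "--data_root", (root / SUBSET_IMAGE_DIRS[subset]).as_posix(),
        "--save_path", save_path,
        # ASSUMPTION: matting is only needed when the SigLIP branch is active.
        "--use_matting", str(siglip_weight > 0.0),
        "--siglip_weight", str(siglip_weight),
        # ASSUMPTION: the two weights are complementary.
        "--id_weight", str(1.0 - siglip_weight),
    ]
```

The returned list can be extended with the model path flags and passed to `subprocess.run`, so looping over `p1`, `p2`, and `p3` covers all three subsets.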
### Face Edit with FLUX.1 Kontext
You can use `gradio_edit.py` for face editing with FLUX.1 Kontext and WithAnyone.Ke.
```bash
python gradio_edit.py --flux_path <path to flux1-dev directory> --ipa_path <path to withanyone directory> \
--clip_path <path to clip-vit-large-patch14> \
--t5_path <path to xflux_text_encoders> \
--siglip_path <path to siglip-base-patch16-256-i18n> \
--model_type "flux-kontext"
```
## License and Disclaimer
This dataset is provided for non-commercial academic research purposes only. By accessing or using this dataset you agree to the terms in the [LICENSE](./LICENSE.md).
- **No ownership claim**: The project does not claim ownership of the original images, metadata, or other content included in this dataset. Copyright and other rights remain with the original rights holders.
- **User responsibility**: Users are responsible for ensuring their use of the dataset complies with all applicable laws, regulations, and third‑party terms (including platform policies).
- **Takedown / correction requests**: If a rights holder believes content in this dataset infringes their rights, please submit a removal or correction request via the [HuggingFace dataset page](https://huggingface.co/datasets/WithAnyone/MultiID-2M) or the [project page](https://doby-xu.github.io/WithAnyone/), including sufficient proof of ownership and specific identifiers/URLs. After verification of a valid claim, we will remove or correct the affected items as soon as reasonably practicable.
- **No warranty; limitation of liability**: The dataset is provided "as is" without warranties of any kind. The project and maintainers disclaim liability for any direct, indirect, incidental, or consequential damages arising from use of the dataset.
- **Prohibited commercial use**: Commercial use is prohibited unless you obtain separate permission from the dataset maintainers; unauthorized commercial use may result in legal liability.
- **Contact**: Use the HuggingFace dataset page or the project website to submit requests or questions.
Provider: vcbe123



