ModelScope mirror: https://modelscope.cn/datasets/WithAnyone/MultiID-2M (updated 2025-12-05, indexed 2025-11-03)
# MultiID-2M
[arXiv](https://arxiv.org/abs/2510.14975)
[Project Page](https://doby-xu.github.io/WithAnyone/)
[Model](https://huggingface.co/WithAnyone/WithAnyone)
[MultiID-2M](https://huggingface.co/datasets/WithAnyone/MultiID-2M)
[MultiID-Bench](https://huggingface.co/datasets/WithAnyone/MultiID-Bench)
[GitHub](https://github.com/Doby-Xu/WithAnyone)
<p align="center">
<img src="https://github.com/Doby-Xu/WithAnyone/blob/main/assets/withanyone.gif?raw=true" alt="WithAnyone in action" width="800"/>
</p>
This repository contains the **MultiID-2M** dataset, a large-scale paired dataset specifically constructed for multi-person scenarios in identity-consistent image generation. It provides diverse references for each identity, enabling the development of advanced diffusion-based models like WithAnyone, which aim to mitigate "copy-paste" artifacts and improve controllability over pose and expression in generated images.
- **Paper:** [WithAnyone: Towards Controllable and ID Consistent Image Generation](https://huggingface.co/papers/2510.14975)
- **Code:** [https://github.com/Doby-Xu/WithAnyone](https://github.com/Doby-Xu/WithAnyone)
- **Project Page:** [https://doby-xu.github.io/WithAnyone/](https://doby-xu.github.io/WithAnyone/)
## Paper Abstract
The paper's abstract is reproduced below:
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.
| <img src="assets/stat1.jpg" width="100%"> | <img src="assets/stat2.jpg" width="83%"> |
|:--:|:--:|
## Download
Currently, 1M images and their metadata are available for download.
[HuggingFace Dataset](https://huggingface.co/datasets/WithAnyone/MultiID-2M)
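
For scripted downloads, the following is a minimal sketch that fetches the dataset with `huggingface_hub.snapshot_download`, optionally restricted to selected top-level folders. The helper names `build_download_kwargs` and `download` are hypothetical, and the folder names passed as `subsets` are assumed to match the File Structure section of this README.

```python
from typing import Dict, List, Optional

REPO_ID = "WithAnyone/MultiID-2M"  # dataset repository linked above


def build_download_kwargs(
    local_dir: str, subsets: Optional[List[str]] = None
) -> Dict[str, object]:
    """Build keyword arguments for huggingface_hub.snapshot_download.

    `subsets` are top-level folder names (assumed, e.g. "ref" or "train_cp");
    None means the whole repository is downloaded.
    """
    kwargs: Dict[str, object] = {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        "local_dir": local_dir,
    }
    if subsets:
        # Glob patterns restrict the download to the chosen folders.
        kwargs["allow_patterns"] = [f"{name}/*" for name in subsets]
    return kwargs


def download(local_dir: str, subsets: Optional[List[str]] = None) -> str:
    """Download the dataset (or selected folders) and return the local path."""
    # Imported lazily so the kwargs helper works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(local_dir, subsets))
```

Calling `download("./MultiID-2M", ["ref"])` would fetch only the reference folder, which may be useful for checking the layout before pulling the training shards.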
## File Structure
```
MultiID-2M/
├── ref/
│   ├── cluster_centers.tar
│   └── tars/                # reference tars
│       ├── ...
│
├── train_rec/               # reconstruction training data
│   ├── re_000000.tar
│   ├── re_000001.tar
│   └── ...
│
└── train_cp/                # identifiable paired data
    ├── re_000000.tar
    ├── re_000001.tar
    └── ...
```
- `ref/cluster_centers.tar`: Contains the cluster centers of all the identifiable identities in the dataset.
- `ref/tars`: Contains the reference images for each identifiable identity.
- `train_cp`: Contains training images of identifiable identities only.
- `train_rec`: Contains the training images of both identifiable and unidentifiable identities.
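
Before training, it can help to inspect what a shard actually contains. The sketch below groups the files inside one tar shard by basename using the standard `tarfile` module. The internal layout of the shards is not documented here, so the idea that files of one sample share a basename (for example an image plus a metadata file) is an assumption, and `group_members_by_key` is a hypothetical helper name.

```python
import os
import tarfile
from collections import defaultdict
from typing import Dict, List


def group_members_by_key(tar_path: str) -> Dict[str, List[str]]:
    """Group the regular files in a tar shard by path without extension.

    Assumption: files that belong to one sample share a basename, such as
    `000123.jpg` and `000123.json`. Adjust if the shards are organised
    differently.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and links
            key = os.path.splitext(member.name)[0]
            groups[key].append(member.name)
    return dict(groups)
```

Running it on a shard such as `train_cp/re_000000.tar` and printing a few entries shows which file types accompany each image.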
## Labels
The dataset contains dense labels for each image, including:
- `url`: Source URL of the original image.
- `ram_score`: Scores from the Recognize Anything Model (RAM).
- `bboxes`: Bounding boxes of detected faces.
- `aesthetics_score`: Aesthetic score of the image.
- `caption_en`: English caption generated by VLMs.
- `name`: ID number of the identity if it is identifiable; otherwise `none`.
- `embeddings` (or `embedding`): Face embeddings extracted with the ArcFace antelopev2 model. The embeddings correspond one-to-one to the entries in `bboxes`.
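
As an illustration of how these labels fit together, the sketch below pairs each face box with its embedding and compares two embeddings by cosine similarity. It assumes a metadata record has already been loaded into a Python dict whose `bboxes` and `embeddings` (or `embedding`) fields are lists in the same order; the on-disk format is not specified here, and `pair_faces` and `cosine_similarity` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pair_faces(record: Dict) -> List[Dict]:
    """Pair each bounding box with its embedding.

    Assumption: the two lists are aligned by position.
    """
    bboxes = record["bboxes"]
    # The field name varies between `embeddings` and `embedding`.
    embeddings = record.get("embeddings", record.get("embedding"))
    if embeddings is None or len(bboxes) != len(embeddings):
        raise ValueError("bboxes and embeddings must have the same length")
    return [{"bbox": b, "embedding": e} for b, e in zip(bboxes, embeddings)]
```

A similarity close to 1 between a face embedding and a reference embedding of the same `name` would suggest the same identity, although any threshold would need to be chosen empirically.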
## Sample Usage
This section provides instructions for quickly getting started with the `WithAnyone` model, which can be trained using this dataset.
### Requirements
Use `pip install -r requirements.txt` to install the necessary packages.
### Gradio Demo
The Gradio GUI demo is a good starting point to experiment with WithAnyone. Run it with:
```bash
python gradio_app.py --flux_path <path to flux1-dev directory> --ipa_path <path to withanyone directory> \
--clip_path <path to clip-vit-large-patch14> \
--t5_path <path to xflux_text_encoders> \
--siglip_path <path to siglip-base-patch16-256-i18n> \
--model_type "flux-dev" # or "flux-kontext" for WithAnyone.K
```
❗ WithAnyone requires face bounding boxes (bboxes) that indicate where the faces should appear. You can provide them in one of the following ways:
1. Upload an example image with desired face locations in `Mask Configuration (Option 1: Automatic)`. The face bboxes will be extracted automatically, and faces will be generated in the same locations. Do not worry if the given image has a different resolution or aspect ratio; the face bboxes will be resized accordingly.
2. Input face bboxes directly in `Mask Configuration (Option 2: Manual)`. The format is `x1,y1,x2,y2` for each face, one per line.
3. <span style="color: #999;">(NOT recommended) leave both options empty, and the face bboxes will be randomly chosen from a pre-defined set. </span>
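
The manual format from option 2 can be prepared and checked outside the demo. The sketch below parses one `x1,y1,x2,y2` box per line and rescales boxes from one image size to another. It is not the demo's own code: the function names are hypothetical, and it assumes pixel coordinates with the origin at the top-left corner and `x1 < x2`, `y1 < y2`.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]


def parse_bboxes(text: str) -> List[BBox]:
    """Parse one `x1,y1,x2,y2` box per non-empty line."""
    boxes: List[BBox] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = [int(float(p)) for p in line.split(",")]
        if len(parts) != 4:
            raise ValueError(f"expected 4 values, got {len(parts)}: {line!r}")
        x1, y1, x2, y2 = parts
        if x2 <= x1 or y2 <= y1:
            raise ValueError(f"box has non-positive size: {line!r}")
        boxes.append((x1, y1, x2, y2))
    return boxes


def rescale_bboxes(
    boxes: List[BBox], src_size: Tuple[int, int], dst_size: Tuple[int, int]
) -> List[BBox]:
    """Scale boxes from a (width, height) source size to a target size."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return [
        (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))
        for x1, y1, x2, y2 in boxes
    ]
```

Scaling each axis independently mirrors the idea in option 1 that boxes follow the image when its resolution or aspect ratio changes; the demo's exact resizing rule may differ.
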
⭕ WithAnyone works well with LoRA. If you have any stylized LoRA checkpoints, use `--additional_lora_ckpt <path to lora checkpoint>` when launching the demo. The LoRA will be merged into the diffusion model.
```bash
python gradio_app.py --flux_path <path to flux1-dev directory> --ipa_path <path to withanyone directory> \
--additional_lora_ckpt <path to lora checkpoint> \
--lora_scale 0.8 # adjust the weight as needed
```
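To make the effect of `--lora_scale` concrete, here is a generic sketch of what merging a LoRA into a weight matrix usually means: the low-rank update `B @ A` is multiplied by a scale and added to the base weight. This is the standard LoRA formulation rather than this repository's implementation, it uses plain Python lists to stay dependency-free, and it assumes any alpha/rank factor is already folded into `scale`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float) -> Matrix:
    """Return weight + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

Under this formulation a scale of 0 leaves the base model unchanged, and larger values strengthen the LoRA's style.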
### Batch Inference
You can use `infer_withanyone.py` for batch inference. The script supports generating multiple images with MultiID-Bench.
First, download MultiID-Bench:
```bash
huggingface-cli download WithAnyone/MultiID-Bench --repo-type dataset --local-dir <path to MultiID-Bench directory>
```
Then convert the Parquet file to a folder of images and a JSON file using `MultiID_Bench/parquet2bench.py`:
```bash
python MultiID_Bench/parquet2bench.py --parquet <path to parquet file> --output_dir <path to output directory>
```
You will get a folder with the following structure:
```
<output_dir>/
├── p1/untar
├── p2/untar
├── p3/
├── p1.json
├── p2.json
└── p3.json
```
Then run batch inference with:
```bash
# --use_matting:   set to True when siglip_weight > 0.0
# --siglip_weight: Resemblance in Spirit vs. Resemblance in Form; higher means more similar to the reference
# --id_weight:     usually set to 1 - siglip_weight; higher means more controllable
python infer_withanyone.py \
    --eval_json_path <path to MultiID-Bench subset json> \
    --data_root <path to MultiID-Bench subset images> \
    --save_path <path to save results> \
    --use_matting True \
    --siglip_weight 0.0 \
    --id_weight 1.0 \
    --t5_path <path to xflux_text_encoders> \
    --clip_path <path to clip-vit-large-patch14> \
    --ipa_path <path to withanyone> \
    --flux_path <path to flux1-dev>
```
Here, `data_root` should be `p1/untar`, `p2/untar`, or `p3/`, depending on which subset you want to evaluate, and `eval_json_path` should be the corresponding JSON file produced by `parquet2bench.py`.
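
The pairing of `data_root` and `eval_json_path` can be captured in a small helper. The sketch below maps a subset name to both paths, following the folder structure shown above; `resolve_subset` and `SUBSET_IMAGE_DIRS` are hypothetical names, not part of the repository.

```python
import os
from typing import Tuple

# Image folder of each subset, relative to the parquet2bench.py output directory.
SUBSET_IMAGE_DIRS = {
    "p1": os.path.join("p1", "untar"),
    "p2": os.path.join("p2", "untar"),
    "p3": "p3",
}


def resolve_subset(output_dir: str, subset: str) -> Tuple[str, str]:
    """Return (data_root, eval_json_path) for one MultiID-Bench subset."""
    if subset not in SUBSET_IMAGE_DIRS:
        raise ValueError(f"unknown subset {subset!r}; expected one of p1, p2, p3")
    data_root = os.path.join(output_dir, SUBSET_IMAGE_DIRS[subset])
    eval_json_path = os.path.join(output_dir, f"{subset}.json")
    return data_root, eval_json_path
```

The two returned values can then be passed to `--data_root` and `--eval_json_path`, which avoids mismatching a subset's images with another subset's JSON file.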
### Face Edit with FLUX.1 Kontext
You can use `gradio_edit.py` for face editing with FLUX.1 Kontext and WithAnyone.Ke.
```bash
python gradio_edit.py --flux_path <path to flux1-dev directory> --ipa_path <path to withanyone directory> \
--clip_path <path to clip-vit-large-patch14> \
--t5_path <path to xflux_text_encoders> \
--siglip_path <path to siglip-base-patch16-256-i18n> \
--model_type "flux-kontext"
```
## License and Disclaimer
This dataset is provided for non-commercial academic research purposes only. By accessing or using this dataset you agree to the terms in the [LICENSE](./LICENSE.md).
- **No ownership claim**: The project does not claim ownership of the original images, metadata, or other content included in this dataset. Copyright and other rights remain with the original rights holders.
- **User responsibility**: Users are responsible for ensuring their use of the dataset complies with all applicable laws, regulations, and third‑party terms (including platform policies).
- **Takedown / correction requests**: If a rights holder believes content in this dataset infringes their rights, please submit a removal or correction request via the [HuggingFace dataset page](https://huggingface.co/datasets/WithAnyone/MultiID-2M) or the [project page](https://doby-xu.github.io/WithAnyone/), including sufficient proof of ownership and specific identifiers/URLs. After verification of a valid claim, we will remove or correct the affected items as soon as reasonably practicable.
- **No warranty; limitation of liability**: The dataset is provided "as is" without warranties of any kind. The project and maintainers disclaim liability for any direct, indirect, incidental, or consequential damages arising from use of the dataset.
- **Prohibited commercial use**: Commercial use is prohibited unless you obtain separate permission from the dataset maintainers; unauthorized commercial use may result in legal liability.
- **Contact**: Use the HuggingFace dataset page or the project website to submit requests or questions.

Provider: maas. Created: 2025-10-21.



