five

Nexus-Gen训练数据集

收藏
魔搭社区2026-06-07 更新2025-07-26 收录
下载链接:
https://modelscope.cn/datasets/DiffSynth-Studio/Nexus-Gen-Training-Dataset
下载链接
链接失效反馈
官方服务:
资源简介:
# Nexus-Gen Training Dataset ## Intruduction This dataset contains the complete training data for [Nexus-Gen](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2), covering tasks including image understanding, generation, and editing. For details about training Nexus-Gen, please refer to the [tech report](https://arxiv.org/abs/2504.21356) and [github repo](https://github.com/modelscope/Nexus-Gen). All annotations are unified into the [standard messages format](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html), stored in JSONL files. A single annotation example includes: * `images`: List of relative image paths, e.g. `["t2i/Flux/000055/000055_495587.png"]` * `messages`: Messages for training the language model, e.g. ```json [ {"role": "user", "content": "Generate an image according to the following description: A charming cartoon mouse with a light brown fur coat stands upright on its hind legs."}, {"role": "assistant", "content": "Here is an image based on the description: <image>"} ] ``` * id: Unique data identifier, e.g. `"EliGen_495587"` * source: Sample source, e.g. `"EliGen"` The total dataset size is 26.3M samples, with the following distribution: ![Distribution](asserts/dataset_distribution.jpg) To facilitate model training and open-source usage, we further partition the annotations into stage-level annotations and task-level annotations. Since annotation files are large, see ./previews for more sample annotations. ### Stage-Level Annotations 1. `autoregressive_model_pretraining_26.3M.jsonl`: Includes all 26.3 million training samples for pretraining the autoregressive model. 2. `autoregressive_model_aes_finetuning_4.3M.jsonl`: This subset is used for quality fine-tuning of the autoregressive model. Image generation samples are restricted to high-quality subsets, while image editing samples primarily come from the ImagePulse subset. 3. `generation_decoder_1.7M.jsonl`: Subset used to train Nexus-Gen’s generation decoder, containing only high-quality image generation samples. 4. `editing_decoder_0.85M.jsonl`: Subset used to train Nexus-Gen’s editing decoder, exclusively sourced from the ImagePulse image editing dataset. ### Task Level Annotations 1. `image_understanding_5.8M.jsonl`: This task is structured with multimodal inputs (image-text pairs) and text-only outputs, which serves as a direct indicator of model’s chat and understanding ability. While MLLMs inherently possess such cross-modal reasoning capabilities, this task is still critical during training to prevent capacity degradation. We adopt [Cambrian-7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M) as the data source, a comprehensive dataset spanning multiple domains including optical character recognition, general visual question answering, language, counting, code, math and science tasks. To improve data quality, we re-annotate the answers for all samples with Qwen2.5-VL-72B. 2. `image_generation_13.3M.jsonl`: The input for this task is the textual description, and the output is an image. Our data sources comprise [Journey DB](https://huggingface.co/datasets/JourneyDB/JourneyDB), [AnyWord](https://modelscope.cn/datasets/iic/AnyWord-3M/summary), [Laion-High-Resolution](https://huggingface.co/datasets/laion/laion-high-resolution), [EliGen TrainSet](https://modelscope.cn/datasets/DiffSynth-Studio/EliGenTrainSet), [FLUX-Aes](https://huggingface.co/datasets/gogoduan/flux_laion_aes), [FLUX-T2I](https://huggingface.co/datasets/jackyhate/text-to-image-2M) and [Blip3o-60K](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k). To enhance annotation diversity, we employ a dual-captioning paradigm via Qwen2.5-VL-72B, generating both concise captions and elaborate descriptions for each image. During training, we stochastically sample these annotations with stratified ratios (20\% concise vs. 80\% elaborate) to balance brevity and contextual granularity. 3. `image_editing_6.3M.jsonl`: The input for editing task consists of an image and its corresponding editing instruction, and the output denotes the edited image. Our data sources encompass datasets such as [HQ-Edit](https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit), [UltraEdit](https://huggingface.co/datasets/BleachNick/UltraEdit), [OmniEdit](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M), [StyleBooth](https://ali-vilab.github.io/stylebooth-page/) and [ImagePulse](https://modelscope.cn/collections/ImagePulse----tulvmaidong-7c3b8283a43e40). ## How To Use We provide a script to unzip the images from tar files to subfolders in `./images`: ```python import os import subprocess from tqdm import tqdm import multiprocessing from functools import partial source_tar_dir = "tars" target_image_dir = "images" sub_dirs = ['vqa', 'edit', 't2i/Anyword', 't2i/Blip3o', 't2i/Flux', 't2i/JourneyDB', 't2i/Laion'] tar_bases = [os.path.join(source_tar_dir, sub_dir) for sub_dir in sub_dirs] image_bases = [os.path.join(target_image_dir, sub_dir) for sub_dir in sub_dirs] for img_dir in image_bases: os.makedirs(img_dir, exist_ok=True) def unpack_tar(tar_path, dest_dir): cmd = [ "tar", "xf", tar_path, "-C", dest_dir ] subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) return os.path.join(dest_dir, os.path.basename(tar_path).replace('.tar', '')) all_tar_files = [] for tar_base, img_base in zip(tar_bases, image_bases): files = [f for f in os.listdir(tar_base) if f.endswith('.tar')] all_tar_files.extend([(os.path.join(tar_base, f), img_base) for f in files]) def process_tar(args): return unpack_tar(args[0], args[1]) num_processes = min(50, multiprocessing.cpu_count()) with multiprocessing.Pool() as pool: results = [] with tqdm(total=len(all_tar_files), desc="Unpacking TAR files") as pbar: for res in pool.imap_unordered(process_tar, all_tar_files): pbar.update(1) results.append(res) print(f"\nAll TAR files unpacked to: {target_image_dir}") print(f"Total unpacked folders: {len(results)}") ``` Furthermore, you can verify file correctness using the following code and inject your absolute path into the annotations: ```python import os import random import json from tqdm import tqdm def read_jsonl(file_path, num_samples=None): print(f"reading from {file_path}") data_list = [] samples = 0 with open(file_path, 'r', encoding='utf-8') as file: for line in tqdm(file): data = json.loads(line.strip()) data_list.append(data) samples += 1 if num_samples is not None and samples >= num_samples: break print(f"read {len(data_list)} samples") return data_list def save_jsonl(data, file_path): dir_path = os.path.dirname(file_path) os.makedirs(dir_path, exist_ok=True) print(f'saving to {file_path}') with open(file_path, 'w', encoding='utf-8') as file: for item in tqdm(data, desc='Saving', total=len(data)): json.dump(item, file, ensure_ascii=False) file.write('\n') print(f'saved {len(data)} samples') # path to the images directory images_path = "path_to_your_data/images" # path to annotations input_jsonl = 'stage_level_annotations/editing_decoder_0.85M.jsonl' output_jsonl = 'stage_level_annotations/editing_decoder_0.85M_abs_path.jsonl' preview_jsonl = 'previews/editing_decoder_0.85M_abs_path_preview1k.jsonl' datas = read_jsonl(input_jsonl) for data in tqdm(datas): if 'images' in data: images = [os.path.join(images_path, image) for image in data['images']] data['images'] = images save_jsonl(datas, output_jsonl) random.shuffle(datas) preview_set = datas[:1000] for data in preview_set: if 'images' in data: for image in images: assert os.path.exists(image), f"Image path {image} does not exist" save_jsonl(preview_set, preview_jsonl) ``` ### Citation ``` @misc{zhang2025nexusgenunifiedimageunderstanding, title={Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space}, author={Hong Zhang and Zhongjie Duan and Xingjun Wang and Yuze Zhao and Weiyi Lu and Zhipeng Di and Yixuan Xu and Yingda Chen and Yu Zhang}, year={2025}, eprint={2504.21356}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2504.21356}, } ```

# Nexus-Gen 训练数据集 ## 引言 本数据集为[Nexus-Gen](https://www.modelscope.cn/models/DiffSynth-Studio/Nexus-GenV2)提供完整训练数据,涵盖图像理解、生成与编辑三类任务。如需了解Nexus-Gen的训练细节,请参阅其[技术报告](https://arxiv.org/abs/2504.21356)及[GitHub仓库](https://github.com/modelscope/Nexus-Gen)。 所有标注均统一为**标准消息格式(standard messages format)**,存储于JSONL文件中。单条标注示例包含以下字段: * `images`:图像相对路径列表,例如`["t2i/Flux/000055/000055_495587.png"]` * `messages`:用于训练大语言模型(Large Language Model,LLM)的对话消息,例如: json [ {"role": "user", "content": "请根据以下描述生成图像:一只拥有浅棕色皮毛的可爱卡通小鼠直立站在后腿上。"}, {"role": "assistant", "content": "以下是基于该描述生成的图像:<image>"} ] * `id`:唯一数据标识符,例如`"EliGen_495587"` * `source`:样本来源,例如`"EliGen"` 本数据集总规模为2630万条样本,数据分布如下: ![Distribution](asserts/dataset_distribution.jpg) 为便于模型训练与开源使用,我们进一步将标注划分为**阶段级标注**与**任务级标注**。由于标注文件体积较大,更多样本标注示例请查看`./previews`目录。 ### 阶段级标注 1. `autoregressive_model_pretraining_26.3M.jsonl`:包含全部2630万条训练样本,用于自回归模型的预训练。 2. `autoregressive_model_aes_finetuning_4.3M.jsonl`:该子集用于自回归模型的质量微调。其中图像生成样本限定为高质量子集,图像编辑样本主要源自ImagePulse子集。 3. `generation_decoder_1.7M.jsonl`:用于训练Nexus-Gen生成解码器的子集,仅包含高质量图像生成样本。 4. `editing_decoder_0.85M.jsonl`:用于训练Nexus-Gen编辑解码器的子集,全部数据均源自ImagePulse图像编辑数据集。 ### 任务级标注 1. `image_understanding_5.8M.jsonl`:该任务采用多模态输入(图像-文本对)与纯文本输出,可直接反映模型的对话与理解能力。尽管多模态大语言模型(Multimodal Large Language Model,MLLM)天生具备此类跨模态推理能力,但该任务在训练中仍至关重要,可防止模型能力退化。我们采用[Cambrian-7M](https://huggingface.co/datasets/nyu-visionx/Cambrian-10M)作为数据源,该数据集涵盖光学字符识别、通用视觉问答、语言、计数、代码、数学与科学任务等多个领域。为提升数据质量,我们使用Qwen2.5-VL-72B对所有样本的答案进行了重新标注。 2. `image_generation_13.3M.jsonl`:该任务的输入为文本描述,输出为图像。我们的数据源包括[Journey DB](https://huggingface.co/datasets/JourneyDB/JourneyDB)、[AnyWord](https://modelscope.cn/datasets/iic/AnyWord-3M/summary)、[Laion-High-Resolution](https://huggingface.co/datasets/laion/laion-high-resolution)、[EliGen TrainSet](https://modelscope.cn/datasets/DiffSynth-Studio/EliGenTrainSet)、[FLUX-Aes](https://huggingface.co/datasets/gogoduan/flux_laion_aes)、[FLUX-T2I](https://huggingface.co/datasets/jackyhate/text-to-image-2M)以及[Blip3o-60K](https://huggingface.co/datasets/BLIP3o/BLIP3o-60k)。为增强标注多样性,我们采用Qwen2.5-VL-72B实现双标注范式,为每张图像生成简洁标题与详细描述。训练过程中,我们按分层比例随机采样这些标注(20%简洁标题 vs 80%详细描述),以平衡简洁性与上下文细节丰富度。 3. `image_editing_6.3M.jsonl`:该编辑任务的输入为图像及其对应的编辑指令,输出为编辑后的图像。我们的数据源包括[HQ-Edit](https://huggingface.co/datasets/UCSC-VLAA/HQ-Edit)、[UltraEdit](https://huggingface.co/datasets/BleachNick/UltraEdit)、[OmniEdit](https://huggingface.co/datasets/TIGER-Lab/OmniEdit-Filtered-1.2M)、[StyleBooth](https://ali-vilab.github.io/stylebooth-page/)以及[ImagePulse](https://modelscope.cn/collections/ImagePulse----tulvmaidong-7c3b8283a43e40)。 ## 使用方法 我们提供了一段脚本,用于将tar压缩包中的图像解压至`./images`目录下的子文件夹中: python import os import subprocess from tqdm import tqdm import multiprocessing from functools import partial source_tar_dir = "tars" target_image_dir = "images" sub_dirs = ['vqa', 'edit', 't2i/Anyword', 't2i/Blip3o', 't2i/Flux', 't2i/JourneyDB', 't2i/Laion'] tar_bases = [os.path.join(source_tar_dir, sub_dir) for sub_dir in sub_dirs] image_bases = [os.path.join(target_image_dir, sub_dir) for sub_dir in sub_dirs] for img_dir in image_bases: os.makedirs(img_dir, exist_ok=True) def unpack_tar(tar_path, dest_dir): cmd = [ "tar", "xf", tar_path, "-C", dest_dir ] subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) return os.path.join(dest_dir, os.path.basename(tar_path).replace('.tar', '')) all_tar_files = [] for tar_base, img_base in zip(tar_bases, image_bases): files = [f for f in os.listdir(tar_base) if f.endswith('.tar')] all_tar_files.extend([(os.path.join(tar_base, f), img_base) for f in files]) def process_tar(args): return unpack_tar(args[0], args[1]) num_processes = min(50, multiprocessing.cpu_count()) with multiprocessing.Pool() as pool: results = [] with tqdm(total=len(all_tar_files), desc="Unpacking TAR files") as pbar: for res in pool.imap_unordered(process_tar, all_tar_files): pbar.update(1) results.append(res) print(f" All TAR files unpacked to: {target_image_dir}") print(f"Total unpacked folders: {len(results)}") 此外,你可以使用以下代码验证文件正确性,并将图像路径替换为本地绝对路径以注入标注文件中: python import os import random import json from tqdm import tqdm def read_jsonl(file_path, num_samples=None): print(f"reading from {file_path}") data_list = [] samples = 0 with open(file_path, 'r', encoding='utf-8') as file: for line in tqdm(file): data = json.loads(line.strip()) data_list.append(data) samples += 1 if num_samples is not None and samples >= num_samples: break print(f"read {len(data_list)} samples") return data_list def save_jsonl(data, file_path): dir_path = os.path.dirname(file_path) os.makedirs(dir_path, exist_ok=True) print(f'saving to {file_path}') with open(file_path, 'w', encoding='utf-8') as file: for item in tqdm(data, desc='Saving', total=len(data)): json.dump(item, file, ensure_ascii=False) file.write(' ') print(f'saved {len(data)} samples') # path to the images directory images_path = "path_to_your_data/images" # path to annotations input_jsonl = 'stage_level_annotations/editing_decoder_0.85M.jsonl' output_jsonl = 'stage_level_annotations/editing_decoder_0.85M_abs_path.jsonl' preview_jsonl = 'previews/editing_decoder_0.85M_abs_path_preview1k.jsonl' datas = read_jsonl(input_jsonl) for data in tqdm(datas): if 'images' in data: images = [os.path.join(images_path, image) for image in data['images']] data['images'] = images save_jsonl(datas, output_jsonl) random.shuffle(datas) preview_set = datas[:1000] for data in preview_set: if 'images' in data: for image in images: assert os.path.exists(image), f"Image path {image} does not exist" save_jsonl(preview_set, preview_jsonl) ## 引用 bibtex @misc{zhang2025nexusgenunifiedimageunderstanding, title={Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space}, author={Hong Zhang and Zhongjie Duan and Xingjun Wang and Yuze Zhao and Weiyi Lu and Zhipeng Di and Yixuan Xu and Yingda Chen and Yu Zhang}, year={2025}, eprint={2504.21356}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2504.21356}, }
提供机构:
maas
创建时间:
2025-07-17
搜集汇总
数据集介绍
main_image_url
背景与挑战
背景概述
该数据集是用于训练Nexus-Gen模型的完整训练数据,涵盖图像理解、生成和编辑三大任务。它包含2630万个样本,所有标注均以标准消息格式统一存储在JSONL文件中。
以上内容由遇见数据集搜集并总结生成
二维码
社区交流群
二维码
科研交流群
商业服务