MedTrinity-25M

Name: MedTrinity-25M
Creator: maas
Published: 2026-05-23 22:23:25
License: Not specified

Source: ModelScope community. Updated 2026-05-23; indexed 2024-08-31.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/MedTrinity-25M


Description:

# Tutorial of using MedTrinity-25M

MedTrinity-25M is a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, and detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. This dataset can be used to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.

- **Homepage:** https://github.com/yunfeixie233/MedTrinity-25M
- **Paperlink:** https://arxiv.org/abs/2408.02900
- **Github Repo:** https://github.com/UCSC-VLAA/MedTrinity-25M

## Subsets

This dataset is divided into three subsets to accommodate different research needs and access limitations:

1. **Demo Subset** (~100k samples)
   - A smaller subset ideal for initial experimentation and quick model prototyping.
   - Contains approximately 100,000 samples.

2. **Full Dataset** (25M samples) *[Text-only]*
   - Contains the complete text content of MedTrinity-25M.
   - Due to access limitations, images are not included.

3. **Accessible Dataset** (18M samples) *[Recommended]*
   - Contains the publicly accessible portion of MedTrinity-25M.
   - Includes complete image-text pairs.
   - Most suitable for training purposes due to its completeness and accessibility.

The Accessible Dataset is available at this [Link](https://huggingface.co/datasets/UCSC-VLAA/MedTrinity-25M).
## Dataset Download and Preparation

## Deploy Accessible Dataset

For datasets that are publicly accessible, follow these steps to download and prepare the data:

1. Download the dataset using the provided links or through the Hugging Face Datasets library:

### (Recommended minimum disk space: >2TB)

```bash
git lfs install
git clone https://huggingface.co/datasets/UCSC-VLAA/MedTrinity-25M
cd ./MedTrinity-25M
```

2. The downloaded files are compressed; use the following bash script to decompress them:

```bash
bash ./toolkit/unzip.sh -k 16 -d -t /path/to/target -s ./25M_accessible

# This script decompresses all .tar.zst files in the specified directory and its subdirectories.
#
# Options:
#   -k <num>   Set the number of parallel decompression tasks (default: 16)
#              Higher values increase speed but require more disk space during decompression.
#              Each task may use up to 40GB of disk space.
#              Recommended maximum: (available_disk_space_GB // 40).
#              For systems with >2.5TB free space and >32 CPU cores, 67 is optimal for maximum speed.
#   -d         Delete original files after successful decompression (default: do not delete)
#   -t <dir>   Set the target directory for decompression (default: same as source file location)
#   -s <dir>   Set the search directory for .tar.zst files (default: current directory)
```

## Deploy Full Dataset

(This tutorial is incomplete and under construction.)

For datasets that are not publicly accessible, please follow the steps below to download and prepare each dataset:

### CheXpert

1. **Download the CheXpert dataset** from the following link:
   - [CheXpert Dataset](https://stanfordaimi.azurewebsites.net/datasets/8cbd9ed4-2eb9-4565-affc-111cf4f7ebe2)

2. **Download the corresponding masks** from CheXmask:
   - [CheXmask](https://github.com/ngaggion/CheXmask-Database)

3. **Overlay the masks onto the original images** in the CheXpert dataset by following the tutorial provided in the CheXmask repository.

### MIMIC-CXR-JPG

1. **Download the MIMIC-CXR-JPG dataset** from the following link:
   - [MIMIC-CXR-JPG Dataset](https://physionet.org/content/mimic-cxr-jpg/2.1.0/)

2. **Download the corresponding masks** from CheXmask:
   - [CheXmask](https://github.com/ngaggion/CheXmask-Database)

3. **Overlay the masks onto the original images** in the MIMIC-CXR-JPG dataset by following the tutorial provided in the CheXmask repository.

### PadChest

1. **Download the PadChest dataset** from the following link:
   - [PadChest Dataset](https://bimcv.cipf.es/bimcv-projects/padchest/)

2. **Download the corresponding masks** from CheXmask:
   - [CheXmask](https://github.com/ngaggion/CheXmask-Database)

3. **Overlay the masks onto the original images** in the PadChest dataset by following the tutorial provided in the CheXmask repository.

### Others

To be updated.

## Dataset Structure and Usage

The dataset is divided into shards, with each shard containing:

- Image files
- A `metadata.jsonl` file containing captions and source information for the images

### Metadata Structure

The `metadata.jsonl` file contains the following fields for each image:

- `file_name` (str): The filename of the image
- `id` (str): UUID of the image
- `caption` (str): Multi-granular description of the image, including disease/lesion type, modality, region-specific descriptions, and inter-regional relationships
- `source` (str): Image dataset source

### Usage Notes

The dataset is divided into shards to avoid exceeding file system limits on the number of files in a single directory. To use all shards for training at once, modify your dataloader code to iterate through all shard folders and load each folder's `metadata.jsonl` and image files in turn.
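The shard iteration described in the usage notes can be sketched in Python as follows. This is a minimal sketch, not part of the dataset's toolkit: it assumes each shard is a directory holding a `metadata.jsonl` file next to the image files it names, and the function name `iter_samples` is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_samples(root: str) -> Iterator[Tuple[Path, Dict[str, str]]]:
    """Yield (image_path, record) pairs from every shard under `root`.

    Assumption: each shard directory holds a `metadata.jsonl` file next to
    the images it describes. Adjust if the extracted layout differs.
    """
    # rglob finds metadata.jsonl at any depth, so nested shard folders work too.
    for meta_path in sorted(Path(root).rglob("metadata.jsonl")):
        shard_dir = meta_path.parent
        with meta_path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines in the JSONL file
                record = json.loads(line)
                # `file_name` is resolved relative to the shard directory.
                yield shard_dir / record["file_name"], record
```

Because the function is a generator, it reads one metadata line at a time and never loads images itself; a training dataloader could consume the yielded paths and open each image lazily, using `record["caption"]` and `record["source"]` as needed.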
## Dataset Availability and Licensing

The following table summarizes the availability and licensing information for each dataset in MedTrinity-25M:

| Dataset | Available | Link | License |
|---------|-----------|------|---------|
| BHX | Yes | [Link](https://physionet.org/content/bhx-brain-bounding-box/1.1/) | - |
| BRATS24-MICCAI | Yes | [Link](https://www.synapse.org/Synapse:syn53708126) | CC BY 4.0 |
| Breast Histopathology | Yes | [Link](https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images) | CC0 1.0 Universal |
| BreastCancer | Yes | [Link](https://zenodo.org/records/6633721) | CC BY 4.0 |
| CheXpert | No | [Link](https://stanfordmlgroup.github.io/competitions/chexpert/) | Stanford University Dataset Research Use Agreement |
| CISC | Yes | [Link](https://academictorrents.com/details/99f2c7b57b95500711e33f2ee4d14c9fd7c7366c) | CC BY-NC-SA 4.0 |
| CPD | Yes | [Link](https://zenodo.org/records/7282326) | CC BY 4.0 |
| CT-RATE | Yes | [Link](https://huggingface.co/datasets/ibrahimhamamci/CT-RATE) | CC BY-NC-SA 4.0 |
| DeepLesion | Yes | [Link](https://huggingface.co/datasets/farrell236/DeepLesion) | CC BY 4.0 |
| FLARE23 | Yes | [Link](https://codalab.lisn.upsaclay.fr/competitions/12239) | CC BY-NC-SA 4.0 |
| IHC4BC | Yes | [Link](https://www.kaggle.com/datasets/akbarnejad1991/ihc4bc-compressed/data) | Unknown |
| KIPA22 | No | [Link](https://kipa22.grand-challenge.org/) | CC BY-NC-ND 3.0 |
| LLaVA-Med | Yes | [Link](https://github.com/LMMMEng/LLD-MMRI-Dataset) | CC BY-NC 4.0 |
| MAMA-MIA | Yes | [Link](https://www.synapse.org/Synapse:syn60868042/wiki/628716) | CC BY-NC 4.0 |
| MIMIC-CXR-JPG | No | [Link](https://physionet.org/content/mimic-cxr-jpg/2.1.0/) | PhysioNet Credentialed Health Data License 1.5.0 |
| NCT-CRC-HE-100K | Yes | [Link](https://www.kaggle.com/datasets/imrankhan77/nct-crc-he-100k) | CC0 1.0 Universal |
| NIH-CXR | Yes | [Link](https://www.kaggle.com/datasets/nih-chest-xrays/data) | CC0 1.0 Universal |
| PadChest | No | [Link](https://bimcv.cipf.es/bimcv-projects/padchest/) | PADCHEST Dataset Research Use Agreement |
| PatchGastricADC22 | Yes | [Link](https://zenodo.org/records/6550925) | CC BY 4.0 |
| Path-VQA | Yes | [Link](https://huggingface.co/datasets/flaviagiammarino/path-vqa) | MIT |
| PMC-OA | Yes | [Link](https://huggingface.co/datasets/axiong/pmc_oa) | OpenRAIL |
| PMC-VQA | Yes | [Link](https://huggingface.co/datasets/xmcmic/PMC-VQA) | CC BY-SA |
| TCGA | Yes | [Link](https://zenodo.org/records/5889558) | CC BY-NC-SA 4.0 |
| PTCGA | Yes | [Link](https://drive.google.com/drive/folders/18CmL-WLyppK1Rk29CgV7ib5MACFzg5ei) | CC BY-NC-SA 4.0 |
| QUILT-1M | No | [Link](https://zenodo.org/records/8239942) | - |
| SA-Med2D-20M | Yes | [Link](https://openxlab.org.cn/datasets/GMAI/SA-Med2D-20M) | - |
| SLAKE | Yes | [Link](https://www.med-vqa.com/slake/) | CC BY 4.0 |
| ULS23 | Yes | [Link](https://zenodo.org/records/10035161) | CC BY-NC-SA 4.0 |
| VALSET | No | [Link](https://zenodo.org/records/7548828) | - |
| VQA-RAD | Yes | [Link](https://osf.io/89kps/) | CC0 1.0 Universal |

## Disclaimer

MedTrinity-25M is a compilation of publicly available datasets, created with the intent to contribute back to the community and provide researchers and developers with a resource for academic and technical research. Any individual or organization (hereinafter referred to as "User") utilizing this dataset must adhere to the following disclaimer:

1. **Dataset Origin**: This dataset is composed of multiple publicly available datasets, the sources of which are clearly identified in the preprint paper. Users are obligated to comply with the relevant licenses and terms of use of the original datasets.

2. **Data Accuracy**: While every effort has been made to ensure the accuracy and completeness of the dataset, we cannot guarantee its absolute accuracy. Users assume all risks and responsibilities associated with the use of this dataset.

3. **Limitation of Liability**: Under no circumstances shall the dataset providers or contributors be held liable for any actions or outcomes resulting from the User's utilization of this dataset.

4. **Usage Constraints**: Users must comply with applicable laws, regulations, and ethical standards when using this dataset. The dataset must not be used for illegal, privacy-infringing, defamatory, discriminatory, or other unlawful or unethical purposes.

5. **Intellectual Property**: The intellectual property rights of this dataset belong to the respective rights holders of the original datasets. Users shall not infringe upon the intellectual property rights of the dataset in any manner.

As a non-profit organization, our team advocates for a harmonious and friendly open-source exchange environment. If you discover any content within the open-source dataset that infringes upon your legal rights, please send an email to yxie126@ucsc.edu. In your email, please provide a detailed description of the alleged infringement and furnish us with relevant proof of ownership. We will initiate an investigation process within 3 working days and take necessary measures to address the issue (such as removing the relevant data). However, please ensure the veracity of your complaint, as any adverse consequences resulting from measures taken based on false claims will be solely your responsibility.

By downloading, copying, accessing, or using this dataset, the User acknowledges that they have read, understood, and agreed to abide by all terms and conditions set forth in this disclaimer. If the User is unable to accept any part of this disclaimer, they should refrain from using this dataset.

## License and Citation

Please respect the individual licenses for each dataset as specified in the table above. When using these datasets in your research, make sure to cite the original sources and comply with their respective terms of use.
If you find MedTrinity-25M useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{xie2024medtrinity25mlargescalemultimodaldataset,
      title={MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine},
      author={Yunfei Xie and Ce Zhou and Lang Gao and Juncheng Wu and Xianhang Li and Hong-Yu Zhou and Sheng Liu and Lei Xing and James Zou and Cihang Xie and Yuyin Zhou},
      year={2024},
      eprint={2408.02900},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.02900},
}
```

---


Provider: maas

Created: 2024-08-09

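Dataset introduction

Background overview

MedTrinity-25M is a large-scale multimodal medical dataset containing more than 25 million images across 10 modalities and more than 65 diseases. It provides rich multigranular annotations, including global text descriptions and local region annotations. The dataset supports a range of medical AI tasks, such as report generation and image segmentation, and is split into demo, full text-only, and accessible subsets to suit different research needs, with the aim of advancing the development of medical foundation models.

The summary above was collected and generated by 遇见数据集.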
