ROCOv2-radiology
Source: ModelScope community (魔搭社区). Updated: 2026-01-09. Indexed: 2025-11-03.
Download link: https://modelscope.cn/datasets/eltorio/ROCOv2-radiology
Resource description:
# ROCOv2: Radiology Object in COntext version 2
## Introduction
ROCOv2 is a multimodal dataset of radiological images with associated medical concepts and captions, extracted from the PubMed Central (PMC) Open Access Subset. It is an updated version of the ROCO dataset that adds 35,705 new images and improves concept extraction and filtering.
## Dataset Overview
The ROCOv2 dataset contains 79,789 radiological images, each with a corresponding caption and medical concepts. The images are sourced from openly available publications in the PMC Open Access Subset, licensed under CC BY or CC BY-NC.
### Dataset Statistics
* 79,789 radiological images
  * 59,958 images in the training set
  * 9,904 images in the validation set
  * 9,927 images in the test set
* 1,947 unique UMLS Concept Unique Identifiers (CUIs) overall
  * 1,947 CUIs in the training set
  * 1,760 CUIs in the validation set
  * 1,754 CUIs in the test set
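
As a quick consistency check, the split sizes listed above can be summed and expressed as shares of the whole dataset. The sketch below uses only the numbers from this section; the variable names are illustrative and not part of the dataset.

```python
# Split sizes as listed in the statistics above.
split_sizes = {"train": 59_958, "validation": 9_904, "test": 9_927}
total_images = 79_789

# The three splits should account for every image in the dataset.
assert sum(split_sizes.values()) == total_images

# Share of each split as a percentage, rounded to one decimal place.
split_share = {
    name: round(100 * count / total_images, 1)
    for name, count in split_sizes.items()
}

for name, share in split_share.items():
    print(f"{name}: {share}%")
```

The split is roughly 75% training data, with the remainder divided almost evenly between validation and test.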
## Dataset Creation
The dataset was created by downloading the full PMC Open Access Subset via FTP, extracting the images and captions, and filtering the images using two binary classification models. The models achieved accuracies of about 90% and 98.6%, respectively.
### Filtering Steps
1. Non-compound image filtering: removed 15,315,657 images
2. Radiological image filtering: removed 64,831 non-radiological images
3. License filtering: removed 10,392 images from papers not licensed under CC BY or CC BY-NC
4. Duplicate removal: removed 2,056 duplicates
5. Caption filtering: removed 1,528 images with non-English captions and very short captions without relevant information
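
The five steps above form a sequential filter chain in which each step only sees the images that survived the previous one. The following is a minimal sketch of that pattern on toy records, not the authors' pipeline: the record fields (`is_compound`, `is_radiological`, `license`, `image_hash`, `caption`, `lang`), the helper names, and the word-count threshold for "very short" captions are all assumptions made for illustration. In the real pipeline the first two decisions come from the binary classification models.

```python
def make_dedup(key):
    """Return a stateful predicate that keeps only the first record per key."""
    seen = set()

    def keep(record):
        value = record[key]
        if value in seen:
            return False
        seen.add(value)
        return True

    return keep


def run_filters(records, filters):
    """Apply (name, predicate) filters in order and count removals per step."""
    removed = {}
    for name, keep in filters:
        survivors = [r for r in records if keep(r)]
        removed[name] = len(records) - len(survivors)
        records = survivors
    return records, removed


ALLOWED_LICENSES = {"CC BY", "CC BY-NC"}
MIN_CAPTION_WORDS = 3  # hypothetical threshold for a "very short" caption

filters = [
    ("non_compound", lambda r: not r["is_compound"]),
    ("radiological", lambda r: r["is_radiological"]),
    ("license", lambda r: r["license"] in ALLOWED_LICENSES),
    ("duplicates", make_dedup("image_hash")),
    ("caption", lambda r: r["lang"] == "en"
        and len(r["caption"].split()) >= MIN_CAPTION_WORDS),
]

# Toy records: each one after "a" is built to fail exactly one step.
base = {"is_compound": False, "is_radiological": True, "license": "CC BY",
        "image_hash": "h1", "caption": "Chest X-ray showing consolidation.",
        "lang": "en"}
records = [
    {**base, "id": "a"},                                        # survives
    {**base, "id": "b", "image_hash": "h2", "is_compound": True},
    {**base, "id": "c", "image_hash": "h3", "is_radiological": False},
    {**base, "id": "d", "image_hash": "h4", "license": "CC BY-ND"},
    {**base, "id": "e"},                                        # same hash as "a"
    {**base, "id": "f", "image_hash": "h6", "caption": "CT."},
]

kept, removed = run_filters(records, filters)
print([r["id"] for r in kept], removed)
```

Counting removals per step, as `run_filters` does, is what makes it possible to report figures like those in the list above.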
## Transformers Dataset Generation
The dataset hosted on the Hugging Face Hub was generated with this [notebook](https://colab.research.google.com/#fileId=https://huggingface.co/datasets/eltorio/ROCOv2-radiology/blob/main/generate.ipynb).
All source images and code are available in our [GitHub repo](https://github.com/sctg-development/ROCOv2-radiology).
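
For readers who only need the hosted data, a minimal loading sketch with the Hugging Face `datasets` library might look like the following. The repository ID is taken from the dataset URL on this page. The split names (`train`, `validation`, `test`) are an assumption based on the statistics section, and the column names are not documented here, so the sketch inspects them rather than assuming any.

```python
REPO_ID = "eltorio/ROCOv2-radiology"  # from the dataset URL on this page


def load_rocov2(split="train", streaming=True):
    """Load one split of the hosted dataset.

    Streaming avoids downloading every image before the first record
    can be read. The split name is an assumption, not confirmed by this card.
    """
    from datasets import load_dataset  # requires: pip install datasets

    return load_dataset(REPO_ID, split=split, streaming=streaming)


def preview_columns(split="train"):
    """Return the column names of the first record (needs network access)."""
    first_record = next(iter(load_rocov2(split)))
    return sorted(first_record.keys())
```

Calling `preview_columns()` once is a cheap way to confirm the actual schema before writing any training code against it.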
## Dataset Labels and Concepts
Labels and concepts were generated with the Medical Concept Annotation Toolkit (MedCAT) v1.10.0 and supplemented with manually curated concepts for modality (all images), body region (X-ray only), and directionality (X-ray only).
### Labeling and Concept Generation Workflow
The labeling and concept generation workflow consisted of the following steps:
1. Image caption extraction
2. Concept extraction using MedCAT
3. Manual curation of concepts for modality, body region, and directionality
4. Combination of automatically generated and manually curated concepts
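
Step 4 can be pictured as a per-image union of two concept lists. The sketch below is an illustration under stated assumptions, not the authors' code: the function name, the image IDs, the choice to list manual concepts first, and the CUI strings (placeholders, not real UMLS identifiers) are all hypothetical. The MedCAT extraction itself is not shown.

```python
def combine_concepts(auto_cuis, manual_cuis):
    """Merge two CUI lists, dropping duplicates and preserving order.

    Manual concepts come first here; that ordering is an arbitrary choice.
    """
    combined = []
    seen = set()
    for cui in list(manual_cuis) + list(auto_cuis):
        if cui not in seen:
            seen.add(cui)
            combined.append(cui)
    return combined


# Placeholder CUIs keyed by hypothetical image IDs.
auto = {"img_1": ["C0000003", "C0000001"], "img_2": ["C0000004"]}
manual = {"img_1": ["C0000001", "C0000002"]}

combined = {
    image_id: combine_concepts(auto.get(image_id, []), manual.get(image_id, []))
    for image_id in sorted(set(auto) | set(manual))
}
print(combined)
```

An image with no manual concepts, such as `img_2` here, simply keeps its automatically extracted ones.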
## Use Cases
The ROCOv2 dataset can be used for various applications, including:
* Training image annotation models based on image-caption pairs
* Multi-label image classification using UMLS concepts
* Pre-training of medical domain models
* Evaluation of deep learning models for multi-task learning
* Image retrieval and caption generation tasks
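
For the multi-label classification use case, each image's CUI list is typically turned into a fixed-length multi-hot target vector. The sketch below shows one way to do that in plain Python; the function names and the placeholder CUIs are assumptions for illustration. Because the training set contains all 1,947 CUIs reported in the statistics, a vocabulary built from training labels alone should also cover the validation and test sets.

```python
def build_vocab(label_lists):
    """Map every CUI seen in the given label lists to a stable index."""
    unique_cuis = sorted({cui for labels in label_lists for cui in labels})
    return {cui: index for index, cui in enumerate(unique_cuis)}


def multi_hot(labels, vocab):
    """Encode one image's CUIs as a 0/1 vector; unknown CUIs are ignored."""
    vector = [0] * len(vocab)
    for cui in labels:
        index = vocab.get(cui)
        if index is not None:
            vector[index] = 1
    return vector


# Placeholder CUIs, not real UMLS identifiers.
train_labels = [["C0000002", "C0000001"], ["C0000003"]]
vocab = build_vocab(train_labels)

print(multi_hot(["C0000003", "C0000001"], vocab))
```

Sorting the vocabulary keeps the index assignment reproducible across runs, so saved model outputs can be mapped back to CUIs later.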
## Citation
If you use the ROCOv2 dataset in your research, please cite the following paper:
Pelka, O., Menze, B. H., & Rexhausen, S. E. (2024). Radiology Objects in COntext version 2 (ROCOv2): A multimodal dataset for medical image analysis. arXiv preprint arXiv:2405.10004.
```bibtex
@misc {ronan_l.m._2024,
author = { {Ronan L.M.} },
title = { ROCOv2-radiology (Revision 5d66908) },
year = 2024,
url = { https://huggingface.co/datasets/eltorio/ROCOv2-radiology },
doi = { 10.57967/hf/3489 },
publisher = { Hugging Face }
}
```
## License
The ROCOv2 dataset is licensed under the CC BY-NC-SA 4.0 license.
## Acknowledgments
We acknowledge the National Library of Medicine (NLM) for providing access to the PMC Open Access Subset. We also acknowledge the creators of the Medical Concept Annotation Toolkit (MedCAT) for providing a valuable tool for concept extraction and annotation.
Provider: maas
Created: 2025-09-22



