MedVLThinker-m23k-tokenized
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-08-23.
Download link:
https://modelscope.cn/datasets/UCSC-VLAA/MedVLThinker-m23k-tokenized
Description:
Code: https://github.com/UCSC-VLAA/MedVLThinker
Project Page: https://ucsc-vlaa.github.io/MedVLThinker/
## 📊 Datasets
### Available Datasets
Our project provides several curated datasets for medical vision-language understanding and training:
| Dataset | Modality | Description | Download |
|---------|-----|-------------|----------|
| **MedVLThinker-m23k-tokenized** | Text-only | Tokenized version of the [m23k](https://github.com/UCSC-VLAA/m1) dataset | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-m23k-tokenized) |
| **MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized** | Image-Text | Tokenized PMC-VQA dataset with GPT-4o generated reasoning chains | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized) |
| **MedVLThinker-pmc_vqa** | Image-Text | Processed PMC-VQA dataset for training medical visual question answering with reinforcement learning with verifiable rewards (RLVR) | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-pmc_vqa) |
| **MedVLThinker-Eval** | Image-Text | Comprehensive evaluation dataset for medical VQA benchmarks | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-Eval) |
### Dataset Usage
```python
from datasets import load_dataset
# Load evaluation dataset
eval_dataset = load_dataset("UCSC-VLAA/MedVLThinker-Eval")
# Load training dataset with reasoning
train_dataset = load_dataset("UCSC-VLAA/MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized")
# Load PMC-VQA dataset
pmc_dataset = load_dataset("UCSC-VLAA/MedVLThinker-pmc_vqa")
# Load Medical23k tokenized dataset
m23k_dataset = load_dataset("UCSC-VLAA/MedVLThinker-m23k-tokenized")
```
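After loading, it can help to check which splits and columns a dataset actually exposes before writing training or evaluation code. The helper below is a minimal sketch, not part of the MedVLThinker codebase: `summarize_splits` is a hypothetical name, and it only assumes that `load_dataset` returned a mapping of split names to objects that support `len()` and expose `column_names`, as `datasets.DatasetDict` does.

```python
from typing import Any, Dict, List, Mapping, Tuple


def summarize_splits(dataset_dict: Mapping[str, Any]) -> Dict[str, Tuple[int, List[str]]]:
    """Map each split name to (number of rows, list of column names)."""
    return {
        split_name: (len(split), list(split.column_names))
        for split_name, split in dataset_dict.items()
    }


# Usage with a dataset loaded as shown above (requires network access):
# print(summarize_splits(m23k_dataset))
```

The split names and columns are not listed in this document, so the output should be inspected rather than assumed.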
<details><summary>Dataset details and preparing your own dataset</summary>
### Supported Datasets
Our framework supports evaluation on the following medical VQA datasets:
- **PMC-VQA**: PubMed Central Visual Question Answering
- **PathVQA**: Pathology Visual Question Answering
- **SLAKE**: Bilingual medical VQA dataset
- **VQA-RAD**: Radiology Visual Question Answering
- **MMMU Medical**: Medical subsets from MMMU benchmark
- **MedXpertQA**: Expert-level medical questions
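
Because the evaluation data mixes several benchmarks, results are usually easier to read when reported per source dataset. The following is a minimal sketch of that grouping, not the project's evaluation script. It assumes each sample carries the `dataset_name` and `answer_label` fields of the unified format described in this document, and that predictions are plain option labels; `accuracy_by_dataset` and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_dataset(
    samples: Iterable[Mapping], predictions: Iterable[str]
) -> Dict[str, float]:
    """Compute label accuracy separately for each value of `dataset_name`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        name = sample["dataset_name"]
        total[name] += 1
        # Compare labels case-insensitively and ignore surrounding whitespace.
        if predicted.strip().upper() == sample["answer_label"].strip().upper():
            correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Toy records with placeholder dataset names (not real benchmark data).
toy_samples = [
    {"dataset_name": "dataset_a", "answer_label": "A"},
    {"dataset_name": "dataset_a", "answer_label": "C"},
    {"dataset_name": "dataset_b", "answer_label": "B"},
]
toy_predictions = ["a", "B", "B"]
print(accuracy_by_dataset(toy_samples, toy_predictions))
```

Note that `zip` stops at the shorter input, so the caller should make sure there is exactly one prediction per sample.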
### Data Format
All datasets follow a unified format:
```python
{
    "images": [PIL.Image],         # List of images
    "question": str,               # Question text
    "options": Dict[str, str],     # Multiple choice options
    "answer_label": str,           # Correct answer label (A, B, C, D)
    "answer": str,                 # Full answer text
    "reasoning": str,              # Chain-of-thought reasoning (optional)
    "dataset_name": str,           # Source dataset name
    "dataset_index": int           # Unique sample identifier
}
```
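
When preparing a custom dataset, a quick check of each record against this format can catch mistakes early. The sketch below is illustrative and not part of the MedVLThinker codebase. `validate_sample` is a hypothetical helper, `images` is only checked for being a list so that Pillow is not required, a missing or `None` value for `reasoning` is treated as acceptable, and the keys of `options` are assumed to be the answer labels.

```python
from typing import Any, Dict, List

# Required fields and their expected Python types, taken from the format above.
REQUIRED_FIELDS: Dict[str, type] = {
    "images": list,
    "question": str,
    "options": dict,
    "answer_label": str,
    "answer": str,
    "dataset_name": str,
    "dataset_index": int,
}


def validate_sample(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in sample:
            problems.append(f"missing field: {name}")
        elif not isinstance(sample[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    # "reasoning" is optional: only check its type when a value is present.
    reasoning = sample.get("reasoning")
    if reasoning is not None and not isinstance(reasoning, str):
        problems.append("reasoning should be str")
    # Assumption: option keys are the answer labels (for example "A" to "D").
    options = sample.get("options")
    label = sample.get("answer_label")
    if isinstance(options, dict) and isinstance(label, str) and label not in options:
        problems.append(f"answer_label {label!r} is not a key of options")
    return problems


# A made-up record for illustration only (not taken from any dataset).
example = {
    "images": [],
    "question": "Which imaging modality is shown?",
    "options": {"A": "CT", "B": "MRI", "C": "X-ray", "D": "Ultrasound"},
    "answer_label": "B",
    "answer": "MRI",
    "dataset_name": "example",
    "dataset_index": 0,
}
print(validate_sample(example))  # prints []
```

Returning a list of messages rather than raising on the first error makes it easier to report every problem in a record at once.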
</details>
Provider: maas
Created: 2025-08-03



