
MedVLThinker-m23k-tokenized

Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/UCSC-VLAA/MedVLThinker-m23k-tokenized
Description:
Code: https://github.com/UCSC-VLAA/MedVLThinker

Project Page: https://ucsc-vlaa.github.io/MedVLThinker/

## 📊 Datasets

### Available Datasets

Our project provides several curated datasets for medical vision-language understanding and training:

| Dataset | Modality | Description | Download |
|---------|----------|-------------|----------|
| **MedVLThinker-m23k-tokenized** | Text-only | Tokenized version of the [m23k](https://github.com/UCSC-VLAA/m1) dataset | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-m23k-tokenized) |
| **MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized** | Image-Text | Tokenized PMC-VQA dataset with GPT-4o generated reasoning chains | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized) |
| **MedVLThinker-pmc_vqa** | Image-Text | Processed PMC-VQA dataset for medical visual question answering with RLVR | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-pmc_vqa) |
| **MedVLThinker-Eval** | Image-Text | Comprehensive evaluation dataset for medical VQA benchmarks | [🤗 HF](https://huggingface.co/datasets/UCSC-VLAA/MedVLThinker-Eval) |

### Dataset Usage

```python
from datasets import load_dataset

# Load evaluation dataset
eval_dataset = load_dataset("UCSC-VLAA/MedVLThinker-Eval")

# Load training dataset with reasoning
train_dataset = load_dataset("UCSC-VLAA/MedVLThinker-pmc_vqa-gpt_4o_reasoning-tokenized")

# Load PMC-VQA dataset
pmc_dataset = load_dataset("UCSC-VLAA/MedVLThinker-pmc_vqa")

# Load Medical23k tokenized dataset
m23k_dataset = load_dataset("UCSC-VLAA/MedVLThinker-m23k-tokenized")
```

<details><summary>Dataset details and preparation of your own</summary>

### Supported Datasets

Our framework supports evaluation on the following medical VQA datasets:

- **PMC-VQA**: PubMed Central Visual Question Answering
- **PathVQA**: Pathology Visual Question Answering
- **SLAKE**: Bilingual medical VQA dataset
- **VQA-RAD**: Radiology Visual Question Answering
- **MMMU Medical**: Medical subsets from MMMU benchmark
- **MedXpertQA**: Expert-level medical questions

### Data Format

All datasets follow a unified format:

```python
{
    "images": [PIL.Image],       # List of images
    "question": str,             # Question text
    "options": Dict[str, str],   # Multiple choice options
    "answer_label": str,         # Correct answer label (A, B, C, D)
    "answer": str,               # Full answer text
    "reasoning": str,            # Chain-of-thought reasoning (optional)
    "dataset_name": str,         # Source dataset name
    "dataset_index": int         # Unique sample identifier
}
```

</details>
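As an illustration of how a record in the unified format described above might be consumed, the following minimal sketch renders the `question` and `options` fields as a multiple-choice prompt and compares a predicted label with `answer_label`. The helper names `format_prompt` and `is_correct`, the prompt wording, and the sample record are all hypothetical; they are not taken from the MedVLThinker codebase, which may use a different template.

```python
from typing import Any, Dict


def format_prompt(sample: Dict[str, Any]) -> str:
    """Render the question and its lettered options as one prompt string.

    Hypothetical helper: the prompt wording is an assumption, not the
    project's actual template.
    """
    lines = [sample["question"]]
    # Sort by label so options always appear as A, B, C, D.
    for label, text in sorted(sample["options"].items()):
        lines.append(f"{label}. {text}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def is_correct(sample: Dict[str, Any], predicted_label: str) -> bool:
    """Compare a predicted label with `answer_label`, ignoring case and spaces."""
    return predicted_label.strip().upper() == sample["answer_label"].strip().upper()


# Made-up record that follows the documented schema (not real dataset content).
sample = {
    "images": [],  # would hold PIL.Image objects for image-text datasets
    "question": "Which imaging modality is shown?",
    "options": {"A": "CT", "B": "MRI", "C": "X-ray", "D": "Ultrasound"},
    "answer_label": "C",
    "answer": "X-ray",
    "dataset_name": "example",
    "dataset_index": 0,
}

prompt = format_prompt(sample)
print(prompt)
print(is_correct(sample, "c"))
```

Normalizing the label before comparison keeps the check tolerant of stray whitespace or lowercase output from a model. The optional `reasoning` field is left out of the sample because it is not required by the schema.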

Provider:
maas
Created:
2025-08-03