KerwinFu/M3LLM-PMC
Source: Hugging Face | Updated: 2025-11-21 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/KerwinFu/M3LLM-PMC
Resource description:
---
license: mit
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- medical
- multimodal
- vision-language
- PMC
- medical-vqa
size_categories:
- 100K<n<1M
---
# M3LLM-PMC Training Data
This dataset contains the training data for [M3LLM (Medical Multimodal Large Language Model)](https://github.com/franciszchen/M3LLM), comprising ~238K high-quality synthetic medical instruction-following samples.
## Dataset Description
The data is generated from PubMed Central (PMC) medical literature through a comprehensive 5-stage synthetic data pipeline, covering six diverse medical visual question answering tasks.
### Dataset Statistics
| File | Samples | Task Type | Description |
|------|---------|-----------|-------------|
| `puretext.jsonl` | 40,382 | Pure Text QA | Text-only medical question answering |
| `boundingboxVQA.jsonl` | 40,293 | Spatial Reasoning | Questions about spatial relationships using bounding boxes |
| `single_subimage.jsonl` | 40,287 | Single Image QA | Reasoning about individual sub-images |
| `multi_subimage.jsonl` | 39,462 | Multi-Image QA | Reasoning across multiple sub-images |
| `subimage_option.jsonl` | 40,295 | Multiple Choice | Four-choice questions about medical images |
| `compound_image.jsonl` | 37,029 | Compound Figure | Understanding complex compound medical figures |
| **Total** | **~238K** | **6 Tasks** | **Comprehensive medical VQA coverage** |
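
As a quick sanity check on the rounded total, the per-file counts from the table can be summed directly. This is only arithmetic on the numbers listed above, not a query against the hosted files.

```python
# Per-file sample counts copied from the statistics table above.
SAMPLE_COUNTS = {
    "puretext.jsonl": 40_382,
    "boundingboxVQA.jsonl": 40_293,
    "single_subimage.jsonl": 40_287,
    "multi_subimage.jsonl": 39_462,
    "subimage_option.jsonl": 40_295,
    "compound_image.jsonl": 37_029,
}

total = sum(SAMPLE_COUNTS.values())
print(f"{total:,} samples across {len(SAMPLE_COUNTS)} files")
# prints: 237,748 samples across 6 files
```

The exact sum, 237,748, rounds to the ~238K quoted in the table.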
## Data Format
Each JSONL file contains one JSON object per line with the following structure:
```json
{
"image": "path/to/image.jpg",
"caption": "Original image caption",
"qa_pairs": [
{
"question": "Medical question about the image",
"answer": "Detailed medical answer",
"context": "Additional context (task-dependent)",
"improved context": "Refined context without answer leakage"
}
]
}
```
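
The nested structure above can be flattened into one row per question with the standard `json` module. The sketch below is a minimal illustration: the sample record reuses the placeholder values from the schema, and the helper name `flatten_record` and its fallback from `improved context` to `context` are assumptions made for this example, not part of the dataset tooling.

```python
import json

# Hypothetical record following the schema above; the values are placeholders.
sample_line = json.dumps({
    "image": "path/to/image.jpg",
    "caption": "Original image caption",
    "qa_pairs": [
        {
            "question": "Medical question about the image",
            "answer": "Detailed medical answer",
            "context": "Additional context (task-dependent)",
            "improved context": "Refined context without answer leakage",
        }
    ],
})


def flatten_record(line):
    """Turn one JSONL line into a list of flat dicts, one per QA pair."""
    record = json.loads(line)
    rows = []
    for qa in record.get("qa_pairs", []):
        rows.append({
            "image": record.get("image"),
            "question": qa["question"],
            "answer": qa["answer"],
            # The key contains a space; fall back to the raw context if it is absent.
            "context": qa.get("improved context", qa.get("context")),
        })
    return rows


rows = flatten_record(sample_line)
print(rows[0]["question"])
```

To process a downloaded file, the same helper can be applied to each line read from an opened `.jsonl` file.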
## Usage
### Loading with Datasets Library
```python
from datasets import load_dataset
# Load entire dataset
dataset = load_dataset("KerwinFu/M3LLM-PMC")
# Load specific task
puretext_data = load_dataset("KerwinFu/M3LLM-PMC", data_files="puretext.jsonl")
```
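
When several tasks are needed, it may be convenient to load each file on its own rather than the whole repository at once. The sketch below assumes the file names from the statistics table; the short task keys and the `load_task` helper are hypothetical names introduced here. It uses the `data_files`, `split`, and `streaming` arguments of `load_dataset`.

```python
# File names come from the statistics table; the short task keys are made up here.
TASK_FILES = {
    "puretext": "puretext.jsonl",
    "bounding_box": "boundingboxVQA.jsonl",
    "single_subimage": "single_subimage.jsonl",
    "multi_subimage": "multi_subimage.jsonl",
    "subimage_option": "subimage_option.jsonl",
    "compound_image": "compound_image.jsonl",
}


def load_task(task, streaming=False):
    """Load one task file as a `train` split (requires the `datasets` package)."""
    # Imported lazily so the mapping above is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "KerwinFu/M3LLM-PMC",
        data_files=TASK_FILES[task],
        split="train",
        streaming=streaming,
    )
```

For example, `load_task("puretext", streaming=True)` would iterate over records without downloading the full file first.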
### Manual Download
```bash
# Download all files
git clone https://huggingface.co/datasets/KerwinFu/M3LLM-PMC
# Or download specific files
wget https://huggingface.co/datasets/KerwinFu/M3LLM-PMC/resolve/main/puretext.jsonl
```
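
The `wget` command above follows the Hub's `resolve/<revision>/<filename>` URL layout, so download links for the other files can be built the same way. In this sketch the helper name `resolve_url` and its `endpoint` parameter are illustrative assumptions. Pointing `endpoint` at a mirror only works if that mirror uses the same layout.

```python
from urllib.parse import quote

REPO_ID = "KerwinFu/M3LLM-PMC"


def resolve_url(filename, revision="main", endpoint="https://huggingface.co"):
    """Build the direct-download URL for one file in the dataset repository."""
    return f"{endpoint}/datasets/{REPO_ID}/resolve/{revision}/{quote(filename)}"


url = resolve_url("puretext.jsonl")
print(url)
# prints: https://huggingface.co/datasets/KerwinFu/M3LLM-PMC/resolve/main/puretext.jsonl
```

The resulting URL can be passed to `wget` as above or to `urllib.request.urlretrieve` from Python.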
## Data Generation Pipeline
The data is synthesized through a 5-stage pipeline:
1. **Stages 1-3**: Preprocessing
   - Inline text summarization
   - Medical knowledge extraction
   - Visual perception enhancement
2. **Stage 4**: Task-specific QA generation
   - Six specialized scripts, one per medical VQA task
   - Uses Qwen2.5-32B-Instruct for generation
3. **Stage 5**: Context refinement
   - Removes answer-revealing information from the context
   - Prevents answer leakage
For detailed pipeline documentation, see the [M3LLM repository](https://github.com/franciszchen/M3LLM/tree/main/Instruction_data_generation).
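
To make the idea of answer leakage in Stage 5 concrete, the sketch below flags a context that contains the answer verbatim. It is a naive substring check written for illustration only. It is not the refinement method used by the M3LLM pipeline, and the example strings are invented.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so comparisons are less brittle."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def leaks_answer(context, answer):
    """Return True if the normalized answer appears verbatim inside the context."""
    norm_answer = normalize(answer)
    return bool(norm_answer) and norm_answer in normalize(context)


# Invented strings, for illustration only.
raw_context = "The CT scan shows a mass; the diagnosis is hepatocellular carcinoma."
refined_context = "The CT scan shows a mass in the liver."
answer = "Hepatocellular carcinoma"

print(leaks_answer(raw_context, answer))      # prints: True
print(leaks_answer(refined_context, answer))  # prints: False
```

A check like this only catches literal restatements. It misses paraphrased answers, and because it matches substrings it can also fire on short answers that happen to occur inside longer words.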
## Model Training
This dataset is used to fine-tune [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) to create M3LLM.
**Training configurations**:
- LoRA fine-tuning: Rank 16, frozen vision backbone
- Full model fine-tuning: Trainable LLM + MLP, frozen vision backbone
See [training documentation](https://github.com/franciszchen/M3LLM/tree/main/InternVL) for details.
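
To give a sense of scale for the rank-16 LoRA configuration, the sketch below counts the parameters a single LoRA adapter adds to one linear layer, using the standard low-rank factorization (two matrices of shapes `rank x d_in` and `d_out x rank`). The 4096 x 4096 layer size is a hypothetical value chosen for illustration and is not taken from InternVL3-8B.

```python
def lora_added_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # hypothetical layer size, not the actual model dimension
rank = 16            # rank from the LoRA configuration above

full = d_in * d_out
added = lora_added_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {added:,}  ratio: {added / full:.2%}")
# prints: full: 16,777,216  lora: 131,072  ratio: 0.78%
```

Under these assumptions the adapter trains under 1% of the weights of the layer it wraps.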
## Citation
If you use this dataset, please cite:
```bibtex
@article{m3llm2024,
title={M3LLM: Medical Multimodal Large Language Model},
author={[Your Name and Collaborators]},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2024}
}
```
## License
This dataset is released under the MIT License. Please also cite the original PMC sources when using this data.
## Acknowledgments
- [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) for source medical literature
- [InternVL](https://github.com/OpenGVLab/InternVL) for the base model
- [Qwen2.5](https://huggingface.co/Qwen) for synthetic data generation
## Contact
For questions or issues, please open an issue on the [M3LLM GitHub repository](https://github.com/franciszchen/M3LLM/issues).
Provider:
KerwinFu



