KerwinFu/M3LLM-PMC

Name: KerwinFu/M3LLM-PMC
Creator: KerwinFu
Published: 2025-11-21 21:31:07
License: MIT (per the dataset card)

Source: Hugging Face. Updated 2025-11-21; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/KerwinFu/M3LLM-PMC


Resource description:

---
license: mit
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- medical
- multimodal
- vision-language
- PMC
- medical-vqa
size_categories:
- 100K<n<1M
---

# M3LLM-PMC Training Data

This dataset contains the training data for [M3LLM (Medical Multimodal Large Language Model)](https://github.com/franciszchen/M3LLM), comprising ~238K high-quality synthetic medical instruction-following samples.

## Dataset Description

The data is generated from PubMed Central (PMC) medical literature through a comprehensive 5-stage synthetic data pipeline, covering six diverse medical visual question answering tasks.

### Dataset Statistics

| File | Samples | Task Type | Description |
|------|---------|-----------|-------------|
| `puretext.jsonl` | 40,382 | Pure Text QA | Text-only medical question answering |
| `boundingboxVQA.jsonl` | 40,293 | Spatial Reasoning | Questions about spatial relationships using bounding boxes |
| `single_subimage.jsonl` | 40,287 | Single Image QA | Reasoning about individual sub-images |
| `multi_subimage.jsonl` | 39,462 | Multi-Image QA | Reasoning across multiple sub-images |
| `subimage_option.jsonl` | 40,295 | Multiple Choice | Four-choice questions about medical images |
| `compound_image.jsonl` | 37,029 | Compound Figure | Understanding complex compound medical figures |
| **Total** | **~238K** | **6 Tasks** | **Comprehensive medical VQA coverage** |

## Data Format

Each JSONL file contains one JSON object per line with the following structure:

```json
{
  "image": "path/to/image.jpg",
  "caption": "Original image caption",
  "qa_pairs": [
    {
      "question": "Medical question about the image",
      "answer": "Detailed medical answer",
      "context": "Additional context (task-dependent)",
      "improved context": "Refined context without answer leakage"
    }
  ]
}
```

## Usage

### Loading with Datasets Library

```python
from datasets import load_dataset

# Load entire dataset
dataset = load_dataset("KerwinFu/M3LLM-PMC")

# Load specific task
puretext_data = load_dataset("KerwinFu/M3LLM-PMC", data_files="puretext.jsonl")
```

### Manual Download

```bash
# Download all files
git clone https://huggingface.co/datasets/KerwinFu/M3LLM-PMC

# Or download specific files
wget https://huggingface.co/datasets/KerwinFu/M3LLM-PMC/resolve/main/puretext.jsonl
```

## Data Generation Pipeline

The data is synthesized through a 5-stage pipeline:

1. **Stage 1-3**: Preprocessing
   - Inline text summarization
   - Medical knowledge extraction
   - Visual perception enhancement
2. **Stage 4**: Task-specific QA generation
   - Six specialized scripts for different medical VQA tasks
   - Uses Qwen2.5-32B-Instruct for high-quality generation
3. **Stage 5**: Context refinement
   - Removes answer-revealing information
   - Ensures data quality and prevents leakage

For detailed pipeline documentation, see the [M3LLM repository](https://github.com/franciszchen/M3LLM/tree/main/Instruction_data_generation).

## Model Training

This dataset is used to finetune [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) to create M3LLM.

**Training configurations**:
- LoRA finetuning: Rank 16, frozen vision backbone
- Full model finetuning: Trainable LLM + MLP, frozen vision backbone

See [training documentation](https://github.com/franciszchen/M3LLM/tree/main/InternVL) for details.

## Citation

If you use this dataset, please cite:

```bibtex
@article{m3llm2024,
  title={M3LLM: Medical Multimodal Large Language Model},
  author={[Your Name and Collaborators]},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```

## License

This dataset is released under the MIT License. Please also cite the original PMC sources when using this data.

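
As a companion to the record layout under Data Format, the following is a minimal sketch of reading a manually downloaded JSONL file with only the Python standard library and flattening each record into one row per question-answer pair. The helper name `iter_qa_rows` and the sample record are hypothetical. Preferring `"improved context"` over `"context"` is an assumption made here for illustration, as is tolerating missing keys (for example, a text-only record without an `image`); the card does not specify either behavior.

```python
import json
from typing import Iterable, Iterator


def iter_qa_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat dict per QA pair from JSONL lines (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for qa in record.get("qa_pairs", []):
            yield {
                # .get() returns None when a key is absent (assumed possible)
                "image": record.get("image"),
                "caption": record.get("caption"),
                "question": qa.get("question"),
                "answer": qa.get("answer"),
                # Assumption: use the refined context when present,
                # otherwise fall back to the raw context.
                "context": qa.get("improved context", qa.get("context")),
            }


# Hypothetical record that follows the documented structure.
sample = json.dumps({
    "image": "path/to/image.jpg",
    "caption": "Original image caption",
    "qa_pairs": [
        {
            "question": "What does the arrow indicate?",
            "answer": "Detailed medical answer",
            "context": "Additional context",
            "improved context": "Refined context",
        }
    ],
})

rows = list(iter_qa_rows([sample]))
print(rows[0]["question"], "|", rows[0]["context"])
```

For a real file, the same generator can be fed an open file handle, for example `with open("puretext.jsonl", encoding="utf-8") as f: rows = list(iter_qa_rows(f))`.
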
## Acknowledgments

- [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) for source medical literature
- [InternVL](https://github.com/OpenGVLab/InternVL) for the base model
- [Qwen2.5](https://huggingface.co/Qwen) for synthetic data generation

## Contact

For questions or issues, please open an issue on the [M3LLM GitHub repository](https://github.com/franciszchen/M3LLM/issues).

