
KerwinFu/M3LLM-PMC

Source: Hugging Face. Updated 2025-11-21; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/KerwinFu/M3LLM-PMC
Resource overview:
---
license: mit
task_categories:
- visual-question-answering
- image-text-to-text
language:
- en
tags:
- medical
- multimodal
- vision-language
- PMC
- medical-vqa
size_categories:
- 100K<n<1M
---

# M3LLM-PMC Training Data

This dataset contains the training data for [M3LLM (Medical Multimodal Large Language Model)](https://github.com/franciszchen/M3LLM), comprising ~238K high-quality synthetic medical instruction-following samples.

## Dataset Description

The data is generated from PubMed Central (PMC) medical literature through a comprehensive 5-stage synthetic data pipeline, covering six diverse medical visual question answering tasks.

### Dataset Statistics

| File | Samples | Task Type | Description |
|------|---------|-----------|-------------|
| `puretext.jsonl` | 40,382 | Pure Text QA | Text-only medical question answering |
| `boundingboxVQA.jsonl` | 40,293 | Spatial Reasoning | Questions about spatial relationships using bounding boxes |
| `single_subimage.jsonl` | 40,287 | Single Image QA | Reasoning about individual sub-images |
| `multi_subimage.jsonl` | 39,462 | Multi-Image QA | Reasoning across multiple sub-images |
| `subimage_option.jsonl` | 40,295 | Multiple Choice | Four-choice questions about medical images |
| `compound_image.jsonl` | 37,029 | Compound Figure | Understanding complex compound medical figures |
| **Total** | **~238K** | **6 Tasks** | **Comprehensive medical VQA coverage** |

## Data Format

Each JSONL file contains one JSON object per line with the following structure:

```json
{
  "image": "path/to/image.jpg",
  "caption": "Original image caption",
  "qa_pairs": [
    {
      "question": "Medical question about the image",
      "answer": "Detailed medical answer",
      "context": "Additional context (task-dependent)",
      "improved context": "Refined context without answer leakage"
    }
  ]
}
```

## Usage

### Loading with Datasets Library

```python
from datasets import load_dataset

# Load entire dataset
dataset = load_dataset("KerwinFu/M3LLM-PMC")

# Load specific task
puretext_data = load_dataset("KerwinFu/M3LLM-PMC", data_files="puretext.jsonl")
```

### Manual Download

```bash
# Download all files
git clone https://huggingface.co/datasets/KerwinFu/M3LLM-PMC

# Or download specific files
wget https://huggingface.co/datasets/KerwinFu/M3LLM-PMC/resolve/main/puretext.jsonl
```

## Data Generation Pipeline

The data is synthesized through a 5-stage pipeline:

1. **Stage 1-3**: Preprocessing
   - Inline text summarization
   - Medical knowledge extraction
   - Visual perception enhancement

2. **Stage 4**: Task-specific QA generation
   - Six specialized scripts for different medical VQA tasks
   - Uses Qwen2.5-32B-Instruct for high-quality generation

3. **Stage 5**: Context refinement
   - Removes answer-revealing information
   - Ensures data quality and prevents leakage

For detailed pipeline documentation, see the [M3LLM repository](https://github.com/franciszchen/M3LLM/tree/main/Instruction_data_generation).

## Model Training

This dataset is used to finetune [InternVL3-8B](https://huggingface.co/OpenGVLab/InternVL3-8B) to create M3LLM.

**Training configurations**:
- LoRA finetuning: Rank 16, frozen vision backbone
- Full model finetuning: Trainable LLM + MLP, frozen vision backbone

See [training documentation](https://github.com/franciszchen/M3LLM/tree/main/InternVL) for details.

## Citation

If you use this dataset, please cite:

```bibtex
@article{m3llm2024,
  title={M3LLM: Medical Multimodal Large Language Model},
  author={[Your Name and Collaborators]},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```

## License

This dataset is released under the MIT License. Please also cite the original PMC sources when using this data.
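
## Example: Reading Records

The record layout described under Data Format can be walked with the standard library alone. The following is a minimal sketch, not part of the official tooling: the helper name `iter_qa_pairs` is hypothetical, and preferring `"improved context"` over `"context"` when both are present is an assumption made here for illustration. Fields are read with `.get()` because the source does not state which keys are present for every task.

```python
import json
from typing import Iterable, Iterator


def iter_qa_pairs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat dict per question/answer pair found in JSONL lines.

    Hypothetical helper: the field names follow the documented record
    layout, but the flattening itself is only an illustration.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for pair in record.get("qa_pairs", []):
            yield {
                "image": record.get("image"),
                "caption": record.get("caption"),
                "question": pair.get("question"),
                "answer": pair.get("answer"),
                # Note the space in the documented key "improved context".
                # Falling back to "context" is an assumption of this sketch.
                "context": pair.get("improved context", pair.get("context")),
            }


# A made-up record in the documented shape, used instead of a real file.
sample_line = json.dumps({
    "image": "path/to/image.jpg",
    "caption": "Original image caption",
    "qa_pairs": [{
        "question": "Medical question about the image",
        "answer": "Detailed medical answer",
        "context": "Additional context (task-dependent)",
        "improved context": "Refined context without answer leakage",
    }],
})

rows = list(iter_qa_pairs([sample_line]))
```

To apply the same helper to a manually downloaded file, pass an open file handle, for example `iter_qa_pairs(open("puretext.jsonl", encoding="utf-8"))`, since iterating a text file yields one line at a time.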

## Acknowledgments

- [PMC Open Access Subset](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) for source medical literature
- [InternVL](https://github.com/OpenGVLab/InternVL) for the base model
- [Qwen2.5](https://huggingface.co/Qwen) for synthetic data generation

## Contact

For questions or issues, please open an issue on the [M3LLM GitHub repository](https://github.com/franciszchen/M3LLM/issues).
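
## Example: Checking the Sample Counts

The "~238K" total in the Dataset Statistics table is the rounded sum of the six per-file counts. The short sketch below reproduces that arithmetic; the counts are copied from the table, while the variable names and the per-file share calculation are illustrative additions rather than anything the dataset ships.

```python
# Per-file sample counts, copied from the Dataset Statistics table.
SAMPLE_COUNTS = {
    "puretext.jsonl": 40_382,
    "boundingboxVQA.jsonl": 40_293,
    "single_subimage.jsonl": 40_287,
    "multi_subimage.jsonl": 39_462,
    "subimage_option.jsonl": 40_295,
    "compound_image.jsonl": 37_029,
}

# Exact total across the six task files.
total = sum(SAMPLE_COUNTS.values())

# Fraction of the whole dataset contributed by each file.
shares = {name: count / total for name, count in SAMPLE_COUNTS.items()}

print(f"total samples: {total:,}")  # total samples: 237,748
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The six tasks are close to evenly balanced, with `compound_image.jsonl` the smallest file.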

Provider:
KerwinFu