mediflow

ModelScope community · updated 2026-01-09 · indexed 2025-06-07
Download link:
https://modelscope.cn/datasets/microsoft/mediflow
Resource overview:
# MediFlow

A large-scale synthetic instruction dataset of 2.5M rows (~700k unique instructions) for clinical natural language processing, covering 14 task types and 98 fine-grained types of input clinical documents.

## t-SNE 2D Plot of MediFlow Embeddings by Task Types

<img src="tsne_mediflow_v0_3_4_5_task.png" alt="TSNE plot of data by task type" style="display: block; margin-left: auto; margin-right: auto; width: 75%; max-width: 100%"/>

## Dataset Splits

- `mediflow`: 2.5M instruction data for SFT alignment.
- `mediflow_dpo`: ~135k top-quality instructions with GPT-4o generated `rejected_output` for DPO alignment.

## Main Columns

- `instruction`: instructions for the task at hand.
- `input`: input example on which to apply the task.
- `output`: output example of what we expect from applying the instructions on the input.
- `task_type`: one of the 14 task types related to natural language processing.
- `input_data`: type of input data.
- `output_format`: format of the output (`plain_text` or `json`).
- `difficulty_level`: one of the six difficulty levels, with emphasis on the three hardest levels.
- `rejected_output`: wrong output to reject with DPO (only in `mediflow_dpo`, else `''`).
- `error_type`: error type introduced in `output` to get `rejected_output` (only in `mediflow_dpo`, else `''`).

There are also LLM-as-a-Judge scores: `quality`, `alignment`, `coherence`, `realism`, and `difficulty`.

# Paper

[A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment](https://arxiv.org/abs/2505.10717)

# License

This dataset is licensed under CDLA 2.0.
# Citation

    @inproceedings{corbeil-etal-2025-modular,
        title = "A Modular Approach for Clinical {SLM}s Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment",
        author = "Corbeil, Jean-Philippe and Dada, Amin and Attendu, Jean-Michel and Ben Abacha, Asma and Sordoni, Alessandro and Caccia, Lucas and Beaulieu, Francois and Lin, Thomas and Kleesiek, Jens and Vozila, Paul",
        editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
        booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        month = jul,
        year = "2025",
        address = "Vienna, Austria",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2025.acl-long.950/",
        doi = "10.18653/v1/2025.acl-long.950",
        pages = "19352--19374",
        ISBN = "979-8-89176-251-0",
        abstract = "High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3{\%} on medical entities, 49.5{\%} on radiology reports, and 44{\%} on ICD-10 coding (outperforming GPT-4-0125 by 14{\%}). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9{\%} on average."
    }
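As a rough illustration of how the columns listed under Main Columns might feed SFT and DPO pipelines, the sketch below turns rows into prompt/completion pairs and prompt/chosen/rejected triples. It is not the authors' training code. The sample rows, the prompt template, the `wrong_entity` label, and the helper names (`build_prompt`, `to_sft`, `to_dpo`, `json_output_is_valid`) are all assumptions made for illustration; only the column names come from the dataset card.

```python
import json
from typing import Optional

# Hypothetical rows shaped like the columns in the dataset card.
# The values are invented for illustration and are not taken from MediFlow.
rows = [
    {
        "instruction": "Extract the medications mentioned in the note as JSON.",
        "input": "Patient started on metformin 500 mg twice daily.",
        "output": '{"medications": ["metformin"]}',
        "output_format": "json",
        "rejected_output": '{"medications": ["insulin"]}',
        "error_type": "wrong_entity",  # hypothetical label, not from the card
    },
    {
        "instruction": "Summarize the note in one sentence.",
        "input": "Patient reports mild headache for two days.",
        "output": "The patient has had a mild headache for two days.",
        "output_format": "plain_text",
        "rejected_output": "",  # empty outside the DPO split, per the card
        "error_type": "",
    },
]


def build_prompt(row: dict) -> str:
    """Join instruction and input into one prompt (the template is an assumption)."""
    return f"{row['instruction']}\n\n{row['input']}"


def to_sft(row: dict) -> dict:
    """Map a row to a prompt/completion pair for supervised fine-tuning."""
    return {"prompt": build_prompt(row), "completion": row["output"]}


def to_dpo(row: dict) -> Optional[dict]:
    """Map a row to a preference triple, or None when no rejected output exists."""
    if not row.get("rejected_output"):
        return None
    return {
        "prompt": build_prompt(row),
        "chosen": row["output"],
        "rejected": row["rejected_output"],
    }


def json_output_is_valid(row: dict) -> bool:
    """Check that rows declaring `json` as output_format carry parseable JSON."""
    if row["output_format"] != "json":
        return True
    try:
        json.loads(row["output"])
    except json.JSONDecodeError:
        return False
    return True


sft_examples = [to_sft(r) for r in rows]
dpo_examples = [t for t in (to_dpo(r) for r in rows) if t is not None]
```

Rows with an empty `rejected_output` are skipped for DPO, which matches the card's note that the column is only populated in `mediflow_dpo`. The prompt/chosen/rejected key names follow a common preference-pair convention and may need renaming for a given training library.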

Provider: maas
Created: 2025-07-22