SIMORD

Name: SIMORD
Creator: maas
Published: 2026-01-07 01:11:52
License: No description available

ModelScope Community. Updated: 2026-01-07. Indexed: 2025-12-06.

Download link:

https://modelscope.cn/datasets/microsoft/SIMORD

Resource description:

# Dataset Card: SIMORD (Simulated Medical Order Extraction Dataset)

## Description

Medical order extraction involves identifying and structuring various medical orders, such as medications, imaging studies, lab tests, and follow-ups, based on doctor-patient conversations. Previous efforts have focused on extracting entities and relations from clinical texts. This dataset seeks to encourage the development of effective solutions for improving clinical documentation, reducing the burden on providers, and ensuring critical patient information is accurately captured from long conversations. The input dialogues are sourced from a combination of existing conversational datasets (e.g., ACI-Bench [1], PriMock57 [2]), and the structured lists of medical orders are created by medical annotators.

## Dataset Summary

- **Name**: SIMORD
- **Full name / acronym**: SIMulated ORDer Extraction
- **Purpose / use case**: SIMORD is intended to support research in extracting structured medical orders (e.g. medication orders, lab orders) from doctor-patient consultation transcripts.
- **Version**: As released with the EMNLP industry track paper (2025)
- **License / usage terms**: CDLA-2.0-permissive
- **Contact / Maintainer**: jcorbeil@microsoft.com

## Building the dataset

### Method 1: HF datasets

1. Make sure you have `datasets` version 3.6.0 or lower; the dataset builder is not supported in more recent versions.
2. Clone `https://github.com/jpcorb20/mediqa-oe` and install its requirements.
3. Add `mediqa-oe` to the Python path: `PYTHONPATH=$PYTHONPATH:/mypath/to/mediqa_oe` (UNIX).
4. Run `load_dataset("microsoft/SIMORD", trust_remote_code=True)`, which merges the transcripts from the ACI-Bench and PriMock57 repos into the annotation files.

### Method 2: GitHub script

Follow the steps in `https://github.com/jpcorb20/mediqa-oe` to merge the transcripts from ACI-Bench and PriMock57 into the annotation files provided in the repo.

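The Method 1 steps can be sketched in Python as follows. This is a minimal sketch, not the maintainers' script: the helper names `parse_version`, `builder_supported`, and `load_simord` are hypothetical, and extending `sys.path` in-process is assumed to be an acceptable stand-in for exporting `PYTHONPATH`. Only the `load_dataset` call and the 3.6.0 version bound come from the card.

```python
import re
import sys
from typing import Optional, Tuple

MAX_DATASETS_VERSION = "3.6.0"  # upper bound stated in the dataset card


def parse_version(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components of a version string."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric suffixes such as "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def builder_supported(installed: str, maximum: str = MAX_DATASETS_VERSION) -> bool:
    """True if the installed `datasets` version is at or below the bound."""
    return parse_version(installed) <= parse_version(maximum)


def load_simord(repo_path: Optional[str] = None):
    """Hypothetical wrapper around the load call given in the card.

    `repo_path` is the local clone of mediqa-oe (step 2); adding it to
    sys.path is an assumption standing in for the PYTHONPATH export (step 3).
    """
    import datasets  # imported lazily so the helpers above work without it

    if not builder_supported(datasets.__version__):
        raise RuntimeError(
            f"datasets {datasets.__version__} is newer than "
            f"{MAX_DATASETS_VERSION}; the SIMORD builder may not be supported"
        )
    if repo_path is not None and repo_path not in sys.path:
        sys.path.insert(0, repo_path)
    return datasets.load_dataset("microsoft/SIMORD", trust_remote_code=True)
```

A call such as `load_simord("/mypath/to/mediqa_oe")` would then run the builder; it needs network access plus the cloned repository, so the version guard is the only part that can be checked offline.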
## Data Fields / Format

**Input fields**:

- **transcript** (dict of lists): the doctor-patient consultation transcript as a dict of three lists with the following keys:
  - `turn_id` (int): index of that turn.
  - `speaker` (str): speaker of that turn, *DOCTOR* or *PATIENT*.
  - `transcript` (str): text of that turn.

**Output fields**:

- A JSON (or list) of **expected orders**
- Each order object includes at least:
  - `order_type` (e.g. "medication", "lab")
  - `description` (string): the order text (e.g. "lasix 40 milligrams a day")
  - `reason` (string): the clinical reason or indication for the order
  - `provenance` (e.g. list of token indices or spans): mapping back to parts of the transcript

## Splits

- `train`: examples for in-context learning or fine-tuning.
- `test1`: test set used for the EMNLP 2025 industry track paper; previously named the `dev` set in the MEDIQA-OE shared task of ClinicalNLP 2025.
- `test2`: test set for the MEDIQA-OE shared task of ClinicalNLP 2025.

## Citation

If you use this dataset, please cite:

    @inproceedings{corbeil-etal-2025-empowering,
        title = "Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications",
        author = "Corbeil, Jean-Philippe and Ben Abacha, Asma and Michalopoulos, George and Swazinna, Phillip and Del-Agua, Miguel and Tremblay, Jerome and Daniel, Akila Jeeson and Bader, Cari and Cho, Kevin and Krishnan, Pooja and Bodenstab, Nathan and Lin, Thomas and Teng, Wenxuan and Beaulieu, Francois and Vozila, Paul",
        editor = "Potdar, Saloni and Rojas-Barahona, Lina and Montella, Sebastien",
        booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
        month = nov,
        year = "2025",
        address = "Suzhou (China)",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2025.emnlp-industry.58/",
        doi = "10.18653/v1/2025.emnlp-industry.58",
        pages = "859--870",
        ISBN = "979-8-89176-333-3"
    }

## References

[1] Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, Meliha Yetisgen. Nature Scientific Data, 10, 586 (2023).

[2] PriMock57: A Dataset Of Primary Care Mock Consultations. Alex Papadopoulos Korfiatis, Francesco Moramarco, Radmila Sarac, and Aleksandar Savkov. 2022. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 588–598, Dublin, Ireland.
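The column-oriented `transcript` layout and the order fields listed under Data Fields can be illustrated with a short sketch. Everything here is for illustration only: the helper names `transcript_to_turns` and `missing_order_keys` are hypothetical, the two-turn dialogue and the `reason` value are invented, and the `provenance` value is a placeholder because the card does not fix its exact encoding.

```python
from typing import Any, Dict, List

# Keys the card says every order object includes at least.
REQUIRED_ORDER_KEYS = {"order_type", "description", "reason", "provenance"}
TRANSCRIPT_KEYS = ("turn_id", "speaker", "transcript")


def transcript_to_turns(transcript: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Zip the three parallel lists into one dict per turn."""
    lengths = {len(transcript[key]) for key in TRANSCRIPT_KEYS}
    if len(lengths) != 1:
        raise ValueError("transcript lists must all have the same length")
    columns = (transcript[key] for key in TRANSCRIPT_KEYS)
    return [dict(zip(TRANSCRIPT_KEYS, values)) for values in zip(*columns)]


def missing_order_keys(order: Dict[str, Any]) -> List[str]:
    """Return the required order keys that are absent, sorted by name."""
    return sorted(REQUIRED_ORDER_KEYS - set(order))


# Invented toy record in the dict-of-lists layout described above.
example_transcript = {
    "turn_id": [0, 1],
    "speaker": ["PATIENT", "DOCTOR"],
    "transcript": [
        "my ankles have been swelling again.",
        "let's start lasix 40 milligrams a day.",
    ],
}

# Invented order object; only the description string comes from the card.
example_order = {
    "order_type": "medication",
    "description": "lasix 40 milligrams a day",
    "reason": "ankle swelling",  # illustrative value
    "provenance": [1],  # placeholder; encoding is an assumption
}

turns = transcript_to_turns(example_transcript)
```

Reshaping to one dict per turn makes it easier to render the dialogue as a prompt, while the key check gives a cheap sanity test for model outputs before scoring.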


Provider: maas

Created: 2025-10-09
