Medical Question Pairs (MQP) Dataset
OpenDataLab. Updated 2026-05-24; indexed 2024-05-09.
Download link:
https://opendatalab.org.cn/OpenDataLab/Medical_Question_Pairs
Description:
Medical Question Pairs (MQP) Dataset
This repository contains a dataset of 3048 similar and dissimilar medical question pairs manually generated and labeled by physicians at Curai. This dataset is detailed in our paper.
Methodology
We presented our physicians with a list of 1524 patient questions randomly sampled from a publicly available crawl of HealthTap. For each original question, the annotators produced one similar and one dissimilar pair by following these instructions:
1. Rewrite the original question in a different way while preserving the same intent. Reorganize grammar as much as possible and modify medical details that do not affect the intended response. For example, "I am a 22-year-old female" could be changed to "My 26-year-old daughter".
2. Propose a related but distinct question for which the answer to the original question would be incorrect or irrelevant. Use similar keywords.
The first instruction generates positive (similar) question pairs, while the second generates negative (dissimilar) pairs. With these instructions, we intentionally structured the task so that positive pairs look very different on surface-level metrics, while negative pairs look very similar. This ensures the task is non-trivial.
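The surface-level effect described above can be checked with a simple lexical overlap score. The following is a minimal sketch, assuming word-level Jaccard similarity as the surface metric; the source does not name a specific metric, and the function names `word_set` and `jaccard` are illustrative only.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-level Jaccard similarity: |A & B| / |A | B|, in [0, 1]."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 0.0  # convention chosen here for two empty inputs
    return len(words_a & words_b) / len(words_a | words_b)


# Toy check on plain tokens: shared {b, c} out of {a, b, c, d}.
score = jaccard("a b c", "b c d")
```

If the task was built as described, a score like this would tend to be low for pairs labeled 1 and high for pairs labeled 0, which is why a model that relies on keyword overlap alone should perform poorly.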
Dataset Format
The dataset follows the format: dr_id, question_1, question_2, label. We employed 11 distinct physicians for this task, so the dr_id ranges from 1 to 11. The label is 1 if the question pair is similar, otherwise 0.
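A minimal sketch of reading records in this format, assuming the data is stored as comma-separated text without a header row (the file layout is an assumption, not stated above). The two sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Column order as given in the format description.
FIELDS = ["dr_id", "question_1", "question_2", "label"]

# Made-up rows for illustration only; not real dataset entries.
SAMPLE = (
    '1,"Is fibroadenoma malignant?","Can a fibroadenoma be cancerous?",1\n'
    '1,"Is fibroadenoma malignant?","How is a fibroadenoma removed?",0\n'
)


def read_pairs(handle):
    """Yield one dict per question pair, validating dr_id and label."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        dr_id = int(row["dr_id"])
        label = int(row["label"])
        if not 1 <= dr_id <= 11:
            raise ValueError(f"dr_id out of range 1-11: {dr_id}")
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1: {label}")
        yield {
            "dr_id": dr_id,
            "question_1": row["question_1"],
            "question_2": row["question_2"],
            "label": label,
        }


pairs = list(read_pairs(io.StringIO(SAMPLE)))
```

To read from disk, the same function can be given an open file handle in place of the `io.StringIO` object.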
Dataset Statistics
The final dataset contains 4567 unique questions. The minimum, maximum, median, and mean token counts of these questions are 4, 81, 20, and 22.675 respectively, indicating reasonable variation in question length. The shortest question is "Is fibroadenoma malignant?"
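Length statistics of this kind could be computed roughly as follows. This is a sketch under an assumption: the tokenizer used for the reported numbers is not specified, so the regular expression below (words plus individual punctuation marks) is a stand-in. The sample questions are made up for illustration.

```python
import re
import statistics

# Assumed tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(question):
    """Split a question into word and punctuation tokens."""
    return TOKEN_RE.findall(question)


def length_stats(question_pairs):
    """Token-count statistics over the unique questions in all pairs."""
    unique = {q for pair in question_pairs for q in pair}
    lengths = [len(tokenize(q)) for q in unique]
    return {
        "unique": len(unique),
        "min": min(lengths),
        "max": max(lengths),
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
    }


# Made-up (question_1, question_2) pairs; not real dataset entries.
SAMPLE_PAIRS = [
    ("Is fibroadenoma malignant?", "Can a fibroadenoma be cancerous?"),
    ("Is fibroadenoma malignant?", "How is a fibroadenoma removed?"),
]
stats = length_stats(SAMPLE_PAIRS)
```

Under this tokenizer the shortest question quoted above has 4 tokens, which matches the reported minimum, but that agreement alone does not confirm it is the tokenizer the authors used.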
Off-the-shelf medical named entity recognizers detect approximately 1000 unique medical entities in the questions. Some of the most prominent entity mentions are: physician, pregnancy, pain, persistent for weeks, menstruation, emotional state, cancer, visual function, headache, bleeding, fever, sexual intercourse.
Provider:
OpenDataLab
Created:
2022-06-28
Collected Summary
Dataset Introduction

Background and Challenges
Background Overview
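Medical Question Pairs (MQP) contains 3048 similar and dissimilar medical question pairs manually generated by physicians at Curai, intended for identifying question similarity. Each record holds a question pair and a label (1 for similar, 0 for dissimilar). The dataset covers 4567 unique questions with an average length of about 22.7 tokens and roughly 1000 medical entities. It was released in 2020.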
The summary above was collected and generated by 遇见数据集.



