curaihealth/medical_questions_pairs

Name: curaihealth/medical_questions_pairs
Creator: curaihealth
Published: 2024-01-04 14:27:42
License: not specified

Source: Hugging Face. Updated 2024-01-04. Indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/curaihealth/medical_questions_pairs

Resource overview:

---
annotations_creators:
- expert-generated
language_creators:
- other
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- semantic-similarity-classification
pretty_name: MedicalQuestionsPairs
dataset_info:
  features:
  - name: dr_id
    dtype: int32
  - name: question_1
    dtype: string
  - name: question_2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': 0
          '1': 1
  splits:
  - name: train
    num_bytes: 701642
    num_examples: 3048
  download_size: 313704
  dataset_size: 701642
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for [medical_questions_pairs]

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Repository:** [Medical questions pairs repository](https://github.com/curai/medical-question-pair-dataset)
- **Paper:** [Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs](https://arxiv.org/abs/2008.13546)

### Dataset Summary

This dataset consists of 3048 similar and dissimilar medical question pairs hand-generated and labeled by Curai's doctors. The doctors were given a list of 1524 patient-asked questions randomly sampled from the publicly available crawl of [HealthTap](https://github.com/durakkerem/Medical-Question-Answer-Datasets). Each question results in one similar and one different pair through the following instructions provided to the labelers:

- Rewrite the original question in a different way while maintaining the same intent. Restructure the syntax as much as possible and change medical details that would not impact your response, e.g. "I'm a 22-y-o female" could become "My 26 year old daughter".
- Come up with a related but dissimilar question for which the answer to the original question would be WRONG OR IRRELEVANT. Use similar key words.

The first instruction generates a positive question pair (similar) and the second generates a negative question pair (different). The task was intentionally framed so that positive question pairs can look very different by superficial metrics, and negative question pairs can conversely look very similar. This ensures that the task is not trivial.

### Supported Tasks and Leaderboards

- `text-classification`: The dataset can be used to train a model to identify similar and non-similar medical question pairs.

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

Each instance contains `dr_id`, `question_1`, `question_2`, and `label`. Eleven doctors took part in the task, so `dr_id` ranges from 1 to 11. The label is 1 if the question pair is similar and 0 otherwise.

### Data Fields

- `dr_id`: ID of the doctor who wrote the pair; 11 doctors took part, so the value ranges from 1 to 11.
- `question_1`: The original question.
- `question_2`: The doctor-written question: a rewrite that keeps the intent of the original (label 1) or a related but dissimilar question (label 0).
- `label`: 1 if the question pair is similar and 0 otherwise.

### Data Splits

The dataset currently consists of a single split (train), which can be split further as required.

|                            | train |
|----------------------------|------:|
| Non similar Question Pairs |  1524 |
| Similar Question Pairs     |  1524 |

## Dataset Creation

The pairs were created by Curai's doctors from 1524 HealthTap questions, following the two labeler instructions listed in the [Dataset Summary](#dataset-summary).

### Curation Rationale

[More Information Needed]

### Source Data

1524 patient-asked questions randomly sampled from the publicly available crawl of [HealthTap](https://github.com/durakkerem/Medical-Question-Answer-Datasets).

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

The annotation process is the pair-writing procedure described in the [Dataset Summary](#dataset-summary): for each original question, a doctor wrote one rewrite with the same intent (positive pair) and one related but dissimilar question (negative pair).

#### Who are the annotators?

**Curai's doctors**

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

[More Information Needed]

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

[More Information Needed]

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@misc{mccreery2020effective,
  title={Effective Transfer Learning for Identifying Similar Questions: Matching User Questions to COVID-19 FAQs},
  author={Clara H. McCreery and Namit Katariya and Anitha Kannan and Manish Chablani and Xavier Amatriain},
  year={2020},
  eprint={2008.13546},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
```

### Contributions

Thanks to [@tuner007](https://github.com/tuner007) for adding this dataset.

Provider:

curaihealth

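Summary of the original information

Dataset overview

Dataset description

Dataset summary

The dataset contains 3048 similar and dissimilar medical question pairs, hand-generated and labeled by Curai's doctors. The doctors were given 1524 patient-asked questions randomly sampled from the publicly available HealthTap crawl. Each question yields one similar pair and one dissimilar pair through the following instructions:

- Rewrite the original question in a different way while maintaining the same intent. Restructure the syntax as much as possible and change medical details that would not affect the response. For example, "I'm a 22-y-o female" could become "My 26 year old daughter".
- Come up with a related but dissimilar question for which the answer to the original question would be wrong or irrelevant. Use similar key words.

The first instruction produces a positive (similar) pair and the second a negative (different) pair. The instructions deliberately frame the task so that positive pairs can look very different by superficial metrics while negative pairs can look very similar, which keeps the task from being trivial.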
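To make the point about superficial metrics concrete, here is a minimal sketch that scores pairs by token-level Jaccard overlap. The three questions are hypothetical examples written for this illustration, not rows from the dataset, and Jaccard overlap is only one possible surface metric.

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase a question and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two questions."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical questions for illustration only; NOT taken from the dataset.
original = "Can I take ibuprofen for a headache while pregnant?"
similar = "Is it safe for an expecting mother to use Advil when her head hurts?"
dissimilar = "Can I take ibuprofen for a headache while breastfeeding?"

sim_score = jaccard(original, similar)      # same intent, few shared words
dis_score = jaccard(original, dissimilar)   # different intent, many shared words
print(f"similar pair (label 1):    {sim_score:.2f}")
print(f"dissimilar pair (label 0): {dis_score:.2f}")
```

In this toy case the dissimilar pair scores far higher than the similar one, so a threshold on lexical overlap would label both pairs incorrectly.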

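Supported tasks and leaderboards

text-classification: the dataset can be used to train a model to identify similar and dissimilar medical question pairs.

Languages

The text in the dataset is in English.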

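Dataset structure

Data instances

The dataset contains dr_id, question_1, question_2, and label. Eleven doctors took part in the task, so dr_id ranges from 1 to 11. The label is 1 if the question pair is similar and 0 otherwise.

Data fields

dr_id: ID of one of the 11 doctors who took part in the task, ranging from 1 to 11.
question_1: the original question.
question_2: the rewritten question, keeping the same intent as the original question.
label: 1 if the question pair is similar, 0 otherwise.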
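The field constraints above can be written down as a small validated record type. This is a sketch under the documented ranges only; the class name `QuestionPair` and the helper `parse_row` are hypothetical names introduced here, and the sample row is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionPair:
    """One row of the dataset, checked against the documented field ranges."""

    dr_id: int        # doctor ID, documented range 1..11
    question_1: str   # original question
    question_2: str   # doctor-written question
    label: int        # 1 = similar, 0 = not similar

    def __post_init__(self) -> None:
        if not 1 <= self.dr_id <= 11:
            raise ValueError(f"dr_id out of range 1..11: {self.dr_id}")
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1: {self.label}")
        if not self.question_1.strip() or not self.question_2.strip():
            raise ValueError("questions must be non-empty")


def parse_row(row: dict) -> QuestionPair:
    """Convert a plain dict (e.g. one loaded record) into a QuestionPair."""
    return QuestionPair(
        dr_id=int(row["dr_id"]),
        question_1=str(row["question_1"]),
        question_2=str(row["question_2"]),
        label=int(row["label"]),
    )


# Invented row for illustration; not taken from the dataset.
pair = parse_row({
    "dr_id": 3,
    "question_1": "How long does a cold usually last?",
    "question_2": "What is the typical duration of a common cold?",
    "label": 1,
})
print(pair.label, pair.dr_id)
```

Validating rows once at load time keeps malformed records out of later training and evaluation code.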

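Data splits

The dataset currently contains a single train split, which can be split further as required.

	train
Dissimilar question pairs	1524
Similar question pairs	1524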
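Since only a train split is provided, one way to carve out validation and test sets while keeping the 1524/1524 label balance is a stratified split. The sketch below uses only the standard library; the function name `stratified_split`, the 80/10/10 fractions, and the placeholder rows are assumptions for illustration. It stratifies on `label` only and does not keep rows that share an original question together.

```python
import random
from collections import defaultdict


def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/validation/test, preserving the label ratio."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits


# Placeholder rows with the documented class counts; the text is synthetic.
rows = [
    {"question_1": f"q{i}", "question_2": f"r{i}", "label": i % 2}
    for i in range(3048)
]
splits = stratified_split(rows)
for name, part in splits.items():
    positives = sum(r["label"] for r in part)
    print(name, len(part), positives)
```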

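Dataset creation

The doctors were given 1524 patient-asked questions randomly sampled from the publicly available HealthTap crawl. Each question yields one similar pair and one dissimilar pair through the following instructions:

- Rewrite the original question in a different way while maintaining the same intent. Restructure the syntax as much as possible and change medical details that would not affect the response. For example, "I'm a 22-y-o female" could become "My 26 year old daughter".
- Come up with a related but dissimilar question for which the answer to the original question would be wrong or irrelevant. Use similar key words.

Annotation process

The annotation process is the creation procedure above: the two instructions were given to the labelers for each sampled question.

Annotators

Curai's doctors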

Dataset information

annotations_creators: expert-generated
language_creators: other
language: en
license: unknown
multilinguality: monolingual
size_categories: 1K<n<10K
source_datasets: original
task_categories: text-classification
task_ids: semantic-similarity-classification
pretty_name: MedicalQuestionsPairs
dataset_info:
  features:
  - name: dr_id
    dtype: int32
  - name: question_1
    dtype: string
  - name: question_2
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': 0
          '1': 1
  splits:
  - name: train
    num_bytes: 701642
    num_examples: 3048
  download_size: 313704
  dataset_size: 701642
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
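The figures in this metadata can be cross-checked against the counts stated elsewhere on the page. The short sketch below copies the numbers from the listing; the variable names are illustrative, and the derived averages are only rough descriptive figures.

```python
# Values copied from the metadata and the data-split table on this page.
metadata = {
    "num_examples": 3048,
    "num_bytes": 701642,
    "download_size": 313704,
    "dataset_size": 701642,
}
class_counts = {"similar": 1524, "dissimilar": 1524}
source_questions = 1524  # original HealthTap questions

# Each source question yields one similar and one dissimilar pair.
total_pairs = sum(class_counts.values())
assert total_pairs == metadata["num_examples"] == 2 * source_questions

avg_bytes = metadata["num_bytes"] / metadata["num_examples"]
size_ratio = metadata["download_size"] / metadata["dataset_size"]
print(f"average bytes per pair: {avg_bytes:.1f}")
print(f"download size / dataset size: {size_ratio:.2f}")
```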

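Collected summary

Dataset introduction

Construction

In medical natural language processing, high-quality semantic similarity datasets are essential for improving the accuracy of medical question answering systems. This dataset starts from 1524 patient questions randomly sampled from the public HealthTap crawl and annotated by Curai's medical experts. Following specific instructions, the doctors wrote one semantically similar and one semantically dissimilar question for each original question. The similar question rewrites the original while keeping its intent, changing the syntax and any medical details that do not affect the answer as much as possible. The dissimilar question is related and keeps the key words, but the answer to the original would be wrong or irrelevant for it. The result is a balanced dataset of 3048 question pairs.

Characteristics

The dataset is strongly domain-specific for a medical text similarity task. It was annotated entirely by doctors, which supports the accuracy and authority of the semantic judgments in a medical context. The construction was designed so that semantically similar pairs can differ markedly in surface vocabulary and syntax, while semantically dissimilar pairs can share many key words. This prevents simple matching on shallow features and makes the task more challenging and more useful. The dataset is moderate in size, with 3048 samples and an even split between positive and negative examples, which gives model training a stable basis. All text is in English and focuses on natural language understanding in the medical and health domain.

Usage

The dataset is mainly used to train and evaluate models for semantic similarity classification of medical text. The `question_1` and `question_2` fields form the input text pair, and the `label` field is the binary target (1 for similar, 0 for dissimilar) for supervised learning. Because the dataset currently provides only a train split, users need to create their own validation and test sets for model tuning and performance evaluation. The dataset is suitable for training or fine-tuning pretrained models such as BERT to improve intent matching for medical questions, with applications in medical question answering, triage of patient inquiries, and FAQ matching.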
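When creating held-out sets, note that the construction implies each original question appears as `question_1` in two rows (one similar pair, one dissimilar pair). A minimal sketch of a group-aware split that keeps both rows on the same side is shown below; it assumes that two-rows-per-question structure, and the function name `group_split`, the 20% test fraction, and the synthetic rows are illustrative choices.

```python
import random
from collections import defaultdict


def group_split(rows, test_fraction=0.2, seed=0):
    """Split rows so that all pairs sharing a question_1 land in the same set."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["question_1"]].append(row)

    keys = sorted(groups)                  # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    n_test = int(len(keys) * test_fraction)

    test = [r for k in keys[:n_test] for r in groups[k]]
    train = [r for k in keys[n_test:] for r in groups[k]]
    return train, test


# Synthetic stand-in: 1524 original questions, each with one similar (1)
# and one dissimilar (0) pair, mirroring the documented construction.
rows = []
for i in range(1524):
    rows.append({"question_1": f"original {i}", "question_2": f"rewrite {i}", "label": 1})
    rows.append({"question_1": f"original {i}", "question_2": f"distractor {i}", "label": 0})

train, test = group_split(rows)
train_questions = {r["question_1"] for r in train}
test_questions = {r["question_1"] for r in test}
print(len(train), len(test), len(train_questions & test_questions))
```

Splitting by group avoids a model seeing an original question during training and then being evaluated on the other pair built from the same question. Because every group holds one row per label, the label balance is preserved as a side effect.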

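Background and challenges

Background overview

In medical natural language processing, accurately identifying semantically similar medical questions is essential for improving intelligent consultation systems. The curaihealth/medical_questions_pairs dataset was created by Curai in 2020, and its core research focus is semantic similarity classification of medical question pairs. It starts from 1524 patient questions in the public HealthTap crawl, annotated by 11 medical experts to produce 3048 similar and dissimilar medical question pairs. The resource was built to advance transfer learning for medical text matching, with notable influence in scenarios such as COVID-19 FAQ matching, and it provides a high-quality benchmark for semantic understanding research in medical AI.

Current challenges

The dataset targets the core difficulty of judging semantic similarity between medical questions: classifying correctly when surface vocabulary overlaps heavily but the intent differs, or when the syntax differs markedly but the meaning is the same. The main construction challenge lies in annotation. Medical experts must follow strict instructions, restructuring the syntax and adjusting non-critical medical details to create positive samples, and designing negative samples that share key words but for which the original answer is invalid. This requires deep medical knowledge to balance semantic consistency against surface difference and keep the task non-trivial, while avoiding domain bias or distorted information.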

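Common scenarios

Classic use case

In medical natural language processing, the dataset is a classic example for semantic similarity classification. From the expert-generated medical question pairs, a model can learn to recognize questions that are phrased differently but share the same intent, as well as questions that look alike on the surface but differ in substance. This design forces the model to understand the meaning of medical text rather than rely on shallow lexical matching, which improves the accuracy and robustness of question matching in medical question answering systems.

Derived and related work

Classic work derived from the dataset includes the study on COVID-19 FAQ matching, which used an effective transfer learning strategy. Later research further explored adapting pretrained language models to the medical domain, such as fine-tuning and evaluating BioBERT and ClinicalBERT on this dataset. This work confirmed the dataset's value for cross-domain semantic transfer and helped advance standardized evaluation frameworks for medical text similarity and automatic question answering.

Recent research on the dataset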