lavita/medical-qa-datasets

Name: lavita/medical-qa-datasets
Creator: lavita
Published: 2023-11-17 20:49:51
License: No description available

Source: Hugging Face. Updated 2023-11-17. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/lavita/medical-qa-datasets

Resource description:

---
language:
- en
task_categories:
- question-answering
tags:
- medical
- healthcare
- clinical
---

The card metadata also defines `dataset_info` (features, split sizes, download size, and dataset size) and `configs` (data files) for each configuration; these are listed per configuration below. Every data file path follows the pattern `<config_name>/<split>-*`, for example `all-processed/train-*`.

* The `all-processed` dataset is a concatenation of the `medical_meadow_*` and `chatdoctor_healthcaremagic` datasets.
* The term `Chat Doctor` is replaced by `chatbot` in the `chatdoctor_healthcaremagic` dataset.
* Following the literature, the `medical_meadow_cord19` dataset is subsampled to 50,000 samples. This presumably applies within `all-processed`: the standalone `medical_meadow_cord19` config lists 821,007 training examples, while the `all-processed` total of 239,357 equals the other component sizes plus 50,000.
* `truthful-qa-*` is a benchmark dataset for evaluating the truthfulness of models in text generation, and it is used in the Llama 2 paper. It contains 55 questions related to `Health` and 16 related to `Nutrition`, which makes it a useful resource for medical question-answering scenarios.
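
The concatenation and subsampling notes can be checked against the split sizes in the card metadata. The sketch below uses only the `num_examples` values listed in this card; it assumes that the concatenation covers every `medical_meadow_*` config and that `medical_meadow_cord19` contributes exactly 50,000 rows to `all-processed`. The function name is made up for illustration.

```python
# Train-split sizes copied from the card metadata (num_examples).
TRAIN_EXAMPLES = {
    "chatdoctor_healthcaremagic": 112165,
    "medical_meadow_cord19": 821007,
    "medical_meadow_health_advice": 8676,
    "medical_meadow_medical_flashcards": 33955,
    "medical_meadow_mediqa": 2208,
    "medical_meadow_medqa": 10178,
    "medical_meadow_mmmlu": 3787,
    "medical_meadow_pubmed_causal": 2446,
    "medical_meadow_wikidoc": 10000,
    "medical_meadow_wikidoc_patient_information": 5942,
}
ALL_PROCESSED_EXAMPLES = 239357  # train split of `all-processed`
CORD19_SUBSAMPLE = 50_000        # subsample size stated in the card notes


def expected_all_processed_size(sizes, cord19_cap):
    """Sum component sizes, capping cord19 at the stated subsample size."""
    total = 0
    for name, count in sizes.items():
        if name == "medical_meadow_cord19":
            count = min(count, cord19_cap)
        total += count
    return total


total = expected_all_processed_size(TRAIN_EXAMPLES, CORD19_SUBSAMPLE)
print(total, total == ALL_PROCESSED_EXAMPLES)
```

The sizes agree under these assumptions, which is consistent with the subsampling being applied only inside `all-processed`.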

Provider:

lavita

Summary of original information

Dataset overview

Dataset configurations

`all-processed`

Features:
- instruction: string
- input: string
- output: string
- __index_level_0__: int64
Splits:
- train: 239357 examples, 269589377 bytes
Download size: 155267884 bytes
Dataset size: 269589377 bytes

`chatdoctor-icliniq`

Features:
- input: string
- answer_icliniq: string
- answer_chatgpt: string
- answer_chatdoctor: string
Splits:
- test: 7321 examples, 16962106 bytes
Download size: 9373079 bytes
Dataset size: 16962106 bytes

`chatdoctor_healthcaremagic`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 112165 examples, 126454896 bytes
Download size: 70518147 bytes
Dataset size: 126454896 bytes

`med-qa-en-4options-source`

Features:
- meta_info: string
- question: string
- answer_idx: string
- answer: string
- options: list
  - key: string
  - value: string
- metamap_phrases: sequence: string
Splits:
- train: 10178 examples, 15420106 bytes
- test: 1273 examples, 1976582 bytes
- validation: 1272 examples, 1925861 bytes
Download size: 9684872 bytes
Dataset size: 19322549 bytes

`med-qa-en-5options-source`

Features:
- meta_info: string
- question: string
- answer_idx: string
- answer: string
- options: list
  - key: string
  - value: string
Splits:
- train: 10178 examples, 9765366 bytes
- test: 1273 examples, 1248299 bytes
- validation: 1272 examples, 1220927 bytes
Download size: 6704270 bytes
Dataset size: 12234592 bytes
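
In both `med-qa-en-*-source` configs, `options` is a list of `key`/`value` pairs rather than a mapping. A minimal sketch of turning that list into a dictionary and checking a record for internal consistency follows. It assumes that `answer_idx` stores an option key such as `"A"` and that `answer` stores the matching option text; the record shown is hypothetical and the helper names are made up.

```python
def options_to_dict(options):
    """Convert [{'key': 'A', 'value': '...'}, ...] into {'A': '...', ...}."""
    return {opt["key"]: opt["value"] for opt in options}


def answer_is_consistent(record):
    """Check that `answer` equals the option text selected by `answer_idx`.

    Assumption: `answer_idx` holds an option key such as "A".
    """
    by_key = options_to_dict(record["options"])
    return by_key.get(record["answer_idx"]) == record["answer"]


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {
    "question": "Which vitamin deficiency causes scurvy?",
    "answer_idx": "C",
    "answer": "Vitamin C",
    "options": [
        {"key": "A", "value": "Vitamin A"},
        {"key": "B", "value": "Vitamin B12"},
        {"key": "C", "value": "Vitamin C"},
        {"key": "D", "value": "Vitamin D"},
        {"key": "E", "value": "Vitamin K"},
    ],
}

print(options_to_dict(example["options"])["C"])
print(answer_is_consistent(example))
```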

`medical_meadow_cord19`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 821007 examples, 1336834621 bytes
Download size: 752855706 bytes
Dataset size: 1336834621 bytes

`medical_meadow_health_advice`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 8676 examples, 2196957 bytes
Download size: 890725 bytes
Dataset size: 2196957 bytes

`medical_meadow_medical_flashcards`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 33955 examples, 16453987 bytes
Download size: 6999958 bytes
Dataset size: 16453987 bytes

`medical_meadow_mediqa`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 2208 examples, 15690088 bytes
Download size: 3719929 bytes
Dataset size: 15690088 bytes

`medical_meadow_medqa`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 10178 examples, 10225018 bytes
Download size: 5505473 bytes
Dataset size: 10225018 bytes

`medical_meadow_mmmlu`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 3787 examples, 1442124 bytes
Download size: 685604 bytes
Dataset size: 1442124 bytes

`medical_meadow_pubmed_causal`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 2446 examples, 846695 bytes
Download size: 210947 bytes
Dataset size: 846695 bytes

`medical_meadow_wikidoc`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 10000 examples, 10224074 bytes
Download size: 5593178 bytes
Dataset size: 10224074 bytes

`medical_meadow_wikidoc_patient_information`

Features:
- instruction: string
- input: string
- output: string
Splits:
- train: 5942 examples, 3262558 bytes
Download size: 1544286 bytes
Dataset size: 3262558 bytes

`medmcqa`

Features:
- id: string
- question: string
- opa: string
- opb: string
- opc: string
- opd: string
- cop: class_label
  - names:
    - 0: a
    - 1: b
    - 2: c
    - 3: d
- choice_type: string
- exp: string
- subject_name: string
- topic_name: string
Splits:
- train: 182822 examples, 131903297 bytes
- test: 6150 examples, 1399350 bytes
- validation: 4183 examples, 2221428 bytes
Download size: 88311484 bytes
Dataset size: 135524075 bytes
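
In `medmcqa`, the correct option `cop` is a class label whose names `a` to `d` line up with the columns `opa` to `opd`. The sketch below maps a label to its option text. It assumes `cop` arrives as the integer index of the class label, which is how class labels are usually stored; the record and the helper name are hypothetical.

```python
# Class-label names for `cop`, in index order, as listed in the card metadata.
COP_NAMES = ("a", "b", "c", "d")


def correct_option(record):
    """Return (letter, option text) for the labelled correct option.

    Assumption: record["cop"] is an integer index into COP_NAMES.
    """
    letter = COP_NAMES[record["cop"]]
    return letter, record["op" + letter]


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {
    "id": "hypothetical-0001",
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
    "cop": 2,
}

print(correct_option(example))
```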

`mmmlu-anatomy`

Features:
- input: string
- A: string
- B: string
- C: string
- D: string
- target: string
Splits:
- test: 134 examples, 31810 bytes
- validation: 13 examples, 2879 bytes
- train: 4 examples, 717 bytes
Download size: 35632 bytes
Dataset size: 35406 bytes

`mmmlu-clinical-knowledge`

Features:
- input: string
- A: string
- B: string
- C: string
- D: string
- target: string
Splits:
- test: 264 examples, 60710 bytes
- validation: 28 examples, 6231 bytes
- train: 4 examples, 1026 bytes
Download size: 60329 bytes
Dataset size: 67967 bytes

`mmmlu-college-biology`

Features:
- input: string
- A: string
- B: string
- C: string
- D: string
- target: string
Splits:
- test: 143 examples, 47319 bytes
- validation: 15 examples, 4462 bytes
- train: 4 examples, 1103 bytes
Download size: 49782 bytes
Dataset size: 52884 bytes

`mmmlu-college-medicine`

Features:
- input: string
- A: string
- B: string
- C: string
- D: string
- target: string
Splits:
- test: 172 examples, 80363 bytes
- validation: 21 examples, 7079 bytes
- train: 4 examples, 1434 bytes
Download size: 63671 bytes
Dataset size: 88876 bytes

`mmmlu-medical-genetics`

Features:
- input: string
- A: string
- B: string
- C: string
- D: string
- target: string
Splits:
- test: 99 examples, 20021 bytes
- validation: 10 examples, 2590 bytes
- train: 4 examples, 854 bytes
Download size: 29043 bytes
Dataset size: 23465 bytes

`mmmlu-professional-medicine`

Features:
- input: string
- A: string
- B: string
- C: string
- D: string
- target: string
Splits:
- test: 271 examples, 214495 bytes
- validation: 30 examples, 23003 bytes
- train: 4 examples, 2531 bytes
Download size: 157219 bytes
Dataset size: 240029 bytes
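
Each `mmmlu-*` config pairs a 4-example `train` split with larger `validation` and `test` splits, so one plausible use of the small split is as few-shot demonstrations. The sketch below builds such a prompt from the listed fields (`input`, `A` to `D`, `target`). The prompt layout, the helper names, and the assumption that `target` holds the answer letter are illustrative choices rather than anything the card specifies, and the records are hypothetical.

```python
LETTERS = ("A", "B", "C", "D")


def format_question(record, with_answer):
    """Render one record as a question, four lettered options, and an answer slot."""
    lines = [record["input"]]
    lines.extend(f"{letter}. {record[letter]}" for letter in LETTERS)
    # Assumption: `target` stores the answer letter, e.g. "B".
    lines.append(f"Answer: {record['target']}" if with_answer else "Answer:")
    return "\n".join(lines)


def few_shot_prompt(demonstrations, query):
    """Join answered demonstrations with one unanswered query record."""
    blocks = [format_question(rec, with_answer=True) for rec in demonstrations]
    blocks.append(format_question(query, with_answer=False))
    return "\n\n".join(blocks)


# Hypothetical records, invented for illustration; not taken from the dataset.
demos = [
    {"input": "Which organ pumps blood?", "A": "Liver", "B": "Heart",
     "C": "Lung", "D": "Kidney", "target": "B"},
    {"input": "Which bone is in the thigh?", "A": "Femur", "B": "Ulna",
     "C": "Tibia", "D": "Radius", "target": "A"},
]
query = {"input": "Which organ filters blood to form urine?", "A": "Spleen",
         "B": "Heart", "C": "Kidney", "D": "Stomach", "target": "C"}

prompt = few_shot_prompt(demos, query)
print(prompt)
```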

`pubmed-qa`

Features:
- QUESTION: string
- CONTEXTS: sequence: string
- LABELS: sequence: string
- MESHES: sequence: string
- YEAR: string
- reasoning_required_pred: string
- reasoning_free_pred: string
- final_decision: string
- LONG_ANSWER: string
Splits:
- train: 200000 examples, 421508218 bytes
- validation: 11269 examples, 23762218 bytes
Download size: 233536544 bytes
Dataset size: 445270436 bytes

`truthful-qa-generation`

Features:
- type: string
- category: string
- question: string
- best_answer: string
- correct_answers: sequence: string
- incorrect_answers: sequence: string
- source: string
Splits:
- validation: 817 examples, 473382 bytes
Download size: 222648 bytes
Dataset size: 473382 bytes

`truthful-qa-multiple-choice`

Features:
- question: string
- mc1_targets: struct
  - choices: sequence: string
  - labels: sequence: int32
- mc2_targets: struct
  - choices: sequence: string
  - labels: sequence: int32
Splits:
- validation: 817 examples, 609082 bytes
Download size: 271032 bytes
Dataset size: 609082 bytes
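
In `truthful-qa-multiple-choice`, `mc1_targets` and `mc2_targets` each pair a `choices` list with a parallel `labels` list. A minimal sketch of reading them follows. It assumes the upstream TruthfulQA convention that a label of 1 marks a true choice, with exactly one such choice in `mc1_targets` and possibly several in `mc2_targets`; the record and the helper names are hypothetical.

```python
def true_choices(targets):
    """Return the choices whose parallel label is 1 (assumed to mean 'true')."""
    return [c for c, label in zip(targets["choices"], targets["labels"]) if label == 1]


def mc1_answer(record):
    """Return the single true choice of `mc1_targets`.

    Raises ValueError if the one-true-choice assumption does not hold.
    """
    answers = true_choices(record["mc1_targets"])
    if len(answers) != 1:
        raise ValueError(f"expected exactly one true choice, found {len(answers)}")
    return answers[0]


# Hypothetical record, invented for illustration; not taken from the dataset.
example = {
    "question": "Does cracking your knuckles cause arthritis?",
    "mc1_targets": {
        "choices": ["No, it has not been shown to cause arthritis.",
                    "Yes, it causes arthritis.",
                    "Yes, it wears the joints out."],
        "labels": [1, 0, 0],
    },
    "mc2_targets": {
        "choices": ["No, it has not been shown to cause arthritis.",
                    "There is no established link.",
                    "Yes, it causes arthritis."],
        "labels": [1, 1, 0],
    },
}

print(mc1_answer(example))
print(true_choices(example["mc2_targets"]))
```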

`usmle-self-assessment-step1`

Features:
- question: string
- options: struct
  - A: string
  - B: string
  - C: string
  - D: string
  - E: string
  - F: string
  - G: string
  - H: string
  - I: string
- answer: string
- answer_idx: string
Splits:
- test: 94 examples, 80576 bytes
Download size: 60550 bytes
Dataset size: 80576 bytes

`usmle-self-assessment-step2`

Features:
- question: string
- options: struct
  - A: string
  - B: string
  - C: string
  - D: string
  - E: string
  - F: string
  - G: string
- answer: string
- answer_idx: string
Splits:
- test: 109 examples, 133267 bytes
Download size: 80678 bytes
Dataset size: 133267 bytes

`usmle-self-assessment-step3`

Features:
- question: string
- options: struct
  - A: string
  - B: string
  - C: string
  - D: string
  - E: string
  - F: string
  - G: string
- answer: string
- answer_idx: string
Splits:
- test: 122 examples, 156286 bytes
Download size: 98163 bytes
Dataset size: 156286 bytes

Dataset configuration files

all-processed:
- train: all-processed/train-*
chatdoctor-icliniq:
- test: chatdoctor-icliniq/test-*
chatdoctor_healthcaremagic:
- train: chatdoctor_healthcaremagic/train-*
med-qa-en-4options-source:
- train: med-qa-en-4options-source/train-*
- test: med-qa-en-4options-source/test-*
- validation: med-qa-en-4options-source/validation-*
med-qa-en-5options-source:
- train: med-qa-en-5options-source/train-*
- test: med-qa-en-5options-source/test-*
- validation: med-qa-en-5options-source/validation-*
medical_meadow_cord19:
- train: medical_meadow_cord19/train-*
medical_meadow_health_advice:
- train: medical_meadow_health_advice/train-*
medical_meadow_medical_flashcards:
- train: medical_meadow_medical_flashcards/

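Collected summary

Dataset introduction

Construction

The dataset was built by combining several medical question-answering sub-datasets, including Medical Meadow, ChatDoctor, and other medical QA datasets. During construction, the source datasets were cleaned, merged, and converted to a unified format to ensure data quality and consistency.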

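Characteristics

The dataset covers a wide range of medical question-answering scenarios across areas such as clinical knowledge, medical genetics, and biomedicine. It also contains several question and answer formats, such as single-answer multiple choice, multiple-answer multiple choice, and fill-in-the-blank, to support different model training and evaluation needs.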

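Usage

Users can select whichever sub-datasets suit their research needs for training or testing. The dataset provides a clear file structure and data format description, so it is quick to understand and apply. It can also be downloaded and loaded through the Hugging Face libraries for model training and evaluation.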
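
As a sketch of loading through the Hugging Face libraries, the snippet below validates a config and split against names copied from the card metadata and then builds the arguments for `datasets.load_dataset`. It assumes the `datasets` package is installed and that the repository is reachable under the id `lavita/medical-qa-datasets`; the helper names `load_args` and `load_config` are made up for illustration, and only a subset of configs is listed.

```python
REPO_ID = "lavita/medical-qa-datasets"

# Subset of configs and their splits, copied from the card metadata.
KNOWN_SPLITS = {
    "all-processed": ("train",),
    "chatdoctor-icliniq": ("test",),
    "medmcqa": ("train", "test", "validation"),
    "pubmed-qa": ("train", "validation"),
    "truthful-qa-generation": ("validation",),
    "usmle-self-assessment-step1": ("test",),
}


def load_args(config, split):
    """Build keyword arguments for datasets.load_dataset after a sanity check."""
    if config not in KNOWN_SPLITS:
        raise KeyError(f"config not in the local table: {config!r}")
    if split not in KNOWN_SPLITS[config]:
        raise ValueError(f"{config!r} has no split {split!r}")
    return {"path": REPO_ID, "name": config, "split": split}


def load_config(config, split):
    """Download (or reuse the cache) and return one split of one config."""
    # Imported lazily so the argument check works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(**load_args(config, split))


print(load_args("medmcqa", "validation"))
```

Calling `load_config("medmcqa", "validation")` would then fetch that split; the check fails early for a split the card does not list, such as `train` for `chatdoctor-icliniq`.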

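Background and challenges

Background overview

lavita/medical-qa-datasets is a collection of datasets focused on medical question answering, covering scenarios from clinical knowledge to patient consultations. It was built in response to the need for medical information processing and aims to give researchers rich medical text resources to advance medical natural language processing. The main contributor is lavita, whose impact on the field lies in providing diverse training and test data for medical question-answering systems, thereby advancing medical informatics.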

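Current challenges

Challenges encountered in building the dataset include: 1) the diversity and complexity of medical data, which requires coverage of a broad range of medical fields and question types; 2) the sensitivity and privacy of medical information, which requires compliance during data collection and processing; 3) annotation accuracy, which requires medical expertise for high-quality labels. Challenges in the problem domain the dataset addresses include improving the accuracy and response speed of question-answering systems and ensuring that such systems behave consistently and reliably on real-world medical questions.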

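Common scenarios

Typical use cases

In medical question-answering systems, the dataset is widely used to train models to understand and answer medical questions, for example in patient consultations, medical exam review, and self-assessment of medical knowledge.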

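Practical applications

In practice, the dataset can be used to develop intelligent medical assistants, online medical consultation platforms, and medical education software, improving the quality and efficiency of healthcare services.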

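Derived and related work

Building on this dataset, researchers have carried out a large body of related work, such as building multimodal medical question-answering systems and developing question-answering models for specific diseases, advancing the field of medical artificial intelligence.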

The content above was collected and summarized by 遇见数据集.
