exafluence/Open-MedQA-Nexus

Name: exafluence/Open-MedQA-Nexus
Creator: exafluence
Published: 2024-10-15 03:10:33
License: No description available

Hugging Face · Updated 2024-10-15 · Indexed 2025-04-12

Download link:

https://hf-mirror.com/datasets/exafluence/Open-MedQA-Nexus


Resource description:

---
license: apache-2.0
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: source
    dtype: string
  - name: source_url
    dtype: string
  splits:
  - name: train
    num_bytes: 1330442127
    num_examples: 646749
  download_size: 602658811
  dataset_size: 1330442127
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- medicine
- healthcare
size_categories:
- 100K<n<1M
---

# Open Nexus MedQA

This dataset combines various publicly available medical datasets like ChatDoctor, icliniq, etc., into a unified format for training and evaluating medical question-answering models.

## Dataset Details

Open Nexus MedQA is a comprehensive dataset designed to facilitate the development of advanced medical question answering systems. It integrates diverse medical data sources, meticulously processed to provide a uniform format. The format includes:

- Instructions: Clear and concise instructions for each question.
- Inputs: Medical queries ranging from simple to complex.
- Outputs: Accurate and informative responses to the corresponding questions.
- Source Information: Details about the original dataset from which each example was derived.

- **Curated by:** Exafluence Inc
- **Shared by:** Exafluence Inc
- **Language(s) (NLP):** English
- **License:** Apache License 2.0

### Dataset Sources

Open Nexus MedQA integrates data from a diverse range of publicly available medical datasets.
Here's a breakdown of the sources:

**ChatDoctor-based Datasets:**

- Alpaca Data - ChatDoctor: [Link](https://github.com/Kent0n-Li/ChatDoctor/)
- icliniq.com - ChatDoctor: [Link](https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view)
- HealthCareMagic.com - ChatDoctor: [Link](https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view)

**Hugging Face Datasets:**

- CareQA - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/CareQA)
- medmcqa_mixtral_cot - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/medmcqa-cot)
- medqa_mixtral_cot - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/medqa-cot)
- pubmedqa_mixtral_cot - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/pubmedqa-cot)

**Other Datasets:**

- MedInstruct-52k: [Link](https://huggingface.co/datasets/lavita/AlpaCare-MedInstruct-52k)
- US QBank: [Link](https://github.com/jind11/MedQA)

**Note:** We actively encourage users to explore the original datasets for further details. References to the original datasets will be provided within the dataset metadata.

## Uses

### Direct Use

Open Nexus MedQA can be used for various purposes:

- Research: Train and evaluate medical question answering models.
- Development: Build and improve AI-powered medical applications (chatbots, virtual assistants, diagnostic tools).
- Education: Enhance the understanding of medical information retrieval for students and professionals.

### Out-of-Scope Use

- Direct diagnosis or treatment: The dataset is not intended for medical diagnosis or treatment. Consult with qualified healthcare professionals for proper medical care.
- Commercial use without permission: The initial release allows non-commercial use. Refer to the license for commercial applications.

## Dataset Structure

The dataset contains records in a unified format:

- Instruction: Text indicating the task or question.
- Input: Medical query or prompt for the question.
- Output: Corresponding accurate and informative answer.
- Source: Information about the original dataset from which the record originated.
- Source URL: URL of the source dataset.

## Dataset Creation

### Curation Rationale

We aimed to create a comprehensive and diverse medical question-answering dataset by merging various public datasets. This unified format allows researchers and developers to build robust medical NLP models.

### Source Data

The dataset integrates publicly available medical datasets: ChatDoctor, icliniq, CareQA, HealthCareMagic, PubMedQA, MedQA, MedMCQA, MedInstruct, and US QBank.

#### Data Collection and Processing

Each source dataset underwent various processing steps to achieve a consistent format:

- Data Extraction: Relevant data points (instructions, inputs, outputs) were extracted from each source.
- Normalization: Text processing steps like cleaning, tokenization, and normalization were applied.
- Alignment: Data was aligned to the unified format with instruction, input, output, and source information columns.

#### Who are the source data producers?

The source datasets were created by various independent organizations or researchers. We acknowledge their contributions and provide references to the original sources within the dataset metadata.

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset.

## Dataset Card Authors

[Jeevan J](https://huggingface.co/jeevan-exa)
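
The unified record format described in this card (string columns `instruction`, `input`, `output`, `source`, `source_url`) can be sketched in Python as below. This is a minimal illustration, not the curators' own tooling: the helper names (`missing_fields`, `build_prompt`, `count_by_source`, `load_train_split`) and the prompt template are assumptions made for this example. Only the column names and the repository ID `exafluence/Open-MedQA-Nexus` come from the card. Loading assumes the Hugging Face `datasets` package is installed and the Hub is reachable.

```python
from collections import Counter
from typing import Iterable, List, Mapping

# Column names taken from the card's `dataset_info.features` block.
EXPECTED_FIELDS = ("instruction", "input", "output", "source", "source_url")


def missing_fields(record: Mapping[str, object]) -> List[str]:
    """Return the expected columns that are absent or not strings."""
    return [f for f in EXPECTED_FIELDS if not isinstance(record.get(f), str)]


def build_prompt(record: Mapping[str, str]) -> str:
    """Join instruction and input into one prompt.

    The template is hypothetical; the card does not prescribe one.
    """
    instruction = record["instruction"].strip()
    query = record["input"].strip()
    return f"{instruction}\n\n{query}" if query else instruction


def count_by_source(records: Iterable[Mapping[str, str]]) -> Counter:
    """Tally records per originating dataset using the `source` column."""
    return Counter(r["source"] for r in records)


def load_train_split():
    """Fetch the train split from the Hub (needs `datasets` and network access)."""
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset

    return load_dataset("exafluence/Open-MedQA-Nexus", split="train")
```

A typical use would be to call `load_train_split()`, check a handful of rows with `missing_fields`, and pass the rows to `count_by_source` to see how the examples are distributed across the source datasets listed above.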

Provided by:

exafluence
