
exafluence/Open-MedQA-Nexus

Source: Hugging Face. Updated 2024-10-15. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/exafluence/Open-MedQA-Nexus
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: instruction
    dtype: string
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: source
    dtype: string
  - name: source_url
    dtype: string
  splits:
  - name: train
    num_bytes: 1330442127
    num_examples: 646749
  download_size: 602658811
  dataset_size: 1330442127
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- medicine
- healthcare
size_categories:
- 100K<n<1M
---

# Open Nexus MedQA

This dataset combines various publicly available medical datasets like ChatDoctor, icliniq, etc., into a unified format for training and evaluating medical question-answering models.

## Dataset Details

Open Nexus MedQA is a comprehensive dataset designed to facilitate the development of advanced medical question answering systems. It integrates diverse medical data sources, meticulously processed to provide a uniform format. The format includes:

- Instructions: Clear and concise instructions for each question.
- Inputs: Medical queries ranging from simple to complex.
- Outputs: Accurate and informative responses to the corresponding questions.
- Source Information: Details about the original dataset from which each example was derived.

- **Curated by:** Exafluence Inc
- **Shared by:** Exafluence Inc
- **Language(s) (NLP):** English
- **License:** Apache License 2.0

### Dataset Sources

Open Nexus MedQA integrates data from a diverse range of publicly available medical datasets.
Here's a breakdown of the sources:

**ChatDoctor-based Datasets:**

- Alpaca Data - ChatDoctor: [Link](https://github.com/Kent0n-Li/ChatDoctor/)
- icliniq.com - ChatDoctor: [Link](https://drive.google.com/file/d/1ZKbqgYqWc7DJHs3N9TQYQVPdDQmZaClA/view)
- HealthCareMagic.com - ChatDoctor: [Link](https://drive.google.com/file/d/1lyfqIwlLSClhgrCutWuEe_IACNq6XNUt/view)

**Hugging Face Datasets:**

- CareQA - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/CareQA)
- medmcqa_mixtral_cot - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/medmcqa-cot)
- medqa_mixtral_cot - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/medqa-cot)
- pubmedqa_mixtral_cot - HPAI-BSC: [Link](https://huggingface.co/datasets/HPAI-BSC/pubmedqa-cot)

**Other Datasets:**

- MedInstruct-52k: [Link](https://huggingface.co/datasets/lavita/AlpaCare-MedInstruct-52k)
- US QBank: [Link](https://github.com/jind11/MedQA)

**Note:** We actively encourage users to explore the original datasets for further details. References to the original datasets will be provided within the dataset metadata.

## Uses

### Direct Use

Open Nexus MedQA can be used for various purposes:

- Research: Train and evaluate medical question answering models.
- Development: Build and improve AI-powered medical applications (chatbots, virtual assistants, diagnostic tools).
- Education: Enhance the understanding of medical information retrieval for students and professionals.

### Out-of-Scope Use

- Direct diagnosis or treatment: The dataset is not intended for medical diagnosis or treatment. Consult with qualified healthcare professionals for proper medical care.
- Commercial use without permission: The initial release allows non-commercial use.
  Refer to the license for commercial applications.

## Dataset Structure

The dataset contains records in a unified format:

- Instruction: Text indicating the task or question.
- Input: Medical query or prompt for the question.
- Output: Corresponding accurate and informative answer.
- Source: Information about the original dataset from which the record originated.
- Source URL: Link to the source dataset.

## Dataset Creation

### Curation Rationale

We aimed to create a comprehensive and diverse medical question-answering dataset by merging various public datasets. This unified format allows researchers and developers to build robust medical NLP models.

### Source Data

The dataset integrates publicly available medical datasets: ChatDoctor, icliniq, CareQA, HealthCareMagic, PubMedQA, MedQA, MedMCQA, MedInstruct, and US QBank.

#### Data Collection and Processing

Each source dataset underwent various processing steps to achieve a consistent format:

- Data Extraction: Relevant data points (instructions, inputs, outputs) were extracted from each source.
- Normalization: Text processing steps like cleaning, tokenization, and normalization were applied.
- Alignment: Data was aligned to the unified format with instruction, input, output, and source information columns.

#### Who are the source data producers?

The source datasets were created by various independent organizations or researchers. We acknowledge their contributions and provide references to the original sources within the dataset metadata.

### Recommendations

Users should be made aware of the risks, biases and limitations of the dataset.

## Dataset Card Authors

[Jeevan J](https://huggingface.co/jeevan-exa)
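
### Example: aligning a record to the unified format

The card describes extraction, normalization, and alignment only at a high level, so the following is a minimal sketch of what aligning one source record to the five-column format might look like, not the curators' actual pipeline. The column names come from the card's `dataset_info`. The helper names (`normalize_text`, `to_unified`, `build_prompt`), the whitespace-only cleaning, the raw field names, the sample text, and the Alpaca-style prompt template are all assumptions made for illustration.

```python
from typing import Dict

# Column names taken from `dataset_info.features` in the dataset card.
UNIFIED_COLUMNS = ("instruction", "input", "output", "source", "source_url")


def normalize_text(text: str) -> str:
    """Collapse whitespace runs; a stand-in for the card's unspecified cleaning."""
    return " ".join(text.split())


def to_unified(
    raw: Dict[str, str],
    field_map: Dict[str, str],
    instruction: str,
    source: str,
    source_url: str,
) -> Dict[str, str]:
    """Map one raw record onto the unified schema.

    `field_map` (hypothetical) maps a unified column name to the key that
    holds the same information in the raw record. Missing values become "".
    """
    record = {col: "" for col in UNIFIED_COLUMNS}
    record["instruction"] = normalize_text(instruction)
    for unified_name, raw_key in field_map.items():
        record[unified_name] = normalize_text(raw.get(raw_key, ""))
    record["source"] = source
    record["source_url"] = source_url
    return record


def build_prompt(record: Dict[str, str]) -> str:
    """Render a record as an Alpaca-style prompt (assumed template)."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record["input"]:  # some records may have an empty input
        parts.append(f"### Input:\n{record['input']}")
    parts.append("### Response:\n")
    return "\n\n".join(parts)


# Hypothetical raw record; keys and text are made up for illustration.
raw_example = {
    "question": "  What are common symptoms   of anemia? ",
    "answer": "Fatigue and pale skin are commonly reported.",
}
unified = to_unified(
    raw_example,
    field_map={"input": "question", "output": "answer"},
    instruction="Answer the following medical question.",
    source="example-source",
    source_url="https://example.com/dataset",
)
prompt = build_prompt(unified)
```

With the Hugging Face `datasets` library installed, rows returned by `load_dataset("exafluence/Open-MedQA-Nexus", split="train")` should carry the same five columns, so `build_prompt` could be applied to them directly; this has not been verified against the published files.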
Provided by:
exafluence