Shekswess/medical_gemma_instruct_dataset_short

Name: Shekswess/medical_gemma_instruct_dataset_short
Creator: Shekswess
Published: 2024-04-13 20:13:39
License: 暂无描述

Hugging Face2024-04-13 更新2024-06-11 收录

下载链接：

https://hf-mirror.com/datasets/Shekswess/medical_gemma_instruct_dataset_short

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en size_categories: - 1K<n<10K task_categories: - question-answering dataset_info: features: - name: output dtype: string - name: input dtype: string - name: instruction dtype: string - name: prompt dtype: string splits: - name: train num_bytes: 4250252 num_examples: 2000 download_size: 1922523 dataset_size: 4250252 configs: - config_name: default data_files: - split: train path: data/train-* tags: - medical --- Dataset made for instruction supervised finetuning of Gemma LLMs, by combining of medical datasets and getting 2k entries from them: - Medical meadow wikidoc (https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc/blob/main/README.md) - Medquad (https://www.kaggle.com/datasets/jpmiller/layoutlm) ## Medical meadow wikidoc The Medical Meadow Wikidoc dataset comprises question-answer pairs sourced from WikiDoc, an online platform where medical professionals collaboratively contribute and share contemporary medical knowledge. WikiDoc features two primary sections: the "Living Textbook" and "Patient Information". The "Living Textbook" encompasses chapters across various medical specialties, from which we extracted content. Utilizing GTP-3.5-Turbo, the paragraph headings are transformed into questions and utilized the respective paragraphs as answers. Notably, the structure of "Patient Information" is distinct; each section's subheading already serves as a question, eliminating the necessity for rephrasing. ## Medquad MedQuAD is a comprehensive collection consisting of 47,457 medical question-answer pairs compiled from 12 authoritative sources within the National Institutes of Health (NIH), including domains like cancer.gov, niddk.nih.gov, GARD, and MedlinePlus Health Topics. These question-answer pairs span 37 distinct question types, covering a wide spectrum of medical subjects, including diseases, drugs, and medical procedures. The dataset features additional annotations provided in XML files, facilitating various Information Retrieval (IR) and Natural Language Processing (NLP) tasks. These annotations encompass crucial information such as question type, question focus, synonyms, Unique Identifier (CUI) from the Unified Medical Language System (UMLS), and Semantic Type. Moreover, the dataset includes categorization of question focuses into three main categories: Disease, Drug, or Other, with the exception of collections from MedlinePlus, which exclusively focus on diseases.

提供机构：

Shekswess

原始信息汇总

数据集概述

基本信息

语言: 英语
大小范围: 1K<n<10K
任务类别: 问答

数据集特征

输出 (output): 字符串类型
输入 (input): 字符串类型
指令 (instruction): 字符串类型
提示 (prompt): 字符串类型

数据集分割

训练集 (train):
- 示例数量: 2000
- 字节数: 4250252

数据集大小

下载大小: 1922523字节
数据集大小: 4250252字节

配置

配置名称: default
数据文件:
- 分割: 训练
- 路径: data/train-*