five

Shekswess/medical_gemma_instruct_dataset_short

收藏
Hugging Face2024-04-13 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/Shekswess/medical_gemma_instruct_dataset_short
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en size_categories: - 1K<n<10K task_categories: - question-answering dataset_info: features: - name: output dtype: string - name: input dtype: string - name: instruction dtype: string - name: prompt dtype: string splits: - name: train num_bytes: 4250252 num_examples: 2000 download_size: 1922523 dataset_size: 4250252 configs: - config_name: default data_files: - split: train path: data/train-* tags: - medical --- Dataset made for instruction supervised finetuning of Gemma LLMs, by combining of medical datasets and getting 2k entries from them: - Medical meadow wikidoc (https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc/blob/main/README.md) - Medquad (https://www.kaggle.com/datasets/jpmiller/layoutlm) ## Medical meadow wikidoc The Medical Meadow Wikidoc dataset comprises question-answer pairs sourced from WikiDoc, an online platform where medical professionals collaboratively contribute and share contemporary medical knowledge. WikiDoc features two primary sections: the "Living Textbook" and "Patient Information". The "Living Textbook" encompasses chapters across various medical specialties, from which we extracted content. Utilizing GTP-3.5-Turbo, the paragraph headings are transformed into questions and utilized the respective paragraphs as answers. Notably, the structure of "Patient Information" is distinct; each section's subheading already serves as a question, eliminating the necessity for rephrasing. ## Medquad MedQuAD is a comprehensive collection consisting of 47,457 medical question-answer pairs compiled from 12 authoritative sources within the National Institutes of Health (NIH), including domains like cancer.gov, niddk.nih.gov, GARD, and MedlinePlus Health Topics. These question-answer pairs span 37 distinct question types, covering a wide spectrum of medical subjects, including diseases, drugs, and medical procedures. The dataset features additional annotations provided in XML files, facilitating various Information Retrieval (IR) and Natural Language Processing (NLP) tasks. These annotations encompass crucial information such as question type, question focus, synonyms, Unique Identifier (CUI) from the Unified Medical Language System (UMLS), and Semantic Type. Moreover, the dataset includes categorization of question focuses into three main categories: Disease, Drug, or Other, with the exception of collections from MedlinePlus, which exclusively focus on diseases.
提供机构:
Shekswess
原始信息汇总

数据集概述

基本信息

  • 语言: 英语
  • 大小范围: 1K<n<10K
  • 任务类别: 问答

数据集特征

  • 输出 (output): 字符串类型
  • 输入 (input): 字符串类型
  • 指令 (instruction): 字符串类型
  • 提示 (prompt): 字符串类型

数据集分割

  • 训练集 (train):
    • 示例数量: 2000
    • 字节数: 4250252

数据集大小

  • 下载大小: 1922523字节
  • 数据集大小: 4250252字节

配置

  • 配置名称: default
  • 数据文件:
    • 分割: 训练
    • 路径: data/train-*

标签

  • 医疗
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作