
Shekswess/llama2_medquad_instruct_dataset

Hugging Face | Updated: 2024-04-13 | Indexed: 2024-06-11
Download link:
https://hf-mirror.com/datasets/Shekswess/llama2_medquad_instruct_dataset
Resource description:
---
size_categories:
- 10K<n<100K
task_categories:
- question-answering
dataset_info:
  features:
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 47296307
    num_examples: 16359
  download_size: 17865991
  dataset_size: 47296307
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
tags:
- medical
---

Dataset made for supervised instruction fine-tuning of Llama 2 LLMs, based on the MedQuAD dataset:
- Medquad dataset (https://www.kaggle.com/datasets/jpmiller/layoutlm)

## Medquad

MedQuAD is a collection of 47,457 medical question-answer pairs compiled from 12 authoritative sources within the National Institutes of Health (NIH), including cancer.gov, niddk.nih.gov, GARD, and MedlinePlus Health Topics. The pairs span 37 distinct question types and cover a wide range of medical subjects, including diseases, drugs, and medical procedures.

The dataset provides additional annotations in XML files to support Information Retrieval (IR) and Natural Language Processing (NLP) tasks. These annotations include the question type, the question focus, its synonyms, its Concept Unique Identifier (CUI) from the Unified Medical Language System (UMLS), and its Semantic Type. Each question focus is also assigned to one of three categories: Disease, Drug, or Other. The exception is the MedlinePlus collections, which focus exclusively on diseases.
Provider:
Shekswess
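Summary of original information

Dataset overview

Basic information

  • Size range: 10K<n<100K
  • Task category: question answering (question-answering)

Dataset features

  • Input field (input): string
  • Output field (output): string
  • Instruction field (instruction): string
  • Prompt field (prompt): string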
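
The four string fields listed above can be sketched as a typed record with a small validity check. This is only an illustration: the names `MedquadRecord`, `FIELDS`, and `is_valid_record` are hypothetical, and the sample values are placeholders rather than real rows from the dataset.

```python
from typing import TypedDict


class MedquadRecord(TypedDict):
    """One row of the train split, per the feature list (all strings)."""
    input: str
    output: str
    instruction: str
    prompt: str


# Field names as documented in the dataset card.
FIELDS = ("input", "output", "instruction", "prompt")


def is_valid_record(row: dict) -> bool:
    """Return True if row has exactly the four documented fields, all strings."""
    return set(row) == set(FIELDS) and all(
        isinstance(row[name], str) for name in FIELDS
    )


# Placeholder values only; these are not taken from the dataset.
example: MedquadRecord = {
    "input": "<question text>",
    "output": "<answer text>",
    "instruction": "<instruction text>",
    "prompt": "<full prompt text>",
}

print(is_valid_record(example))
```

A check like this can be useful before fine-tuning, since a row with a missing or non-string field would otherwise surface only later as a tokenization error.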

Dataset splits

  • Training set (train):
    • Number of examples: 16359
    • Data size: 47296307 bytes

Download and dataset size

  • Download size: 17865991 bytes
  • Dataset size: 47296307 bytes
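
As a quick sanity check on the figures above, the following sketch derives the average size per example and the ratio of download size to dataset size. The three constants come from the listing; the variable and function names are illustrative only.

```python
# Figures copied from the dataset listing.
NUM_EXAMPLES = 16359
DATASET_SIZE_BYTES = 47296307
DOWNLOAD_SIZE_BYTES = 17865991


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


# Average stored size of one example across all four string fields.
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

# Fraction of the dataset size that has to be downloaded.
download_ratio = DOWNLOAD_SIZE_BYTES / DATASET_SIZE_BYTES

print(f"dataset size:  {to_mib(DATASET_SIZE_BYTES):.1f} MiB")
print(f"download size: {to_mib(DOWNLOAD_SIZE_BYTES):.1f} MiB")
print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"download / dataset ratio:  {download_ratio:.2%}")
```

The numbers work out to roughly 2.9 KB per example, with the download being about 38% of the dataset size.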

Configuration

  • Default configuration (default):
    • Training data file path: data/train-*
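
A minimal sketch of loading the default configuration's train split, assuming the Hugging Face `datasets` library is installed and the repository is reachable under the ID shown in the title. The helper name `load_train_split` is hypothetical.

```python
# Repository ID as shown in the listing title.
DATASET_ID = "Shekswess/llama2_medquad_instruct_dataset"


def load_train_split():
    """Download (or reuse from cache) the train split of the default config.

    Assumes the `datasets` package is installed and network access is
    available on first use. The import is done lazily so that this module
    can be imported without the library present.
    """
    from datasets import load_dataset

    # With a single default configuration, no config name is needed.
    return load_dataset(DATASET_ID, split="train")
```

Calling `load_train_split()` returns a `Dataset` object. Its `num_rows` and `column_names` attributes can then be compared against the example count and the four fields listed on this page.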

Tags

  • Domain: medical