Shekswess/llama2_medquad_instruct_dataset
收藏Hugging Face2024-04-13 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/Shekswess/llama2_medquad_instruct_dataset
下载链接
链接失效反馈官方服务:
资源简介:
---
size_categories:
- 10K<n<100K
task_categories:
- question-answering
dataset_info:
features:
- name: input
dtype: string
- name: output
dtype: string
- name: instruction
dtype: string
- name: prompt
dtype: string
splits:
- name: train
num_bytes: 47296307
num_examples: 16359
download_size: 17865991
dataset_size: 47296307
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
tags:
- medical
---
Dataset made for instruction supervised finetuning of Llama 2 LLMs based on the Medquad dataset:
- Medquad dataset (https://www.kaggle.com/datasets/jpmiller/layoutlm)
## Medquad
MedQuAD is a comprehensive collection consisting of 47,457 medical question-answer pairs compiled from 12 authoritative sources within the National Institutes of Health (NIH), including domains like cancer.gov, niddk.nih.gov, GARD, and MedlinePlus Health Topics. These question-answer pairs span 37 distinct question types, covering a wide spectrum of medical subjects, including diseases, drugs, and medical procedures. The dataset features additional annotations provided in XML files, facilitating various Information Retrieval (IR) and Natural Language Processing (NLP) tasks. These annotations encompass crucial information such as question type, question focus, synonyms, Unique Identifier (CUI) from the Unified Medical Language System (UMLS), and Semantic Type. Moreover, the dataset includes categorization of question focuses into three main categories: Disease, Drug, or Other, with the exception of collections from MedlinePlus, which exclusively focus on diseases.
提供机构:
Shekswess
原始信息汇总
数据集概述
基本信息
- 大小范围: 10K<n<100K
- 任务类别: 问答(question-answering)
数据集特征
- 输入字段 (
input): 数据类型为字符串 - 输出字段 (
output): 数据类型为字符串 - 指令字段 (
instruction): 数据类型为字符串 - 提示字段 (
prompt): 数据类型为字符串
数据集分割
- 训练集 (
train):- 示例数量: 16359
- 数据大小: 47296307字节
下载与数据集大小
- 下载大小: 17865991字节
- 数据集大小: 47296307字节
配置
- 默认配置 (
default):- 训练数据文件路径:
data/train-*
- 训练数据文件路径:
标签
- 领域: 医疗(medical)



