Shekswess/llama2_medquad_instruct_dataset

Name: Shekswess/llama2_medquad_instruct_dataset
Creator: Shekswess
Published: 2024-04-13 19:16:26
License: 暂无描述

Hugging Face2024-04-13 更新2024-06-11 收录

下载链接：

https://hf-mirror.com/datasets/Shekswess/llama2_medquad_instruct_dataset

下载链接

链接失效反馈

官方服务：

资源简介：

--- size_categories: - 10K<n<100K task_categories: - question-answering dataset_info: features: - name: input dtype: string - name: output dtype: string - name: instruction dtype: string - name: prompt dtype: string splits: - name: train num_bytes: 47296307 num_examples: 16359 download_size: 17865991 dataset_size: 47296307 configs: - config_name: default data_files: - split: train path: data/train-* tags: - medical --- Dataset made for instruction supervised finetuning of Llama 2 LLMs based on the Medquad dataset: - Medquad dataset (https://www.kaggle.com/datasets/jpmiller/layoutlm) ## Medquad MedQuAD is a comprehensive collection consisting of 47,457 medical question-answer pairs compiled from 12 authoritative sources within the National Institutes of Health (NIH), including domains like cancer.gov, niddk.nih.gov, GARD, and MedlinePlus Health Topics. These question-answer pairs span 37 distinct question types, covering a wide spectrum of medical subjects, including diseases, drugs, and medical procedures. The dataset features additional annotations provided in XML files, facilitating various Information Retrieval (IR) and Natural Language Processing (NLP) tasks. These annotations encompass crucial information such as question type, question focus, synonyms, Unique Identifier (CUI) from the Unified Medical Language System (UMLS), and Semantic Type. Moreover, the dataset includes categorization of question focuses into three main categories: Disease, Drug, or Other, with the exception of collections from MedlinePlus, which exclusively focus on diseases.

提供机构：

Shekswess

原始信息汇总

数据集概述

基本信息

大小范围: 10K<n<100K
任务类别: 问答（question-answering）

数据集特征

输入字段 (input): 数据类型为字符串
输出字段 (output): 数据类型为字符串
指令字段 (instruction): 数据类型为字符串
提示字段 (prompt): 数据类型为字符串

数据集分割

训练集 (train):
- 示例数量: 16359
- 数据大小: 47296307字节

下载与数据集大小

下载大小: 17865991字节
数据集大小: 47296307字节

配置

默认配置 (default):
- 训练数据文件路径: data/train-*