RobCzikkel/DoctorGPT
Source: Hugging Face. Updated 2023-12-05. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/RobCzikkel/DoctorGPT
Resource description:
---
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- conversational
pretty_name: Doctor & Patient
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: input_ids
    sequence: int32
  - name: length
    dtype: int64
  - name: attention_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 42127351.778204426
    num_examples: 13125
  - name: test
    num_bytes: 10534245.221795576
    num_examples: 3282
  download_size: 10917910
  dataset_size: 52661597.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
tags:
- biology
- medical
---
### Dataset
This is an edited and tokenized version of the MedQuad-MedicalQnADataset by keivalya.
The original dataset contains 16K+ question-and-answer pairs between patients and doctors. Each pair has been converted into a full prompt for training Microsoft's BioGPT.
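
As a quick sanity check on the numbers in the metadata, the following sketch adds up the two splits and computes the train share. It only uses values copied from the card; the variable names are illustrative and nothing here is part of the dataset itself.

```python
import math

# Values copied from the dataset card metadata.
train_examples, test_examples = 13125, 3282
train_bytes, test_bytes = 42127351.778204426, 10534245.221795576
dataset_size = 52661597.0

# Total number of prompts across both splits.
total_examples = train_examples + test_examples

# Fraction of examples that ended up in the train split.
train_share = train_examples / total_examples

# The per-split byte counts should add up to the reported dataset size.
bytes_consistent = math.isclose(train_bytes + test_bytes, dataset_size)

print(total_examples)         # 16407
print(round(train_share, 2))  # 0.8
print(bytes_consistent)       # True
```

The total of 16,407 examples matches the "16K+" figure above, and the split is roughly 80/20 between train and test.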
##### Tokenizer used
microsoft/BioGPT-Large (BPE tokenizer)
### Full prompt
```py
prompt = f"""You are a helpful AI Doctor who answers medical questions. Below is a question from a patient. Your task is to answer the questions as truthfully as you can.
### Patient:
{sample['Question']}
### Doctor:
{sample['Answer']}"""
```
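
A minimal sketch of how the template above might be applied to a single record. The `build_prompt` helper and the sample record are hypothetical and only for illustration; the `Question` and `Answer` keys are taken from the template.

```python
# Same wording as the template above, written with named placeholders.
PROMPT_TEMPLATE = (
    "You are a helpful AI Doctor who answers medical questions. "
    "Below is a question from a patient. "
    "Your task is to answer the questions as truthfully as you can.\n"
    "### Patient:\n"
    "{question}\n"
    "### Doctor:\n"
    "{answer}"
)


def build_prompt(sample: dict) -> str:
    """Fill the template from a record with 'Question' and 'Answer' keys."""
    return PROMPT_TEMPLATE.format(
        question=sample["Question"],
        answer=sample["Answer"],
    )


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "Question": "What is glaucoma?",
    "Answer": "Glaucoma is a group of eye diseases that damage the optic nerve.",
}

prompt = build_prompt(sample)
print(prompt)
```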
### Notes
Since BioGPT has a maximum input length of 1024 tokens, the full prompt was truncated to stay below this limit.
The truncation strategy I used ensured that the truncated text contains only full sentences.
Please note that this dataset is for research and testing only. It should not be used in a real setting or to give medical advice to people.
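
The exact truncation code is not given here, so the following is only a sketch of one way sentence-level truncation to a token budget could work. It makes two loud assumptions: sentences are split with a naive regular expression, and tokens are counted by a whitespace stand-in rather than the real BioGPT tokenizer. The function names are hypothetical.

```python
import re
from typing import Callable

MAX_TOKENS = 1024  # BioGPT input limit mentioned in the notes


def count_tokens_whitespace(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def truncate_to_full_sentences(
    text: str,
    max_tokens: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> str:
    """Keep leading whole sentences while the token count stays within max_tokens."""
    # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        candidate = " ".join(kept + [sentence])
        if count_tokens(candidate) > max_tokens:
            break  # adding this sentence would exceed the budget
        kept.append(sentence)
    return " ".join(kept)


text = "Glaucoma damages the optic nerve. It can cause vision loss. Early treatment helps."
print(truncate_to_full_sentences(text, max_tokens=10))
# Glaucoma damages the optic nerve. It can cause vision loss.
```

To approximate the real limit, a counter backed by the microsoft/BioGPT-Large tokenizer could be passed as `count_tokens` in place of the whitespace stand-in.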
Provider:
RobCzikkel
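Summary of original information
Dataset overview
Basic information
- Language: English
- Size category: 10K<n<100K
- Task category: conversational
- Pretty name: Doctor & Patient
Dataset details
- Features:
  - prompt: string
  - input_ids: sequence of int32
  - length: int64
  - attention_mask: sequence of int8
- Splits:
  - train: 42127351.778204426 bytes, 13125 examples
  - test: 10534245.221795576 bytes, 3282 examples
- Download size: 10917910 bytes
- Dataset size: 52661597.0 bytes
Configuration
- Default configuration:
  - Data files:
    - train: data/train-*
    - test: data/test-*
Tags
- biology
- medical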
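Dataset description
- This dataset is an edited and tokenized version of MedQuad-MedicalQnADataset by keivalya.
- The original dataset contains 16K+ doctor-patient questions and answers, which have been converted into full prompts for training Microsoft's BioGPT.
Tokenizer used
- microsoft/BioGPT-Large (BPE tokenizer)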
Full prompt

    prompt = f"""You are a helpful AI Doctor who answers medical questions. Below is a question from a patient. Your task is to answer the questions as truthfully as you can.
    ### Patient:
    {sample['Question']}
    ### Doctor:
    {sample['Answer']}"""

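Notes
- Because BioGPT has a maximum input length of 1024, the full prompt was truncated to stay below this limit.
- The truncation strategy ensures that only full sentences are produced.
- This dataset is for research/testing only; it should not be used in a real setting or to give medical advice to people.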



