RobCzikkel/DoctorGPT

Name: RobCzikkel/DoctorGPT
Creator: RobCzikkel
Published: 2023-12-05 23:05:53
License: not specified

Hugging Face. Updated: 2023-12-05. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/RobCzikkel/DoctorGPT

Resource description:

---
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- conversational
pretty_name: Doctor & Patient
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: input_ids
    sequence: int32
  - name: length
    dtype: int64
  - name: attention_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 42127351.778204426
    num_examples: 13125
  - name: test
    num_bytes: 10534245.221795576
    num_examples: 3282
  download_size: 10917910
  dataset_size: 52661597.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
tags:
- biology
- medical
---

### Dataset

This is an edited and tokenized version of the MedQuad-MedicalQnADataset dataset by keivalya. The original dataset contains 16K+ question-and-answer pairs between patient and doctor, which have been converted into a full prompt to train BioGPT by Microsoft.

##### Tokenizer used

microsoft/BioGPT-Large (BPE tokenizer)

### Full prompt

```py
prompt = f"""You are a helpful AI Doctor who answers medical questions. Below is a question from a patient. Your task is to answer the questions as truthfully as you can.

### Patient:
{sample['Question']}

### Doctor:
{sample['Answer']}"""
```

### Notes

Since BioGPT has a max input of 1024 tokens, the full prompt was truncated to stay below this limit. The truncation strategy I used made sure that only full sentences were produced.

Please note that this dataset is for research/testing only; it should not be used in a real setting or used to give medical advice to people.

Provided by:

RobCzikkel

Summary of original information

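Dataset overview

Basic information

Language: English
Size category: 10K<n<100K
Task category: conversational
Pretty name: Doctor & Patient

Dataset details

Features:
- prompt: string
- input_ids: sequence of int32
- length: int64
- attention_mask: sequence of int8
Splits:
- train: 42127351.778204426 bytes, 13125 examples
- test: 10534245.221795576 bytes, 3282 examples
Download size: 10917910 bytes
Dataset size: 52661597.0 bytes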
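
The split figures above can be sanity-checked with a few lines of arithmetic. The sketch below only reuses the numbers listed in the metadata; the variable names are illustrative. It suggests the split is roughly 80/20 by example count, and that the fractional byte counts are consistent with the total size being apportioned pro rata by example count (an inference from the numbers, not something the card states).

```python
import math

# Figures copied from the dataset metadata above.
SPLITS = {
    "train": {"num_bytes": 42127351.778204426, "num_examples": 13125},
    "test": {"num_bytes": 10534245.221795576, "num_examples": 3282},
}
DATASET_SIZE = 52661597.0  # bytes

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# Share of examples that ended up in the train split.
train_fraction = SPLITS["train"]["num_examples"] / total_examples

# Bytes the train split would have if the total were divided pro rata.
prorata_train_bytes = DATASET_SIZE * train_fraction

print(f"total examples: {total_examples}")
print(f"train fraction: {train_fraction:.4f}")
print(f"pro-rata train bytes: {prorata_train_bytes:.3f}")
```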

Configuration

Default configuration:
- Data files:
  - train: data/train-*
  - test: data/test-*
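
With a single default configuration and two splits, the dataset could be loaded roughly as sketched below. This assumes the third-party Hugging Face `datasets` package is installed and the Hub is reachable; the helper names `load_doctorgpt` and `check_split_sizes` are hypothetical and introduced only for illustration.

```python
REPO_ID = "RobCzikkel/DoctorGPT"

# Example counts taken from the split metadata on this page.
EXPECTED_SPLITS = {"train": 13125, "test": 3282}


def load_doctorgpt():
    """Download the default configuration (train and test splits).

    The import is kept inside the function so that this module can be
    imported even where `datasets` is not installed.
    """
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(REPO_ID)


def check_split_sizes(split_sizes: dict) -> bool:
    """Return True if the given {split: row_count} mapping matches the metadata."""
    return split_sizes == EXPECTED_SPLITS
```

A possible usage, once the package is available: `ds = load_doctorgpt()` followed by `check_split_sizes({name: ds[name].num_rows for name in ds})`.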

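Dataset description

This dataset is an edited and tokenized version of MedQuad-MedicalQnADataset by keivalya.
The original dataset contains 16K+ patient-doctor question-and-answer pairs, which have been converted into full prompts to train Microsoft's BioGPT.

Tokenizer used

microsoft/BioGPT-Large (BPE tokenizer)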
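
As a rough illustration of how a prompt might be turned into the columns listed for this dataset, the sketch below tokenizes one string with the named tokenizer. It assumes the third-party `transformers` package is installed (the BioGPT tokenizer is understood to also need `sacremoses`). The function name `tokenize_prompt` is hypothetical, and treating `length` as the number of token ids is an assumption rather than something the card confirms.

```python
MODEL_ID = "microsoft/BioGPT-Large"


def tokenize_prompt(prompt: str) -> dict:
    """Tokenize one prompt into a record shaped like the dataset's columns."""
    # Imported lazily so the module loads without transformers installed.
    from transformers import AutoTokenizer  # third-party

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    encoded = tokenizer(prompt)
    return {
        "prompt": prompt,
        "input_ids": encoded["input_ids"],
        # Assumption: `length` is the token count of the prompt.
        "length": len(encoded["input_ids"]),
        "attention_mask": encoded["attention_mask"],
    }
```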

Full prompt

prompt = f"""You are a helpful AI Doctor who answers medical questions. Below is a question from a patient. Your task is to answer the questions as truthfully as you can.

### Patient:
{sample['Question']}

### Doctor:
{sample['Answer']}"""
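
The template above can be wrapped in a small helper so it is applied uniformly to every question-answer pair. This is a minimal sketch: the name `build_prompt`, the exact line breaks inside the template (the page flattened them), and the sample record are all assumptions made for illustration.

```python
# Line breaks are an assumption; the listing page does not preserve them.
PROMPT_TEMPLATE = (
    "You are a helpful AI Doctor who answers medical questions. "
    "Below is a question from a patient. "
    "Your task is to answer the questions as truthfully as you can.\n\n"
    "### Patient:\n{question}\n\n"
    "### Doctor:\n{answer}"
)


def build_prompt(sample: dict) -> str:
    """Fill the template from a record with 'Question' and 'Answer' keys."""
    return PROMPT_TEMPLATE.format(
        question=sample["Question"], answer=sample["Answer"]
    )


# Illustrative record, not taken from the dataset.
example = {
    "Question": "What is glaucoma?",
    "Answer": "Glaucoma is a group of diseases that can damage the optic nerve.",
}
print(build_prompt(example))
```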

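Notes

Because BioGPT has a maximum input of 1024 tokens, the full prompt was truncated to stay below this limit.
The truncation strategy ensures that only complete sentences are produced.
This dataset is for research/testing only; it should not be used in a real setting or to give medical advice to people.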
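
The card does not show the truncation code, so the following is only one way a sentence-preserving truncation could work: keep whole sentences until adding the next one would exceed the budget. The whitespace word counter is a stand-in for the real BPE tokenizer, the regex sentence splitter is deliberately naive, and treating the limit as inclusive is an assumption; all function names are hypothetical.

```python
import re

MAX_TOKENS = 1024  # limit stated in the notes above


def count_tokens(text: str) -> int:
    """Stand-in counter: whitespace-separated words, not real BPE tokens."""
    return len(text.split())


def truncate_to_sentences(text, max_tokens=MAX_TOKENS, counter=count_tokens):
    """Keep leading whole sentences while the running text fits the budget."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        candidate = " ".join(kept + [sentence])
        if counter(candidate) > max_tokens:
            break  # adding this sentence would exceed the budget
        kept.append(sentence)
    return " ".join(kept)
```

Passing a tokenizer-based function as `counter` (for example, one that returns the number of `input_ids`) would make the budget reflect real token counts instead of words.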
