RobCzikkel/DoctorGPT

Hugging Face | Updated 2023-12-05 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/RobCzikkel/DoctorGPT
Resource description:
---
language:
- en
size_categories:
- 10K<n<100K
task_categories:
- conversational
pretty_name: Doctor & Patient
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: input_ids
    sequence: int32
  - name: length
    dtype: int64
  - name: attention_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 42127351.778204426
    num_examples: 13125
  - name: test
    num_bytes: 10534245.221795576
    num_examples: 3282
  download_size: 10917910
  dataset_size: 52661597.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
tags:
- biology
- medical
---

### Dataset

This is an edited and tokenized version of the MedQuad-MedicalQnADataset dataset by keivalya. The original dataset contains 16K+ questions and answers between patient and doctor, which have been converted into full prompts to train BioGPT by Microsoft.

##### Tokenizer used

microsoft/BioGPT-Large (BPE tokenizer)

### Full prompt

```py
prompt = f"""You are a helpful AI Doctor who answers medical questions. Below is a question from a patient. Your task is to answer the questions as truthfully as you can.

### Patient:

{sample['Question']}

### Doctor:

{sample['Answer']}"""
```

### Notes

Since BioGPT has a maximum input length of 1024 tokens, the full prompt was truncated to stay below this limit. The truncation strategy I used made sure that only full sentences were produced.

Please note that this dataset is for research/testing only; it should not be used in a real setting or used to give medical advice to people.
Provided by:
RobCzikkel
Summary of original information

Dataset overview

Basic information

  • Language: English
  • Size category: 10K<n<100K
  • Task category: conversational
  • Pretty name: Doctor & Patient

Dataset details

  • Features:
    • prompt: string
    • input_ids: sequence of int32
    • length: int64
    • attention_mask: sequence of int8
  • Splits:
    • train: 42127351.778204426 bytes, 13125 examples
    • test: 10534245.221795576 bytes, 3282 examples
  • Download size: 10917910 bytes
  • Dataset size: 52661597.0 bytes
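
The split figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed in the metadata. The reading that the fractional byte counts come from dividing the total size in proportion to the example counts is an inference from the numbers, not something the dataset card states.

```python
# Figures copied from the dataset metadata above.
TRAIN_EXAMPLES = 13125
TEST_EXAMPLES = 3282
TRAIN_BYTES = 42127351.778204426
TEST_BYTES = 10534245.221795576
DATASET_SIZE_BYTES = 52661597.0

total_examples = TRAIN_EXAMPLES + TEST_EXAMPLES
train_fraction = TRAIN_EXAMPLES / total_examples

# Assumption: per-split byte counts are the total size scaled by each
# split's share of the examples (this would explain the fractional bytes).
estimated_train_bytes = DATASET_SIZE_BYTES * train_fraction

print(total_examples)  # 16407
print(f"train share: {train_fraction:.4f}")
print(f"estimated train bytes: {estimated_train_bytes:.3f}")
```

The total of 16407 examples is consistent with the "16K+" figure in the description, and the train share is very close to an 80/20 split.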

Configuration

  • Default configuration:
    • Data files:
      • train: data/train-*
      • test: data/test-*
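
With a single default configuration and two splits, the dataset could plausibly be loaded through the Hugging Face `datasets` library. The following is a minimal sketch, assuming that library is installed and the repository is reachable. The helper names `load_doctorgpt` and `check_columns` are hypothetical and exist only for illustration.

```python
REPO_ID = "RobCzikkel/DoctorGPT"

# Column names taken from the feature list in the dataset metadata.
EXPECTED_COLUMNS = ("prompt", "input_ids", "length", "attention_mask")


def load_doctorgpt():
    """Download the default configuration (hypothetical helper).

    The import is done lazily so this module can be imported even when
    the `datasets` package is not installed.
    """
    from datasets import load_dataset

    return load_dataset(REPO_ID)


def check_columns(column_names):
    """Return True if every expected column is present."""
    return all(name in column_names for name in EXPECTED_COLUMNS)


# Example usage (requires network access):
#   ds = load_doctorgpt()
#   check_columns(ds["train"].column_names)
```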

Tags

  • biology
  • medical

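Dataset description

  • This dataset is an edited and tokenized version of the MedQuad-MedicalQnADataset dataset by keivalya.
  • The original dataset contains 16K+ question-and-answer pairs between patients and doctors, which have been converted into full prompts for training Microsoft's BioGPT.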

Tokenizer used

  • microsoft/BioGPT-Large (BPE tokenizer)

Full prompt

prompt = f"""You are a helpful AI Doctor who answers medical questions. Below is a question from a patient. Your task is to answer the questions as truthfully as you can.

### Patient:

{sample['Question']}

### Doctor:

{sample['Answer']}"""
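
To apply the template to each record, it can be wrapped in a small helper. This is a sketch rather than the author's code: the function name `build_prompt` is hypothetical, and the exact line breaks between sections are an assumption, since the source does not preserve them. The `Question` and `Answer` keys come from the template itself.

```python
from typing import Mapping

INSTRUCTION = (
    "You are a helpful AI Doctor who answers medical questions. "
    "Below is a question from a patient. "
    "Your task is to answer the questions as truthfully as you can."
)


def build_prompt(sample: Mapping[str, str]) -> str:
    """Fill the prompt template from one question/answer record.

    Assumption: sections are separated by blank lines; the original
    whitespace is not recoverable from the dataset page.
    """
    return (
        f"{INSTRUCTION}\n\n"
        f"### Patient:\n\n{sample['Question']}\n\n"
        f"### Doctor:\n\n{sample['Answer']}"
    )


example = build_prompt(
    {"Question": "What is influenza?", "Answer": "A viral infection."}
)
print(example)
```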

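Notes

  • Because BioGPT has a maximum input length of 1024 tokens, the full prompt was truncated to stay below this limit.
  • The truncation strategy ensured that only complete sentences were kept.
  • This dataset is for research/testing only; it should not be used in a real setting or to give medical advice to people.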
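
The source does not describe how the sentence-preserving truncation was implemented. One way such a strategy might look is sketched below, under stated assumptions: sentences are split with a simple regular expression, and token counting is delegated to a caller-supplied function (a whitespace word count by default, standing in for the BioGPT-Large tokenizer). The function name `truncate_to_full_sentences` is hypothetical.

```python
import re
from typing import Callable


def truncate_to_full_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Keep whole leading sentences while the token count fits the budget.

    Assumption: `count_tokens` approximates the real tokenizer; swap in a
    function based on the actual tokenizer for realistic counts.
    """
    if count_tokens(text) <= max_tokens:
        return text

    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    kept = []
    for sentence in sentences:
        candidate = " ".join(kept + [sentence])
        if count_tokens(candidate) > max_tokens:
            break
        kept.append(sentence)
    return " ".join(kept)


sample_text = "One two three. Four five six. Seven eight nine."
print(truncate_to_full_sentences(sample_text, max_tokens=7))
# One two three. Four five six.
```

Dropping whole trailing sentences, rather than cutting at the token limit, avoids training examples that end mid-sentence, at the cost of discarding some text that would otherwise have fit.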