
aai530-group6/ddxplus | Medical diagnosis dataset | Synthetic patient dataset

Source: hugging_face. Updated 2024-01-22. Indexed 2024-03-04.
Medical diagnosis
Synthetic patients
Download link:
https://hf-mirror.com/datasets/aai530-group6/ddxplus
Resource description:
---
language:
- en
license: cc-by-4.0
license_link: https://creativecommons.org/licenses/by/4.0/
tags:
- automatic-diagnosis
- automatic-symptom-detection
- differential-diagnosis
- synthetic-patients
- diseases
- health-care
pretty_name: DDXPlus
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- tabular-classification
task_ids:
- multi-class-classification
paperswithcode_id: ddxplus
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.csv"
  - split: test
    path: "test.csv"
  - split: validate
    path: "validate.csv"
extra_gated_prompt: "By accessing this dataset, you agree to use it solely for research purposes and not for clinical decision-making."
extra_gated_fields:
  Consent: checkbox
  Purpose of use:
    type: select
    options:
    - Research
    - Educational
    - label: Other
      value: other
train-eval-index:
- config: default
  task: medical-diagnosis
  task_id: binary-classification
  splits:
    train_split: train
    eval_split: validate
  col_mapping:
    AGE: AGE
    SEX: SEX
    PATHOLOGY: PATHOLOGY
    EVIDENCES: EVIDENCES
    INITIAL_EVIDENCE: INITIAL_EVIDENCE
    DIFFERENTIAL_DIAGNOSIS: DIFFERENTIAL_DIAGNOSIS
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 Score
---

# Dataset Description

We are releasing under the CC-BY licence a new large-scale dataset for Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the medical domain. The dataset contains patients synthesized using a proprietary medical knowledge base and a commercial rule-based AD system. Patients in the dataset are characterized by their socio-demographic data, a pathology they are suffering from, a set of symptoms and antecedents related to this pathology, and a differential diagnosis. The symptoms and antecedents can be binary, categorical and multi-choice, with the potential of leading to more efficient and natural interactions between ASD/AD systems and patients. To the best of our knowledge, this is the first large-scale dataset that includes the differential diagnosis, and non-binary symptoms and antecedents.

**Note**: We use evidence as a general term to refer to a symptom or an antecedent.

This directory contains the following files:

- **release_evidences.json**: a JSON file describing all possible evidences considered in the dataset.
- **release_conditions.json**: a JSON file describing all pathologies considered in the dataset.
- **release_train_patients.zip**: a CSV file containing the patients of the training set.
- **release_validate_patients.zip**: a CSV file containing the patients of the validation set.
- **release_test_patients.zip**: a CSV file containing the patients of the test set.

## Evidence Description

Each evidence in the `release_evidences.json` file is described using the following entries:

- **name**: name of the evidence.
- **code_question**: a code identifying which evidences are related. Evidences having the same `code_question` form a group of related symptoms. The value of the `code_question` refers to the evidence that needs to be simulated/activated for the other members of the group to be eventually simulated.
- **question_fr**: the query, in French, associated with the evidence.
- **question_en**: the query, in English, associated with the evidence.
- **is_antecedent**: a flag indicating whether the evidence is an antecedent or a symptom.
- **data_type**: the type of evidence. We use `B` for binary, `C` for categorical, and `M` for multi-choice evidences.
- **default_value**: the default value of the evidence. If this value is used to characterize the evidence, then it is as if the evidence was not synthesized.
- **possible-values**: the possible values for the evidence. Only valid for categorical and multi-choice evidences.
- **value_meaning**: the meaning, in French and English, of each code that is part of the `possible-values` field. Only valid for categorical and multi-choice evidences.

## Pathology Description

The file `release_conditions.json` contains information about the pathologies that patients in the dataset may suffer from. Each pathology has the following attributes:

- **condition_name**: name of the pathology.
- **cond-name-fr**: name of the pathology in French.
- **cond-name-eng**: name of the pathology in English.
- **icd10-id**: ICD-10 code of the pathology.
- **severity**: the severity associated with the pathology. The lower the value, the more severe the pathology.
- **symptoms**: data structure describing the set of symptoms characterizing the pathology. Each symptom is represented by its corresponding `name` entry in the `release_evidences.json` file.
- **antecedents**: data structure describing the set of antecedents characterizing the pathology. Each antecedent is represented by its corresponding `name` entry in the `release_evidences.json` file.

## Patient Description

Each patient in each of the 3 sets has the following attributes:

- **AGE**: the age of the synthesized patient.
- **SEX**: the sex of the synthesized patient.
- **PATHOLOGY**: name of the ground truth pathology (`condition_name` property in the `release_conditions.json` file) that the synthesized patient is suffering from.
- **EVIDENCES**: list of evidences experienced by the patient. An evidence can either be binary, categorical or multi-choice. A categorical or multi-choice evidence is represented in the format `[evidence-name]_@_[evidence-value]` where [`evidence-name`] is the name of the evidence (`name` entry in the `release_evidences.json` file) and [`evidence-value`] is a value from the `possible-values` entry. Note that for a multi-choice evidence, it is possible to have several `[evidence-name]_@_[evidence-value]` items in the evidence list, with each item being associated with a different evidence value. A binary evidence is represented as `[evidence-name]`.
- **INITIAL_EVIDENCE**: the evidence provided by the patient to kick-start an interaction with an ASD/AD system. This is useful during model evaluation for a fair comparison of ASD/AD systems as they will all begin an interaction with a given patient from the same starting point. The initial evidence is randomly selected from the binary evidences found in the evidence list mentioned above (i.e., `EVIDENCES`) and it is part of this list.
- **DIFFERENTIAL_DIAGNOSIS**: the ground truth differential diagnosis for the patient. It is represented as a list of pairs of the form `[[patho_1, proba_1], [patho_2, proba_2], ...]` where `patho_i` is the pathology name (`condition_name` entry in the `release_conditions.json` file) and `proba_i` is its related probability.

## Note:

We hope this dataset will encourage future works for ASD and AD systems that consider the differential diagnosis and the severity of pathologies. It is important to keep in mind that this dataset is formed of synthetic patients and is meant for research purposes. Given the assumptions made during the generation process of this dataset, we would like to emphasize that the dataset should not be used to train and deploy a model prior to performing rigorous evaluations of the model performance and verifying that the system has proper coverage and representation of the population that it will interact with.

It is important to understand that the level of specificity, sensitivity and confidence that a physician will seek when evaluating a patient will be influenced by the clinical setting. The dataset was built for acute care and biased toward high mortality and morbidity pathologies. Physicians will tend to consider negative evidences as equally important in such a clinical context in order to evaluate high acuity diseases.

In the creation of the DDXPlus dataset, a small subset of the diseases was chosen to establish a baseline. Medical professionals have to consider this very important point when reviewing the results of models trained with this dataset, as the differential is considerably smaller. A smaller differential means fewer potential evidences to collect. It is thus essential to understand this point when we look at the differential produced and the evidence collected by a model based on this dataset.

For more information, please check our [paper](https://arxiv.org/abs/2205.09148).
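The `EVIDENCES` encoding described in the dataset card can be illustrated with a minimal sketch. It assumes the evidence list has already been loaded as a list of strings; the evidence codes (`E_91`, `E_55`, `V_29`, and so on) and the helper name `split_evidences` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

SEPARATOR = "_@_"  # separator between evidence name and value, per the card


def split_evidences(tokens: List[str]) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Separate binary evidences from categorical/multi-choice ones."""
    binary: Set[str] = set()
    valued: Dict[str, List[str]] = {}
    for token in tokens:
        name, sep, value = token.partition(SEPARATOR)
        if sep:
            # Categorical or multi-choice: one value per token, possibly
            # several tokens for the same multi-choice evidence.
            valued.setdefault(name, []).append(value)
        else:
            # Binary evidence: the name alone means "present".
            binary.add(name)
    return binary, valued


# Hypothetical evidence codes, for illustration only.
tokens = ["E_91", "E_55_@_V_29", "E_55_@_V_55", "E_58_@_3"]
binary, valued = split_evidences(tokens)
initial_evidence = "E_91"
print(binary, valued, initial_evidence in binary)
```

The token list alone does not distinguish categorical from multi-choice evidences; that requires the `data_type` entry in `release_evidences.json`.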
Provided by:
aai530-group6
Summary of original information

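Dataset overview

Basic information

  • Language: English
  • License: CC-BY 4.0
  • Tags: automatic diagnosis, automatic symptom detection, differential diagnosis, synthetic patients, diseases, health care
  • Dataset name: DDXPlus
  • Dataset size: 1K<n<10K
  • Data source: original
  • Task category: tabular classification
  • Task ID: multi-class classification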

Configuration

  • Config name: default
  • Data files:
    • Training set: train.csv
    • Test set: test.csv
    • Validation set: validate.csv

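Additional requirements

  • Access condition: by accessing the dataset, you agree to use it solely for research purposes and not for clinical decision-making.
  • Fields:
    • Consent: checkbox
    • Purpose of use: select (Research, Educational, Other)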

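Training and evaluation

  • Config: default
  • Task: medical diagnosis
  • Task ID: binary classification
  • Splits:
    • Training split: train
    • Evaluation split: validate
  • Column mapping:
    • AGE: AGE
    • SEX: SEX
    • PATHOLOGY: PATHOLOGY
    • EVIDENCES: EVIDENCES
    • INITIAL_EVIDENCE: INITIAL_EVIDENCE
    • DIFFERENTIAL_DIAGNOSIS: DIFFERENTIAL_DIAGNOSIS
  • Evaluation metrics:
    • Accuracy (`accuracy`)
    • F1 Score (`f1`)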
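The two metrics listed above can be computed on predicted pathology labels as in the following sketch. The metadata does not say how F1 is averaged across classes, so macro averaging is an assumption here, and the condition names are placeholders.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of patients whose predicted pathology matches the ground truth."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels, for illustration only.
y_true = ["Condition A", "Condition B", "Condition A", "Condition C"]
y_pred = ["Condition A", "Condition A", "Condition A", "Condition C"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```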

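Dataset description

  • Content: patient data synthesized using a proprietary medical knowledge base and a commercial rule-based automatic diagnosis system. Each patient is characterized by socio-demographic data, a pathology, the related symptoms and antecedents, and a differential diagnosis.
  • Evidence: a general term for a symptom or an antecedent.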

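File description

  • release_evidences.json: describes all possible evidences in the dataset.
  • release_conditions.json: describes all pathologies in the dataset.
  • release_train_patients.zip: CSV file containing the training-set patients.
  • release_validate_patients.zip: CSV file containing the validation-set patients.
  • release_test_patients.zip: CSV file containing the test-set patients.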

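Evidence description

  • Fields:
    • name: evidence name
    • code_question: code identifying the group of related evidences
    • question_fr: query in French
    • question_en: query in English
    • is_antecedent: whether the evidence is an antecedent
    • data_type: evidence type (B: binary, C: categorical, M: multi-choice)
    • default_value: default value
    • possible-values: possible values
    • value_meaning: meaning of each value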
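As a rough sketch of how these fields fit together, the snippet below turns an evidence token into a readable line. The layout of `release_evidences.json` is assumed, not confirmed by this page: an object keyed by evidence name, with `value_meaning` mapping each code to `fr`/`en` strings. The sample entries and the `describe` helper are hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical excerpt shaped like the fields listed above (assumed layout).
EVIDENCES_JSON = """
{
  "E_91": {"name": "E_91", "question_en": "Do you have a fever?",
           "is_antecedent": false, "data_type": "B", "default_value": 0},
  "E_55": {"name": "E_55", "question_en": "Where do you feel pain?",
           "is_antecedent": false, "data_type": "M", "default_value": "V_123",
           "possible-values": ["V_123", "V_29"],
           "value_meaning": {"V_123": {"fr": "nulle part", "en": "nowhere"},
                             "V_29": {"fr": "front", "en": "forehead"}}}
}
"""


def describe(token: str, evidences: Dict[str, Dict[str, Any]]) -> str:
    """Render one EVIDENCES token using the English question and value meaning."""
    name, sep, value = token.partition("_@_")
    entry = evidences[name]
    kind = "antecedent" if entry["is_antecedent"] else "symptom"
    if not sep:
        # Binary evidence: presence in the list means "yes".
        return f"[{kind}] {entry['question_en']} -> yes"
    # Fall back to the raw code when no meaning is recorded for the value.
    meaning = entry.get("value_meaning", {}).get(value, {}).get("en", value)
    return f"[{kind}] {entry['question_en']} -> {meaning}"


evidences = json.loads(EVIDENCES_JSON)
print(describe("E_91", evidences))
print(describe("E_55_@_V_29", evidences))
```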

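Pathology description

  • Fields:
    • condition_name: pathology name
    • cond-name-fr: pathology name in French
    • cond-name-eng: pathology name in English
    • icd10-id: ICD-10 code
    • severity: severity (lower is more severe)
    • symptoms: symptoms
    • antecedents: antecedents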
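Because a lower `severity` value means a more severe pathology, ranking or filtering by severity needs an ascending sort. The sketch below uses placeholder conditions with invented severity values; only the `condition_name` and `severity` field names come from the description above.

```python
from typing import Any, Dict, List

# Placeholder entries; names and severity values are invented for illustration.
conditions: Dict[str, Dict[str, Any]] = {
    "Condition A": {"condition_name": "Condition A", "severity": 4},
    "Condition B": {"condition_name": "Condition B", "severity": 1},
    "Condition C": {"condition_name": "Condition C", "severity": 2},
}


def by_severity(conds: Dict[str, Dict[str, Any]]) -> List[str]:
    """Pathology names ordered from most to least severe (ascending value)."""
    ordered = sorted(conds.values(), key=lambda c: (c["severity"], c["condition_name"]))
    return [c["condition_name"] for c in ordered]


def at_least_as_severe(conds: Dict[str, Dict[str, Any]], threshold: int) -> List[str]:
    """Pathologies whose severity value is at or below the threshold."""
    return [name for name in by_severity(conds) if conds[name]["severity"] <= threshold]


print(by_severity(conditions))
print(at_least_as_severe(conditions, 2))
```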

Patient description

  • Fields:
    • AGE: age
    • SEX: sex
    • PATHOLOGY: pathology
    • EVIDENCES: evidences
    • INITIAL_EVIDENCE: initial evidence
    • DIFFERENTIAL_DIAGNOSIS: differential diagnosis
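One plausible way to work with the `DIFFERENTIAL_DIAGNOSIS` field, a list of `[pathology, probability]` pairs, is to rank it and compare a predicted set of pathologies against it. The set-based precision, recall, and F1 below are an illustrative choice rather than a metric prescribed by this page, and the condition names are placeholders.

```python
from typing import List, Sequence, Set, Tuple


def top_k(differential: Sequence[Sequence], k: int) -> List[str]:
    """Names of the k most probable pathologies in a differential."""
    ranked = sorted(differential, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def differential_scores(predicted: Set[str], truth: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 between two differentials."""
    overlap = len(predicted & truth)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder differential, for illustration only.
truth = [["Condition B", 0.3], ["Condition A", 0.5], ["Condition C", 0.2]]
predicted = {"Condition A", "Condition C", "Condition D"}
print(top_k(truth, 2))
print(differential_scores(predicted, {name for name, _ in truth}))
```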

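Notes

  • The dataset consists of synthetic patients and is intended for research purposes only.
  • Before training and deploying a model with this dataset, the model's performance must be rigorously evaluated and validated.
  • The dataset is biased toward pathologies with high mortality and morbidity.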
AI-compiled summary
Dataset introduction
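Construction
The DDXPlus dataset was built from a proprietary medical knowledge base and a commercial rule-based diagnosis system, which were used to synthesize patient records. It combines each patient's socio-demographic data, pathology, the symptoms and antecedents related to that pathology, and a differential diagnosis. The construction process aims to simulate patients as they would present in a real medical setting, providing a basis for training and evaluating automatic symptom detection and automatic diagnosis systems.
Usage
When using DDXPlus, researchers must comply with the Creative Commons BY 4.0 license. The dataset provides training, validation, and test sets, each stored as a CSV file inside a zip archive. After extracting the files, users read and analyze the data with the help of the evidence, pathology, and patient descriptions. The dataset is suited to developing and evaluating automatic symptom detection and automatic diagnosis systems, but it is synthetic, intended for research only, and should not be used directly for clinical decision-making.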
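Reading an extracted patient file might look like the sketch below. It assumes that the `EVIDENCES` and `DIFFERENTIAL_DIAGNOSIS` columns are serialized as Python-style list literals, which this page does not state; the sample row and the `read_patients` helper are hypothetical.

```python
import ast
import csv
import io
from typing import Any, Dict, List, TextIO

# Hypothetical one-row sample using the documented column names.
SAMPLE_CSV = '''AGE,SEX,PATHOLOGY,EVIDENCES,INITIAL_EVIDENCE,DIFFERENTIAL_DIAGNOSIS
35,F,Condition A,"['E_91', 'E_55_@_V_29']",E_91,"[['Condition A', 0.7], ['Condition B', 0.3]]"
'''

# Columns assumed to hold list literals (an assumption about the file format).
LIST_COLUMNS = ("EVIDENCES", "DIFFERENTIAL_DIAGNOSIS")


def read_patients(handle: TextIO) -> List[Dict[str, Any]]:
    """Parse patient rows, converting AGE to int and list columns to lists."""
    patients = []
    for row in csv.DictReader(handle):
        patient: Dict[str, Any] = dict(row)
        patient["AGE"] = int(row["AGE"])
        for column in LIST_COLUMNS:
            # literal_eval only accepts Python literals, so it does not run code.
            patient[column] = ast.literal_eval(row[column])
        patients.append(patient)
    return patients


patients = read_patients(io.StringIO(SAMPLE_CSV))
print(patients[0]["PATHOLOGY"], patients[0]["EVIDENCES"])
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`, where `path` points to one of the extracted CSV files.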
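Background and challenges
Background overview
DDXPlus was created in the context of research on automatic symptom detection and automatic diagnosis systems in the medical domain. It consists of synthetic patients generated from a proprietary medical knowledge base and a commercial rule-based diagnosis system. Each patient has demographic information, a pathology, the symptoms and antecedents related to that pathology, and a differential diagnosis. According to its authors, it is the first large-scale dataset to include the differential diagnosis together with non-binary symptoms and antecedents. It is released under the Creative Commons BY 4.0 license to advance research on automatic symptom detection and automatic diagnosis, and it may help make interactions between such systems and patients more efficient and natural.
Current challenges
Challenges in constructing DDXPlus include ensuring that synthetic patients are realistic and representative while covering a wide range of symptoms and antecedents, balancing disease severity and mortality, and handling the limited range of diseases in the differential diagnosis. Challenges in applying the dataset include making effective use of the differential diagnosis in multi-class classification tasks, and accounting for the specificity, sensitivity, and confidence that a clinical setting requires during model training and evaluation. Addressing them calls for continued exploration and refinement in both research and application.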
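Common scenarios
Typical use case
In the medical domain, the typical use of DDXPlus is building automatic symptom detection (ASD) and automatic diagnosis (AD) systems. By simulating patients' socio-demographic data, pathologies, related symptoms and antecedents, and differential diagnoses, the dataset gives researchers the information needed to train models that identify symptoms from patient data and produce a preliminary diagnosis.
Academic problems addressed
DDXPlus addresses the lack of comprehensive differential diagnosis information in earlier diagnosis datasets. It includes non-binary symptoms and antecedents, which matters for improving the accuracy and efficiency of diagnosis systems. It also covers pathologies of differing severity, which supports research into the relationship between disease severity and symptoms.
Practical applications
In practice, DDXPlus can be used to develop and evaluate diagnosis systems aimed at acute care, particularly for pathologies with high mortality and morbidity. Such systems may help medical services respond to acute disease more quickly and accurately.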
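Recent research on the dataset
Latest research directions
DDXPlus provides a large-scale training resource for automatic symptom detection and automatic diagnosis systems in the medical domain. Its distinguishing feature is the inclusion of the differential diagnosis and of non-binary symptoms and antecedents. Recent research focuses on using the dataset to improve the accuracy and specificity of diagnosis models, particularly for acute conditions and high-mortality pathologies. The dataset has prompted researchers to explore deeper model architectures and to model the relationship between symptoms and the differential diagnosis more precisely, with the aim of improving the clinical applicability of models and supporting future medical decision-making.
The content above was collected and summarized by AI.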