PTB-XL ECG dataset

Name: PTB-XL ECG dataset
Creator: www.kaggle.com
Published: 2021-02-03 00:00:00
License: No description available

www.kaggle.com | Updated 2021-02-03 | Indexed 2025-03-25

Download link:

https://www.kaggle.com/khyeh0719/ptb-xl-dataset

Description:

Source: https://physionet.org/content/ptb-xl/1.0.1/

# Abstract

Electrocardiography (ECG) is a key diagnostic tool for assessing a patient's cardiac condition. Automatic ECG interpretation algorithms used as diagnosis support systems promise substantial relief for medical personnel, if only because of the sheer number of ECGs that are routinely taken. However, the development of such algorithms requires large training datasets and clear benchmark procedures. In our opinion, existing freely accessible ECG datasets cover neither aspect satisfactorily.

The PTB-XL ECG dataset is a large dataset of 21837 clinical 12-lead ECGs of 10 seconds length from 18885 patients. The raw waveform data was annotated by up to two cardiologists, who assigned potentially multiple ECG statements to each record. In total, 71 different ECG statements conform to the SCP-ECG standard and cover diagnostic, form, and rhythm statements. To ensure comparability of machine learning algorithms trained on the dataset, we provide recommended splits into training and test sets. In combination with the extensive annotation, this turns the dataset into a rich resource for training and evaluating automatic ECG interpretation algorithms. The dataset is complemented by extensive metadata on demographics, infarction characteristics, likelihoods for diagnostic ECG statements, and annotated signal properties.

# Background

The waveform data underlying the PTB-XL ECG dataset was collected with devices from Schiller AG over the course of nearly seven years between October 1989 and June 1996. With the acquisition of the original database from Schiller AG, the full usage rights were transferred to the PTB. The records were curated and converted into a structured database within a long-term project at the Physikalisch-Technische Bundesanstalt (PTB). The database was used in a number of publications, see e.g. [1,2], but access remained restricted until now.
The Institutional Ethics Committee approved the publication of the anonymous data in an open-access database (PTB-2020-1). During the public release process in 2019, the existing database was streamlined with particular regard to usability and accessibility for the machine learning community. Waveform and metadata were converted to open data formats that can easily be processed by standard software.

# Methods

## Data Acquisition

1. Raw signal data was recorded and stored in a proprietary compressed format. For all signals, we provide the standard set of 12 leads (I, II, III, AVL, AVR, AVF, V1, ..., V6) with reference electrodes on the right arm.
2. The corresponding general metadata (such as age, sex, weight and height) was collected in a database.
3. Each record was annotated with a report string (written by a cardiologist or generated by the automatic interpretation of the ECG device), which was converted into a standardized set of SCP-ECG statements (`scp_codes`). For most records, the heart's axis (`heart_axis`) and the infarction stage (`infarction_stadium1` and `infarction_stadium2`, if present) were also extracted.
4. A large fraction of the records was validated by a second cardiologist.
5. All records were validated by a technical expert focusing mainly on signal characteristics.

## Data Preprocessing

ECGs and patients are identified by unique identifiers (`ecg_id` and `patient_id`). Personal information in the metadata, such as the names of validating cardiologists and nurses and the recording site (hospital etc.), was pseudonymized. The date of birth is given only as the age at the time of the ECG recording, where ages of more than 89 years appear in the range of 300 years in compliance with HIPAA standards. Furthermore, all ECG recording dates were shifted by a random offset for each patient. The ECG statements used for annotating the records follow the SCP-ECG standard [3].
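
Because ages above 89 years are replaced by placeholder values, the age field cannot be fed into statistics unchanged. The following is a minimal sketch of one way to mask such values. It assumes the placeholders sit well above any real age; the cutoff of 120 and the helper name `clean_age` are illustrative choices and not part of the dataset.

```python
from typing import Optional

# Assumption: any value above this cutoff is a placeholder for "older than 89",
# since the description says such ages appear "in the range of 300 years".
PLACEHOLDER_CUTOFF = 120.0


def clean_age(raw_age: float) -> Optional[float]:
    """Return the age in years, or None if it is a placeholder value."""
    if raw_age > PLACEHOLDER_CUTOFF:
        return None
    return raw_age


# Hypothetical age values, not taken from the dataset.
ages = [62.0, 300.0, 17.0]
cleaned = [clean_age(a) for a in ages]
print(cleaned)  # [62.0, None, 17.0]
```

Returning `None` rather than clipping to 89 keeps the masked records distinguishable, so a later step can decide whether to drop or impute them.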

## Data Description

In general, the dataset is organized as follows:

```
ptbxl
├── ptbxl_database.csv
├── scp_statements.csv
├── records100
│   ├── 00000
│   │   ├── 00001_lr.dat
│   │   ├── 00001_lr.hea
│   │   ├── ...
│   │   ├── 00999_lr.dat
│   │   └── 00999_lr.hea
│   ├── ...
│   └── 21000
│       ├── 21001_lr.dat
│       ├── 21001_lr.hea
│       ├── ...
│       ├── 21837_lr.dat
│       └── 21837_lr.hea
└── records500
    ├── 00000
    │   ├── 00001_hr.dat
    │   ├── 00001_hr.hea
    │   ├── ...
    │   ├── 00999_hr.dat
    │   └── 00999_hr.hea
    ├── ...
    └── 21000
        ├── 21001_hr.dat
        ├── 21001_hr.hea
        ├── ...
        ├── 21837_hr.dat
        └── 21837_hr.hea
```

The dataset comprises 21837 clinical 12-lead ECG records of 10 seconds length from 18885 patients, of whom 52% are male and 48% are female, with ages covering the whole range from 0 to 95 years (median 62, interquartile range 22). The value of the dataset results from the comprehensive collection of many different co-occurring pathologies, but also from a large proportion of healthy control samples. The distribution of diagnoses is as follows, where for simplicity we restrict ourselves to diagnostic statements aggregated into superclasses (note: the sum of statements exceeds the number of records because a record can carry multiple labels):

```
Records | Superclass | Description
   9528 | NORM       | Normal ECG
   5486 | MI         | Myocardial Infarction
   5250 | STTC       | ST/T Change
   4907 | CD         | Conduction Disturbance
   2655 | HYP        | Hypertrophy
```

The waveform files are stored in WaveForm DataBase (WFDB) format with 16-bit precision at a resolution of 1 μV/LSB and a sampling frequency of 500 Hz (`records500/`). For the user's convenience we also release a downsampled version of the waveform data at a sampling frequency of 100 Hz (`records100/`).

All relevant metadata is stored in `ptbxl_database.csv` with one row per record identified by `ecg_id`. It contains 28 columns that can be categorized into:

1. Identifiers: Each record is identified by a unique `ecg_id`. The corresponding patient is encoded via `patient_id`.
   The paths to the original record (500 Hz) and a downsampled version of the record (100 Hz) are stored in `filename_hr` and `filename_lr`.
2. General Metadata: demographic and recording metadata such as `age`, `sex`, `height`, `weight`, `nurse`, `site`, `device` and `recording_date`.
3. ECG statements: core components are `scp_codes` (SCP-ECG statements as a dictionary with entries of the form `statement: likelihood`, where the likelihood is set to 0 if unknown) and `report` (report string). Additional fields are `heart_axis`, `infarction_stadium1`, `infarction_stadium2`, `validated_by`, `second_opinion`, `initial_autogenerated_report` and `validated_by_human`.
4. Signal Metadata: signal quality such as noise (`static_noise` and `burst_noise`), baseline drifts (`baseline_drift`) and other artifacts such as `electrodes_problems`. We also provide `extra_beats` for counting extra systoles and `pacemaker` for signal patterns indicating an active pacemaker.
5. Cross-validation Folds: recommended 10-fold train-test splits (`strat_fold`) obtained via stratified sampling while respecting patient assignments, i.e. all records of a particular patient were assigned to the same fold. Records in folds 9 and 10 underwent at least one human evaluation and are therefore of particularly high label quality. We therefore propose to use folds 1-8 as the training set, fold 9 as the validation set and fold 10 as the test set.

All information related to the annotation scheme is stored in a dedicated `scp_statements.csv`, which was enriched with mappings to other annotation standards such as AHA, aECGREFID, CDISC and DICOM. We provide additional side information such as the category each statement can be assigned to (diagnostic, form and/or rhythm). For diagnostic statements, we also provide a proposed hierarchical organization into `diagnostic_class` and `diagnostic_subclass`.
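
To make the `scp_codes` and `strat_fold` columns concrete, here is a minimal sketch that parses the statement dictionary and applies the proposed fold assignment. The rows are hypothetical stand-ins for `ptbxl_database.csv`, and it is an assumption that `scp_codes` is serialized as a Python dict literal; the helper names are illustrative.

```python
import ast
from collections import defaultdict

# Hypothetical rows mimicking three columns of ptbxl_database.csv.
# Assumption: scp_codes is stored as a Python dict literal in the CSV cell.
rows = [
    {"ecg_id": "1", "scp_codes": "{'NORM': 100.0}", "strat_fold": "3"},
    {"ecg_id": "2", "scp_codes": "{'NORM': 80.0, 'SR': 0.0}", "strat_fold": "9"},
    {"ecg_id": "3", "scp_codes": "{'IMI': 50.0}", "strat_fold": "10"},
]


def parse_scp_codes(cell: str) -> dict:
    """Turn the stored string into a {statement: likelihood} dict."""
    return ast.literal_eval(cell)


def assign_split(fold: int) -> str:
    """Map a strat_fold value to the proposed train/validation/test split."""
    if 1 <= fold <= 8:
        return "train"
    if fold == 9:
        return "val"
    if fold == 10:
        return "test"
    raise ValueError(f"unexpected fold: {fold}")


splits = defaultdict(list)
for row in rows:
    codes = parse_scp_codes(row["scp_codes"])
    splits[assign_split(int(row["strat_fold"]))].append((row["ecg_id"], codes))

print({name: [ecg_id for ecg_id, _ in items] for name, items in splits.items()})
# {'train': ['1'], 'val': ['2'], 'test': ['3']}
```

Since every record of a patient shares one fold, splitting on `strat_fold` alone keeps patients from leaking between the training, validation and test sets.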

## Usage Notes

In `example_physionet.py` we provide a minimal usage example that shows how to load waveform data (NumPy arrays `X_train` and `X_test`) and labels (`y_train` and `y_test`) using the proposed train-test split. For illustration, we use diagnostic subclass statements as labels based on the assignments in `scp_statements.csv`.
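
The loading steps described above might look roughly like the sketch below. It is not the contents of `example_physionet.py`. The mapping is a hypothetical two-entry excerpt standing in for `scp_statements.csv`, it aggregates to `diagnostic_class` rather than the subclass, and it assumes that `filename_lr` and `filename_hr` hold paths relative to the dataset root without a file extension. Reading the signal relies on `rdsamp` from the third-party `wfdb` package.

```python
import os

# Hypothetical excerpt of scp_statements.csv: statement -> diagnostic_class.
# Non-diagnostic statements are simply absent from the mapping.
DIAGNOSTIC_CLASS = {"NORM": "NORM", "IMI": "MI"}


def aggregate_superclasses(scp_codes: dict) -> list:
    """Collect the distinct diagnostic superclasses for one record."""
    classes = {DIAGNOSTIC_CLASS[c] for c in scp_codes if c in DIAGNOSTIC_CLASS}
    return sorted(classes)


def load_waveform(base_dir: str, filename: str):
    """Read one record and return its signal array (samples x leads).

    Assumes `filename` is a filename_lr / filename_hr value given
    relative to `base_dir` and without a file extension.
    """
    import wfdb  # third-party package, imported lazily

    signal, _fields = wfdb.rdsamp(os.path.join(base_dir, filename))
    return signal


print(aggregate_superclasses({"IMI": 50.0, "SR": 0.0}))  # ['MI']
```

With the dataset downloaded, calling `load_waveform` on each `filename_lr` value in folds 1-8 and stacking the results would give an array playing the role of `X_train`. Using `diagnostic_subclass` as labels follows the same pattern with a different mapping column.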

Provider:

www.kaggle.com

The above content was collected and summarized by 遇见数据集.
