PandhereAnu/telehealth-pii-dataset
Source: Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/PandhereAnu/telehealth-pii-dataset
Resource description:
---
dataset_info:
  features:
  - name: tokens
    list: string
  - name: ner_tags
    list:
      class_label:
        names:
          '0': '0'
          '1': B-PATIENT
          '2': I-PATIENT
          '3': B-DOCTOR
          '4': I-DOCTOR
          '5': B-MRN
          '6': I-MRN
          '7': B-PHONE
          '8': I-PHONE
          '9': B-DATE
          '10': I-DATE
  splits:
  - name: train
    num_bytes: 483917
    num_examples: 1600
  - name: validation
    num_bytes: 60549
    num_examples: 200
  - name: test
    num_bytes: 60358
    num_examples: 200
  download_size: 607550
  dataset_size: 604824
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# Telehealth PII Dataset
A synthetic dataset for training NER models to detect
and redact HIPAA-sensitive PII from telehealth transcripts.
## Dataset Description
A custom-built dataset of 1600 labeled sentences covering
real-world telehealth scenarios. It was created because real
patient data is protected under HIPAA and cannot be
shared publicly.
## Dataset Structure
| Split      | Examples |
|------------|----------|
| Train      | 1600     |
| Validation | 200      |
| Test       | 200      |
## Features
- `tokens`: list of word-level tokens (strings) for each sentence
- `ner_tags`: one BIO label per token, stored as an integer class ID (0 to 10)
## Label Classes
| Label | Description |
|------------|------------------------|
| O | Not PII |
| B/I-PATIENT| Patient name |
| B/I-DOCTOR | Provider name |
| B/I-MRN | Medical record number |
| B/I-PHONE | Phone number |
| B/I-DATE | Appointment/birth date |
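
The BIO labels in the table can be decoded into entity spans with a few lines of plain Python. The following is a minimal sketch, not part of the dataset's own tooling: the `LABELS` list mirrors the `class_label` names in the card metadata (assuming ID 0 is the outside tag, written `O` as in the table), and `bio_to_spans` and the sample sentence are hypothetical.

```python
# Label names in ID order, copied from the card metadata.
# Assumption: ID 0 is the outside tag, written "O" as in the label table.
LABELS = [
    "O",
    "B-PATIENT", "I-PATIENT",
    "B-DOCTOR", "I-DOCTOR",
    "B-MRN", "I-MRN",
    "B-PHONE", "I-PHONE",
    "B-DATE", "I-DATE",
]


def bio_to_spans(tag_ids):
    """Return (entity_type, start, end) spans, with `end` exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag_id in enumerate(tag_ids):
        prefix, _, entity = LABELS[tag_id].partition("-")
        # An I- tag continues the open span only if the type matches.
        if prefix == "I" and entity == current_type:
            continue
        # Anything else closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # A B- tag (or a stray I- tag) opens a new span.
        if prefix in ("B", "I"):
            current_type, start = entity, i
    if current_type is not None:
        spans.append((current_type, start, len(tag_ids)))
    return spans


# Hypothetical example row, not taken from the dataset.
tokens = ["Dr.", "Lee", "will", "call", "Maria", "Gomez", "on", "March", "3"]
ner_tags = [3, 4, 0, 0, 1, 2, 0, 9, 10]

for entity, start, end in bio_to_spans(ner_tags):
    print(entity, " ".join(tokens[start:end]))
```

Treating a stray `I-` tag as the start of a new span is one possible convention; stricter evaluation code may prefer to discard such tags instead.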
## How to Use
```python
from datasets import load_dataset
dataset = load_dataset("PandhereAnu/telehealth-pii-dataset")
print(dataset)
```
## Scenarios Covered
- Receptionist to patient calls
- Doctor scheduling notes
- Pharmacy and billing calls
- Prescription refill reminders
- Hospital discharge summaries
- Emergency ward checkups
- Insurance form calls
- Nurse patient reminders
## Intended Use
Training NER models for healthcare transcript
de-identification and HIPAA compliance automation.
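
As a rough illustration of the de-identification step, the sketch below replaces each tagged entity with a `[TYPE]` placeholder. It assumes the tag IDs come from a trained model or from the dataset's `ner_tags` column; the `redact` helper, the placeholder format, and the sample sentence are hypothetical, and ID 0 is assumed to be the outside tag.

```python
# Label names in ID order, copied from the card metadata.
# Assumption: ID 0 is the outside tag ("O" in the label table).
LABELS = ["O", "B-PATIENT", "I-PATIENT", "B-DOCTOR", "I-DOCTOR",
          "B-MRN", "I-MRN", "B-PHONE", "I-PHONE", "B-DATE", "I-DATE"]


def redact(tokens, tag_ids):
    """Replace each tagged entity with a single [TYPE] placeholder."""
    out = []
    previous_type = None
    for token, tag_id in zip(tokens, tag_ids):
        label = LABELS[tag_id]
        if label == "O":
            out.append(token)
            previous_type = None
            continue
        prefix, entity = label.split("-", 1)
        # Emit one placeholder per entity: skip I- tokens that
        # continue an entity of the same type.
        if prefix == "B" or entity != previous_type:
            out.append(f"[{entity}]")
        previous_type = entity
    return " ".join(out)


# Hypothetical transcript line with tags a model might predict.
tokens = ["Patient", "John", "Smith", "MRN", "48213", "called", "from", "555-0142"]
tag_ids = [0, 1, 2, 0, 5, 0, 0, 7]

print(redact(tokens, tag_ids))
```

Joining tokens with single spaces loses the original spacing and punctuation layout, so a production redactor would more likely work on character offsets.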
## Limitations
- Synthetic data only
- English language
- Limited PII variety per template
Provider:
PandhereAnu



