ysdede/parrot-radiology-asr-en
Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/ysdede/parrot-radiology-asr-en
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcription
    dtype: string
  - name: speaker
    dtype: string
  - name: gender
    dtype: string
  - name: speed
    dtype: float32
  - name: volume
    dtype: float32
  - name: sample_rate
    dtype: int32
  splits:
  - name: test
    num_bytes: 95068084
    num_examples: 948
  - name: train
    num_bytes: 758274271
    num_examples: 7587
  - name: validation
    num_bytes: 94576592
    num_examples: 949
  download_size: 942590176
  dataset_size: 947918947
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
task_categories:
- automatic-speech-recognition
language:
- en
tags:
- medical
license: cc
---
# PARROT Radiology ASR Dataset (Synthetic Speech)
## Dataset Description
This dataset contains synthetic English radiology speech paired with transcriptions. It is designed for training and evaluating radiology-focused Automatic Speech Recognition models, speech LLMs, and multimodal medical AI systems. All audio is generated from the **PARROT v1.0** radiology report corpus, a multilingual collection of fictional reports authored by expert radiologists from 21 countries.
## Dataset Summary
* **Language**: English (translated from 14 languages)
* **Domain**: Medical radiology
* **Task**: Automatic Speech Recognition
* **Audio Duration**: ~55 hours
* **Samples**: 9,484
* **Audio Format**: MP3 VBR q5, 16 kHz mono
* **Speech Generation**: Kokoro TTS 82M v0.1.0
* **File Format**: Parquet
## Splits
| Split | Samples | Duration (h) | Avg Length (s) |
| ---------- | ------- | ------------ | -------------- |
| Train | 7,587 | 43.91 | 20.83 |
| Test | 948 | 5.52 | 20.96 |
| Validation | 949 | 5.49 | 20.82 |
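As a quick consistency check, the average lengths can be recomputed from the per-split sample counts and durations. The sketch below uses only the figures listed in the table; the names `SPLITS` and `average_length_seconds` are illustrative. Because the durations are rounded to two decimals, the recomputed averages are only expected to agree with the listed ones to within about 0.01 s.

```python
# Figures copied from the Splits table:
# (samples, duration in hours, listed average length in seconds).
SPLITS = {
    "train": (7587, 43.91, 20.83),
    "test": (948, 5.52, 20.96),
    "validation": (949, 5.49, 20.82),
}

def average_length_seconds(samples: int, hours: float) -> float:
    """Average clip length in seconds from a sample count and a duration in hours."""
    return hours * 3600 / samples

total_samples = sum(samples for samples, _, _ in SPLITS.values())
total_hours = sum(hours for _, hours, _ in SPLITS.values())

for name, (samples, hours, listed) in SPLITS.items():
    recomputed = average_length_seconds(samples, hours)
    # Durations are rounded to two decimals, so allow a small tolerance.
    status = "ok" if abs(recomputed - listed) < 0.01 else "mismatch"
    print(f"{name}: recomputed {recomputed:.2f} s, listed {listed:.2f} s ({status})")

print(f"total: {total_samples} samples, {total_hours:.2f} h")
```

The totals come out at 9,484 samples and 54.92 hours, in line with the "~55 hours" figure in the summary.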
## Dataset Creation
### Text Processing
* Extracted the English translations from PARROT v1.0 JSONL files.
* Cleaned, normalized, and standardized radiology terminology and structural markers.
* Prepared two text forms per report using Gemini 2.0 Flash Thinking:
  * standardized written text
  * spoken-style, TTS-ready script
### Speech Synthesis
* Generated audio using Kokoro TTS v0.1.0.
* Assigned multiple synthetic speakers across reports.
* Randomized speed and volume for variability.
* Produced continuous WAV files, then chunked them into segments under 30 seconds at natural boundaries.
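The chunking step can be pictured as a greedy grouping of consecutive units whose boundaries are safe cut points. The sketch below is an assumption about how such a step might work, not the pipeline actually used: it treats sentence ends as the natural boundaries, takes hypothetical per-sentence durations as input, and `chunk_by_duration` and `MAX_SEGMENT_SECONDS` are illustrative names.

```python
from typing import List

MAX_SEGMENT_SECONDS = 30.0  # segments must stay under this limit

def chunk_by_duration(durations: List[float],
                      max_seconds: float = MAX_SEGMENT_SECONDS) -> List[List[int]]:
    """Greedily group consecutive sentence indices into segments shorter than max_seconds.

    A sentence that alone reaches the limit still gets its own segment; a real
    pipeline would need to split it further, for example at a pause.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    current_total = 0.0
    for index, duration in enumerate(durations):
        # Close the running segment if adding this sentence would reach the limit.
        if current and current_total + duration >= max_seconds:
            segments.append(current)
            current, current_total = [], 0.0
        current.append(index)
        current_total += duration
    if current:
        segments.append(current)
    return segments

# Hypothetical sentence durations (seconds) for one synthesized report.
sentence_durations = [8.2, 11.5, 9.7, 6.1, 14.0, 12.3, 5.4]
segments = chunk_by_duration(sentence_durations)
print(segments)  # [[0, 1, 2], [3, 4], [5, 6]]
```

Cutting only between whole sentences keeps each segment's transcription aligned with its audio, at the cost of segments that are often well short of the 30-second limit.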
### ASR Alignment
* Matched the transcription vocabulary to that of NVIDIA NeMo Parakeet TDT v2.
* Applied normalization rules and markup conversions.
* Verified full compatibility across all 9,484 samples.
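A compatibility check of this kind might scan every transcription for characters outside the set the target ASR model can emit. The sketch below is illustrative only: `ALLOWED_CHARS` is a made-up placeholder rather than the actual Parakeet TDT v2 vocabulary, `find_unsupported_chars` is a hypothetical helper, and the example sentences are invented.

```python
import string
from typing import Dict, Iterable, Set

# Placeholder character set (an assumption, NOT the real model vocabulary).
ALLOWED_CHARS: Set[str] = set(string.ascii_letters + string.digits + " .,?!'-")

def find_unsupported_chars(transcriptions: Iterable[str],
                           allowed: Set[str] = ALLOWED_CHARS) -> Dict[int, Set[str]]:
    """Map the index of each offending transcription to its unsupported characters."""
    problems: Dict[int, Set[str]] = {}
    for index, text in enumerate(transcriptions):
        unsupported = set(text) - allowed
        if unsupported:
            problems[index] = unsupported
    return problems

# Invented example sentences; the second contains parentheses,
# which are outside the placeholder set.
examples = [
    "No acute intracranial abnormality.",
    "Nodule measures 4 x 6 mm (stable).",
]
problems = find_unsupported_chars(examples)
print(problems)
```

An empty result over all samples would correspond to the "full compatibility" outcome described above; any non-empty entry points at a transcription that still needs a normalization rule.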
### Packaging
* Converted WAV to MP3 VBR q5.
* Created the Hugging Face dataset with the `Audio` feature type.
* Metadata includes speaker, gender, speed, volume, and transcription.
* Splits follow an 80/10/10 ratio with seed 42.
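An 80/10/10 split with a fixed seed can be reproduced deterministically by shuffling sample indices with a seeded generator. The sketch below uses Python's standard `random` module as a stand-in; this card does not say which tool produced the actual splits, so the index assignment here will not match the published files, and `split_indices` is an illustrative name. With integer arithmetic, the split sizes do come out at 7,587 / 948 / 949.

```python
import random
from typing import Dict, List

def split_indices(num_samples: int, seed: int = 42) -> Dict[str, List[int]]:
    """Shuffle indices with a fixed seed and cut them 80/10/10 into train/test/validation."""
    indices = list(range(num_samples))
    # A local generator keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_train = num_samples * 8 // 10
    n_test = num_samples // 10
    return {
        "train": indices[:n_train],
        "test": indices[n_train:n_train + n_test],
        "validation": indices[n_train + n_test:],  # remainder
    }

split_map = split_indices(9484)
print({name: len(ids) for name, ids in split_map.items()})
# {'train': 7587, 'test': 948, 'validation': 949}
```

Giving the remainder to the last split guarantees that every sample lands in exactly one split even when the total is not divisible by ten.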
## Dataset Structure
Each record contains:
* **audio**: 16 kHz mono MP3
* **transcription**: text transcription
* **speaker**: synthetic voice ID
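* **gender**: gender of the synthetic voice
* **speed**: speech rate multiplier
* **volume**: volume setting applied during synthesis
* **sample_rate**: audio sample rate (integer)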
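A minimal sketch of reading records with the Hugging Face `datasets` library, assuming the library and an audio decoding backend are installed and the repository ID from the top of this card is reachable. `EXPECTED_FIELDS`, `missing_fields`, and `stream_split` are illustrative names, not part of any API. Streaming is used so that the download of roughly 940 MB is not pulled in full.

```python
from typing import Any, Dict, Iterator, List

REPO_ID = "ysdede/parrot-radiology-asr-en"

# Field names taken from the dataset_info block of this card.
EXPECTED_FIELDS = ["audio", "transcription", "speaker", "gender",
                   "speed", "volume", "sample_rate"]

def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]

def stream_split(split: str = "validation") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over one split; needs network access and the `datasets` package."""
    # Imported here so the helper above works without the library installed.
    from datasets import load_dataset
    return iter(load_dataset(REPO_ID, split=split, streaming=True))

# Example usage (not executed here because it needs network access):
#   record = next(stream_split("validation"))
#   assert missing_fields(record) == []
#   print(record["transcription"], record["speaker"], record["speed"])
```

The split names accepted by `stream_split` are the ones declared in the card metadata: `train`, `test`, and `validation`.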
## Intended Use
* Training radiology ASR models
* Domain adaptation of general ASR models
* Evaluation of speech LLMs
* Development of multimodal medical AI systems
* Research on synthetic speech pipelines in clinical domains
This dataset is intended for **research use**.
## License
This dataset inherits the **CC BY-NC-SA 4.0** license from PARROT v1.0.
Non-commercial use only. Attribution and share-alike required.
License: [https://creativecommons.org/licenses/by-nc-sa/4.0/](https://creativecommons.org/licenses/by-nc-sa/4.0/)
## Related Source Dataset (Attribution)
This dataset is derived from:
**PARROT v1.0: Polyglot Annotated Radiological Reports for Open Testing**
Multilingual fictional radiology reports authored by 76 radiologists from 21 countries.
Repository: [https://github.com/PARROT-reports/PARROT_v1.0](https://github.com/PARROT-reports/PARROT_v1.0)
License: CC BY-NC-SA 4.0
## Citation
### This Dataset
```bibtex
@dataset{parrot_radiology_asr_synthetic_2024,
title={PARROT Radiology ASR Dataset (Synthetic Speech)},
author={ysdede},
year={2024},
howpublished={\url{https://huggingface.co/datasets/ysdede/parrot-radiology-asr-en}},
note={Synthetic speech dataset derived from PARROT v1.0}
}
```
### PARROT v1.0
```bibtex
@dataset{parrot_v1_2025,
title={PARROT v1.0: Polyglot Annotated Radiological Reports for Open Testing},
author={Le Guellec, Bastien and Bressem, Keno and others},
year={2025},
howpublished={\url{https://github.com/PARROT-reports/PARROT_v1.0}},
note={Multilingual fictional radiology reports authored by 76 radiologists}
}
```
## Acknowledgments
Thanks to the PARROT v1.0 consortium and contributing radiologists.
Provider:
ysdede



