ysdede/parrot-radiology-asr-en

Name: ysdede/parrot-radiology-asr-en
Creator: ysdede
Published: 2025-12-10 20:37:28
License: no description provided

Source: Hugging Face. Updated 2025-12-10; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/ysdede/parrot-radiology-asr-en


Description:

---
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcription
    dtype: string
  - name: speaker
    dtype: string
  - name: gender
    dtype: string
  - name: speed
    dtype: float32
  - name: volume
    dtype: float32
  - name: sample_rate
    dtype: int32
  splits:
  - name: test
    num_bytes: 95068084
    num_examples: 948
  - name: train
    num_bytes: 758274271
    num_examples: 7587
  - name: validation
    num_bytes: 94576592
    num_examples: 949
  download_size: 942590176
  dataset_size: 947918947
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
task_categories:
- automatic-speech-recognition
language:
- en
tags:
- medical
license: cc
---

# PARROT Radiology ASR Dataset (Synthetic Speech)

## Dataset Description

This dataset contains synthetic English radiology speech paired with transcriptions. It is designed for training and evaluating radiology-focused Automatic Speech Recognition (ASR) models, speech LLMs, and multimodal medical AI systems.

All audio is generated from the **PARROT v1.0** radiology report corpus, a multilingual collection of fictional reports authored by expert radiologists from 21 countries.

## Dataset Summary

* **Language**: English (translated from 14 languages)
* **Domain**: Medical radiology
* **Task**: Automatic Speech Recognition
* **Audio Duration**: ~55 hours
* **Samples**: 9,484
* **Audio Format**: MP3 VBR q5, 16 kHz mono
* **Speech Generation**: Kokoro TTS 82M v0.1.0
* **File Format**: Parquet

## Splits

| Split      | Samples | Duration (h) | Avg Length (s) |
| ---------- | ------- | ------------ | -------------- |
| Train      | 7,587   | 43.91        | 20.83          |
| Test       | 948     | 5.52         | 20.96          |
| Validation | 949     | 5.49         | 20.82          |

## Dataset Creation

### Text Processing

* Extracted the English translations from the PARROT v1.0 JSONL files.
* Cleaned, normalized, and standardized radiology terminology and structural markers.
* Prepared two text forms per report using Gemini 2.0 Flash Thinking:
  * standardized written text
  * spoken-style, TTS-ready script

### Speech Synthesis

* Generated audio using Kokoro TTS v0.1.0.
* Assigned multiple synthetic speakers across reports.
* Randomized speed and volume for variability.
* Produced continuous WAV files, then chunked them into segments under 30 seconds at natural boundaries.

### ASR Alignment

* Matched the vocabulary to NVIDIA NeMo Parakeet TDT v2.
* Applied normalization rules and markup conversions.
* Verified full compatibility across all 9,484 samples.

### Packaging

* Converted WAV to MP3 VBR q5.
* Created a Hugging Face dataset with the `Audio` feature type.
* Metadata includes speaker, gender, speed, volume, and transcription.
* Splits follow an 80/10/10 ratio with seed 42.

## Dataset Structure

Each record contains:

* **audio**: 16 kHz mono MP3
* **transcription**: text transcription (string)
* **speaker**: synthetic voice ID (string)
* **gender**: string
* **speed**: speech rate multiplier (float32)
* **volume**: float32
* **sample_rate**: int32

## Intended Use

* Training radiology ASR models
* Domain adaptation of general ASR models
* Evaluation of speech LLMs
* Development of multimodal medical AI systems
* Research on synthetic speech pipelines in clinical domains

This dataset is intended for **research use**.

## License

This dataset inherits the **CC BY-NC-SA 4.0** license from PARROT v1.0. Non-commercial use only. Attribution and share-alike required.

License: [https://creativecommons.org/licenses/by-nc-sa/4.0/](https://creativecommons.org/licenses/by-nc-sa/4.0/)

## Related Source Dataset (Attribution)

This dataset is derived from:

**PARROT v1.0: Polyglot Annotated Radiological Reports for Open Testing**
Multilingual fictional radiology reports authored by 76 radiologists from 21 countries.

Repository: [https://github.com/PARROT-reports/PARROT_v1.0](https://github.com/PARROT-reports/PARROT_v1.0)
License: CC BY-NC-SA 4.0

## Citation

### This Dataset

```bibtex
@dataset{parrot_radiology_asr_synthetic_2024,
  title={PARROT Radiology ASR Dataset (Synthetic Speech)},
  author={ysdede},
  year={2024},
  howpublished={\url{https://huggingface.co/datasets/ysdede/parrot-radiology-asr-en}},
  note={Synthetic speech dataset derived from PARROT v1.0}
}
```

### PARROT v1.0

```bibtex
@dataset{parrot_v1_2025,
  title={PARROT v1.0: Polyglot Annotated Radiological Reports for Open Testing},
  author={Le Guellec, Bastien and Bressem, Keno and others},
  year={2025},
  howpublished={\url{https://github.com/PARROT-reports/PARROT_v1.0}},
  note={Multilingual fictional radiology reports authored by 76 radiologists}
}
```

## Acknowledgments

Thanks to the PARROT v1.0 consortium and contributing radiologists.
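The split figures reported in this card can be cross-checked with a few lines of arithmetic. The sketch below is not part of the original card: it copies the sample counts, durations, and byte sizes from the card's metadata and split table, then derives the totals, the per-split share, and the average clip length. The helper names (`total`, `avg_length_s`, `split_fraction`, `load_split`) are hypothetical. `load_split` assumes the Hugging Face `datasets` library is installed and that the repository ID is reachable; it is imported lazily so the arithmetic runs without it.

```python
# Figures copied from the dataset card (split table and dataset_info metadata).
SPLITS = {
    "train": {"samples": 7587, "hours": 43.91, "num_bytes": 758274271},
    "test": {"samples": 948, "hours": 5.52, "num_bytes": 95068084},
    "validation": {"samples": 949, "hours": 5.49, "num_bytes": 94576592},
}

# Repository ID as given in the card's citation URL.
REPO_ID = "ysdede/parrot-radiology-asr-en"


def total(field):
    """Sum one field (e.g. 'samples') over all splits."""
    return sum(split[field] for split in SPLITS.values())


def avg_length_s(name):
    """Average clip length in seconds, derived from hours and sample count."""
    split = SPLITS[name]
    return split["hours"] * 3600 / split["samples"]


def split_fraction(name):
    """Share of all samples that falls into the named split."""
    return SPLITS[name]["samples"] / total("samples")


def load_split(name):
    """Load one split from the Hub (assumes `datasets` is installed)."""
    from datasets import load_dataset  # lazy import: optional dependency

    return load_dataset(REPO_ID, split=name)


if __name__ == "__main__":
    for name in SPLITS:
        print(
            f"{name:<10} share={split_fraction(name):.4f} "
            f"avg_length={avg_length_s(name):.2f}s"
        )
    print("total samples:", total("samples"))
    print("total hours:  ", round(total("hours"), 2))
```

Because the durations in the table are rounded to two decimals, the derived average lengths can differ from the listed values in the last digit. Calling `load_split("validation")` would download the validation split, so it is left out of the default run.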

Provider:

ysdede
