ekacare/eka-medical-asr-evaluation-dataset

Name: ekacare/eka-medical-asr-evaluation-dataset
Creator: ekacare
Published: 2025-07-30 13:21:04
License: No description available

Source: Hugging Face. Updated: 2025-07-30. Indexed: 2026-02-07.

Download link:

https://hf-mirror.com/datasets/ekacare/eka-medical-asr-evaluation-dataset

Resource description:

---
license: mit
dataset_info:
- config_name: en
  features:
  - name: md5_text
    dtype: string
  - name: file_name
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: md5_audio
    dtype: string
  - name: duration
    dtype: float32
  - name: text
    dtype: string
  - name: audio_language
    dtype: string
  - name: text_language
    dtype: string
  - name: session_id
    dtype: string
  - name: speaker
    dtype: string
  - name: type_concept
    dtype: string
  - name: recording_context
    dtype: string
  - name: medical_entities
    dtype: string
  splits:
  - name: test
    num_bytes: 1809500000
    num_examples: 3619
  download_size: 1538075000
  dataset_size: 1809500000
- config_name: hi
  features:
  - name: md5_text
    dtype: string
  - name: file_name
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: md5_audio
    dtype: string
  - name: duration
    dtype: float32
  - name: text
    dtype: string
  - name: audio_language
    dtype: string
  - name: text_language
    dtype: string
  - name: session_id
    dtype: string
  - name: speaker
    dtype: string
  - name: type_concept
    dtype: string
  - name: recording_context
    dtype: string
  - name: medical_entities
    dtype: string
  splits:
  - name: test
    num_bytes: 160000000
    num_examples: 320
  download_size: 136000000
  dataset_size: 160000000
configs:
- config_name: en
  default_preview_rows: 100 # Add this
  default: true
  data_files:
  - split: test
    path: en/test-*
- config_name: hi
  default_preview_rows: 100 # Add this
  data_files:
  - split: test
    path: hi/test-*
task_categories:
- automatic-speech-recognition
- text-to-speech
language:
- en
tags:
- dataset
- audio
- speech
- asr
pretty_name: Eka Medical ASR Evaluation Dataset
size_categories:
- 1K<n<10K
---

# Eka Medical ASR Evaluation Dataset

## Dataset Overview and Sourcing

The Eka Medical ASR Evaluation Dataset enables comprehensive evaluation of automatic speech recognition systems designed to transcribe medical speech into accurate text, a fundamental component of any medical scribe system.
This dataset captures the unique challenges of processing medical terminology, particularly branded drug names, which are specific to the Indian context. The dataset comprises more than 3,900 curated audio recordings featuring medical terminology delivered in various speaking styles, including isolated medical entities, narrated medical sentences, and impromptu medical conversations. It includes approximately 3,600 English recordings and 320 Hindi recordings. We intend to keep improving and growing this dataset for different languages and scenarios. For details, read this [blog](https://info.eka.care/services/advancing-healthcare-ai-evaluation-in-india-ekacare-releases-four-evaluation-datasets).

## Data Collection and Quality Assurance

All audio recordings capture natural speech patterns, ensuring realistic evaluation scenarios. A significant portion of the dataset originates from EkaCare's internal team members through narrated medical text sessions and recorded EkaScribe demonstration sessions with our internal medical professionals. Additional high-quality content was sourced from speakers across four different medical colleges, providing diverse regional accents and speaking styles representative of India's medical education landscape.

## Target Applications

This dataset is valuable for developers building and evaluating voice-enabled healthcare applications and medical documentation systems that rely on speech-to-text functionality. Healthcare institutions implementing AI-powered scribe solutions can use this dataset to evaluate system performance across diverse Indian linguistic contexts.
## Usage

Load a specific subset and split, or a whole subset:

```python
from datasets import load_dataset

# Load a specific subset and split
dataset = load_dataset('ekacare/eka-medical-asr-evaluation-dataset', 'en', split='test')

# Load all splits from a subset
dataset = load_dataset('ekacare/eka-medical-asr-evaluation-dataset', 'en')

# Omit the subset name to load the default subset
# ('en' is marked `default: true` in the card metadata)
dataset = load_dataset('ekacare/eka-medical-asr-evaluation-dataset')
```

## Performance Benchmarks (English only)

The following table shows the performance of various ASR models evaluated on the Eka Medical ASR Evaluation Dataset. All metrics are reported on the test set; lower is better.

| Models | WER | CER | semWER | kwWER |
|--------|-----|-----|--------|-------|
| AWS transcribe | 0.183 | 0.074 | 0.111 | 0.122 |
| GPT-4o | 0.161 | 0.097 | 0.116 | 0.117 |
| Gemini 2.0 Flash | 0.175 | 0.082 | 0.105 | 0.101 |
| Gemini 2.5 Flash | 0.148 | 0.055 | 0.072 | 0.068 |
| Eleven Labs (Scribe V1) | 0.186 | 0.087 | 0.102 | 0.093 |
| Whisper V3 large | 0.157 | 0.056 | 0.089 | 0.085 |
| Bhashini ASR | 0.199 | 0.093 | 0.123 | 0.114 |
| Parrotlet-a-en-5b | **0.109** | **0.047** | **0.072** | **0.062** |

[Parrotlet-a-en-5b](https://huggingface.co/ekacare/parrotlet-a-en-5b) is an open-weight model released by EkaCare for English-language ASR in the medical domain.

### Evaluation Metrics

- **WER (Word Error Rate)**: Measures the proportion of words that are incorrectly transcribed
- **CER (Character Error Rate)**: Measures the proportion of characters that are incorrectly transcribed
- **semWER (Semantic Word Error Rate)**: Evaluates transcription accuracy considering semantic/phonetic equivalences
- **kwWER (Keyword Word Error Rate)**: Focuses on the accuracy of medical keywords and terminology

For more details on semWER and its importance in medical ASR evaluation:

- **Description**: [Beyond Traditional WER: The Critical Need for Semantic WER in ASR for Indian Healthcare](http://info.eka.care/services/beyond-traditional-wer-the-critical-need-for-semantic-wer-in-asr-for-indian-healthcare)
- **Implementation**: [ASR Semantic Metrics - KARMA OpenMedEvalKit](https://github.com/eka-care/KARMA-OpenMedEvalKit/blob/main/karma/metrics/asr/asr_semantic_metrics.py)

## Dataset Structure

### Subsets

This dataset includes the following subsets:

- **en**: 3,619 samples
  - test: 3,619 samples
- **hi**: 320 samples
  - test: 320 samples

### Data Fields

The dataset includes the following columns:

- **md5_text**: String. MD5 hash of the ground-truth text
- **file_name**: String. Filename of the audio file
- **audio**: Audio data (16 kHz sampling rate)
- **md5_audio**: String. MD5 hash of the audio file
- **duration**: Float32. Duration of the audio
- **text**: String. Ground-truth text
- **audio_language**: String. Language of the speech
- **text_language**: String. Language of the text (may differ from the speech language in a translation task)
- **session_id**: String. Identifier of the session
- **speaker**: String. Speaker ID
- **type_concept**: String. Type of medical concept
- **recording_context**: String. Context of the recording (single entity narration, sentence narration, or conversation)
- **medical_entities**: String. Offsets of medical entities along with their type

## Technical Details

- **Total samples**: 3,939
- **Shard length**: 500
- **Number of subsets**: 2
- **Number of splits**: 2
- **Audio format**: 16 kHz sampling rate

## Contributors / Annotators list

- Dr Anushree Rana
- Dr Rajshree Badami
- Dr Arun Kumar R
- Neha Ramesh Badge
- Dr Kashika Singh
- Dr Rishi Srivathsav
- Dr Arun Kumar
- Dr Sanjana SN

## License

This dataset is released under the MIT License, enabling broad use while maintaining attribution requirements.
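## Example: Computing WER and CER

The WER and CER definitions under Evaluation Metrics can be illustrated with a minimal sketch. This is not the code used to produce the benchmark table: it assumes corpus-level scoring (total edits divided by total reference length) and a simple normalization (lowercasing and whitespace collapsing), and the example sentences are invented for illustration. semWER and kwWER are not covered here; see the KARMA implementation linked in the card for those.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level WER: total word edits / total reference words."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = normalize(ref).split(), normalize(hyp).split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars, hyp_chars = normalize(ref), normalize(hyp)
        errors += edit_distance(ref_chars, hyp_chars)
        total += len(ref_chars)
    return errors / total


# Invented reference/hypothesis pairs, not taken from the dataset.
references = ["Patient was prescribed Dolo 650 twice daily",
              "Start metformin 500 mg"]
hypotheses = ["patient was prescribed dollo 650 twice daily",
              "start metformin 500"]

print(f"WER: {wer(references, hypotheses):.3f}")
print(f"CER: {cer(references, hypotheses):.3f}")
```

In practice, the `text` column of a loaded subset would supply the references and an ASR system's output the hypotheses. Scores depend heavily on the normalization applied, so numbers from this sketch are not directly comparable to the benchmark table.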

Provider:

ekacare
