SONIVA database: Speech recognition validation in aphasia
Source: DataCite Commons. Updated 2026-03-27; indexed 2026-05-03.
Download link:
https://helix.imperial.ac.uk/doi/10.82186/bg9km-knm44
Resource description:
Update: File access for this version has been restricted. Please access the latest version of the dataset to download the files: https://doi.org/10.82186/ra9dv-4az59
SONIVA (Speech recOgNItion Validation in Aphasia) is a large-scale English speech corpus designed to support research in post-stroke communication disorders and clinical automatic speech recognition (ASR). The corpus contains transcripts from stroke survivors and non-stroke control participants, collected as part of two UK studies: the Imperial Comprehensive Cognitive Assessment in Cerebrovascular Disease (IC3) and Predicting Language Outcome and Recovery after Stroke (PLORAS). All data collection protocols received ethical approval from the UK Health Research Authority (IC3: IRAS 299333; PLORAS: IRAS 133939, REC reference 13/LO/1515).
The SONIVA database includes transcriptions of both long-form and short-form speech tasks. Long-form speech consists of two picture description tasks: the Comprehensive Aphasia Test (CAT) picture description and an additional in-house beach scene description used in the IC3 study. Short-form speech comprises three single-word tasks from IC3 (Naming, Reading, and Repetition), which assess semantic access, orthographic-to-phonological mapping, and auditory phonological processing, respectively. This version contains only the long-form transcripts.
Clinical and demographic metadata are also provided; the available fields vary depending on the database source (IC3 or PLORAS). These include:
• Demographics: age, sex, handedness, accent, years of education, and estimated speech and language therapy hours.
• Language and cognitive assessments: Comprehensive Aphasia Test (CAT) scores (including naming, comprehension, and syntactic metrics) and related linguistic measures.
• Lesion metrics: brain lesion volumes segmented by region (e.g., frontal, parietal, and subcortical) extracted from neuroimaging.
• Stroke and vascular information: stroke etiology, lesion site, vascular territories, vessel occlusion, and carotid or vascular imaging modality.
• Clinical scores: NIHSS (11 subcategories), MoCA, mRS, GRIP, and STAR scores collected across multiple follow-up sessions.
• Comorbidities and cardiovascular risk factors: hypertension, high cholesterol, diabetes, atrial fibrillation, previous stroke, smoking history, and other vascular or cognitive risk factors.
The SONIVA 2026 Zip folder provided in this repository contains the transcript files (in CHAT format) and the associated metadata. These files contain non-identifiable and non-sensitive data, carefully curated to remove any personally identifiable information, and are made available for direct download without restriction for research purposes.
Additionally, SONIVA contains matched audio recordings from which the transcriptions are derived. However, because speech audio data are sensitive, these files are not suitable for public sharing and are available upon request only. Imperial will review individual data access requests and will execute appropriate institutional data sharing agreements depending on the legal status and location of the data requestor.
The following conditions must be met:
i) The academic requester should have appropriate ethical clearance according to their local regulations,
ii) The data must be stored in a secure academic research environment,
iii) There should be no attempt to re-identify participants,
iv) In the unlikely event that a participant is identified through the audio recordings, the data controller must be informed.
Access will be granted via a secure institutional cloud link. This process ensures compliance with UK GDPR and the ethical standards governing sensitive biometric data.
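The transcript files mentioned above are in CHAT format, in which header lines begin with "@", main speaker tiers begin with "*" followed by a speaker code, dependent tiers begin with "%", and continuation lines begin with a tab. As a rough illustration only, the following Python sketch groups main-tier utterances by speaker. The function name parse_chat and the sample text are invented for this example and are not taken from the SONIVA files; the speaker codes and tiers in the actual corpus may differ.

```python
from collections import defaultdict


def parse_chat(text):
    """Group main-tier utterances by speaker code from CHAT-formatted text.

    Hypothetical helper for illustration; not part of the SONIVA release.
    """
    utterances = defaultdict(list)
    current = None  # speaker whose main tier may continue on the next line
    for line in text.splitlines():
        if line.startswith("*"):
            # Main tier, e.g. "*PAR:\tutterance text"; split on the first colon only
            speaker, _, utterance = line[1:].partition(":")
            current = speaker.strip()
            utterances[current].append(utterance.strip())
        elif line.startswith("\t") and current is not None:
            # Tab-indented continuation of the preceding main tier
            utterances[current][-1] += " " + line.strip()
        else:
            # Headers (@), dependent tiers (%) and their continuations are skipped here
            current = None
    return dict(utterances)


# Invented sample text (assumed speaker codes PAR and INV), not real corpus data
sample = (
    "@Begin\n"
    "*PAR:\tthe man is &-um sleeping .\n"
    "%com:\tlong pause\n"
    "*INV:\tanything else ?\n"
    "*PAR:\tthe cat is\n"
    "\ton the shelf .\n"
    "@End\n"
)
result = parse_chat(sample)
```

This sketch deliberately ignores headers and dependent tiers; for full CHAT support, a dedicated parser that follows the CHAT manual would be a safer choice.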
Provider:
Imperial College London
Created:
2026-03-24



