SUPERSEDED - The Edinburgh International Accents of English Corpus

Name: SUPERSEDED - The Edinburgh International Accents of English Corpus
Creator: University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research
Published: 2025-04-11 13:05:07
License: No description provided

DataCite Commons | Updated 2025-04-11 | Indexed 2025-04-17

Download link:

https://datashare.ed.ac.uk/handle/10283/8911

Resource description:

## This item has been replaced by the one which can be found at [https://doi.org/10.7488/ds/7914] ##

English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets that fail to represent the diversity of English as spoken around the globe today. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first- and second-language varieties of English and a linguistic background profile of each speaker. Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best-performing model, trained on 680 thousand hours of transcribed data, obtains an average word error rate (WER) of 19.7%, in contrast to the 2.7% WER it obtains when evaluated on clean read speech in US English. Across all models, we observe a drop in performance on Jamaican, Indonesian, Nigerian, and Kenyan English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website under CC-BY-SA.
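
For readers unfamiliar with the WER figures quoted above, the metric can be sketched as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function below is an illustrative assumption, not the corpus's released evaluation script: the name `word_error_rate` is hypothetical, and the sketch splits on whitespace without the text normalization that ASR evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.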

Provider:

University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research

Created:

2024-12-03