
red_ace_asr_error_detection_and_correction

ModelScope community | Updated 2025-12-05 | Indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/red_ace_asr_error_detection_and_correction
# **RED-ACE**

## Dataset Summary

This dataset can be used to train and evaluate **ASR Error Detection or Correction** models. It was introduced in the [RED-ACE paper (Gekhman et al., 2022)](https://aclanthology.org/2022.emnlp-main.180.pdf). The dataset contains ASR outputs on the LibriSpeech corpus [(Panayotov et al., 2015)](https://ieeexplore.ieee.org/document/7178964) with annotated transcription errors.

## Dataset Details

The LibriSpeech corpus was decoded using the [Google Cloud Speech-to-Text API](https://cloud.google.com/speech-to-text), with the **default** and **video** [models](https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model). [Word-level confidence](https://cloud.google.com/speech-to-text/docs/word-confidence#word-level_confidence) was enabled and is provided as part of the transcription hypothesis.

To annotate word-level errors (for the error detection task), the hypothesis words were aligned with the reference (correct) transcription to find an edit path (insertions, deletions and substitutions) with the minimum edit distance (from the hypothesis to the reference). The hypothesis words with deletions and substitutions were then labeled as ERROR (1); the rest were labeled as NOTERROR (0). Because labels are attached to hypothesis words, a reference word that is missing from the hypothesis (an insertion on this edit path) has no token of its own to label.

## Data format

The dataset has train, development and test splits, which correspond to the splits in LibriSpeech. The data consists of JSON lines with the following keys (note that `asr_hypothesis[i]`, `confidence_scores[i]` and `error_labels[i]` correspond to the same word):

- `"id"` - The LibriSpeech id.
- `"truth"` - The reference (correct) transcript from LibriSpeech.
- `"asr_model"` - The ASR [model](https://cloud.google.com/speech-to-text/docs/speech-to-text-requests#select-model) used for transcription.
- `"librispeech_pool"` - Corresponds to the original pool (split) in the LibriSpeech data.
- `"asr_hypothesis"` - The transcription hypothesis.
- `"confidence_scores"` - The [word-level confidence scores](https://cloud.google.com/speech-to-text/docs/word-confidence#word-level_confidence) provided as part of the transcription hypothesis.
- `"error_labels"` - The error labels (1 error, 0 not error) that were obtained by aligning the hypothesis and the reference.

Here is an example of a single data item:

```json
{
  "id": "test-other/6070/86744/6070-86744-0024",
  "truth": "my dear franz replied albert when upon receipt of my letter you found the necessity of asking the count's assistance you promptly went to him saying my friend albert de morcerf is in danger help me to deliver him",
  "asr_model": "default",
  "librispeech_pool": "other",
  "asr_hypothesis": ["my", "dear", "friends", "replied", "Albert", "received", "my", "letter", "you", "found", "the", "necessity", "of", "asking", "the", "county", "assistance", "you", "promptly", "went", "to", "him", "saying", "my", "friend", "all", "but", "the", "most", "stuff", "is", "in", "danger", "help", "me", "to", "deliver", "it"],
  "confidence_scores": ["0.9876290559768677", "0.9875272512435913", "0.6921446323394775", "0.9613730311393738", "0.9413103461265564", "0.6563355922698975", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "1.0", "1.0", "1.0", "1.0", "1.0", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.5291957855224609", "0.5291957855224609"],
  "error_labels": ["0", "0", "1", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "1", "1", "1", "1", "0", "0", "0", "0", "0", "0", "0", "1"]
}
```

## Loading the dataset

The following code loads the dataset and locates the example data item from above:

```python
from datasets import load_dataset

# Load the test split from the Hugging Face Hub.
red_ace_data = load_dataset("google/red_ace_asr_error_detection_and_correction", split='test')

# Scan the split until the example item shown above is found.
for example in red_ace_data:
    if example['id'] == 'test-other/6070/86744/6070-86744-0024':
        break

print(example)
```

## Citation

If you use this dataset for a research publication, please cite the **RED-ACE paper** (using the BibTeX entry below), as well as the **LibriSpeech paper** mentioned above.

```
@inproceedings{gekhman-etal-2022-red,
    title = "{RED}-{ACE}: Robust Error Detection for {ASR} using Confidence Embeddings",
    author = "Gekhman, Zorik and Zverinski, Dina and Mallinson, Jonathan and Beryozkin, Genady",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.180",
    doi = "10.18653/v1/2022.emnlp-main.180",
    pages = "2800--2808",
    abstract = "ASR Error Detection (AED) models aim to post-process the output of Automatic Speech Recognition (ASR) systems, in order to detect transcription errors. Modern approaches usually use text-based input, comprised solely of the ASR transcription hypothesis, disregarding additional signals from the ASR model. Instead, we utilize the ASR system{'}s word-level confidence scores for improving AED performance. Specifically, we add an ASR Confidence Embedding (ACE) layer to the AED model{'}s encoder, allowing us to jointly encode the confidence scores and the transcribed text into a contextualized representation. Our experiments show the benefits of ASR confidence scores for AED, their complementary effect over the textual signal, as well as the effectiveness and robustness of ACE for combining these signals. To foster further research, we publish a novel AED dataset consisting of ASR outputs on the LibriSpeech corpus with annotated transcription errors.",
}
```
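
## Sketch: labeling hypothesis words by alignment

The labeling procedure described under Dataset Details (a minimum-edit-distance alignment from the hypothesis to the reference) can be illustrated with a small word-level Levenshtein alignment. This is a minimal sketch, not the authors' annotation code. The function name `label_hypothesis_errors` is hypothetical, and the case-insensitive comparison is an assumption inferred from the example item, where the hypothesis word "Albert" is labeled 0 against the reference word "albert". Minimum-cost alignments are not always unique, so on some inputs a different tie-breaking order could yield different labels than the ones shipped with the dataset.

```python
def label_hypothesis_errors(hypothesis, reference):
    """Return one label per hypothesis word: 1 (ERROR) or 0 (NOTERROR)."""
    n, m = len(hypothesis), len(reference)

    def same(i, j):
        # Assumption: words are compared case-insensitively.
        return hypothesis[i].lower() == reference[j].lower()

    # dist[i][j] = edit distance between hypothesis[:i] and reference[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + (0 if same(i - 1, j - 1) else 1),  # match or substitution
                dist[i - 1][j] + 1,  # deletion of a hypothesis word
                dist[i][j - 1] + 1,  # insertion of a reference word
            )

    # Walk back along one minimum-cost edit path and label hypothesis words.
    labels = [0] * n
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if same(i - 1, j - 1) else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                labels[i - 1] = cost  # substitution -> 1, match -> 0
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            labels[i - 1] = 1  # deleted hypothesis word -> ERROR
            i -= 1
            continue
        j -= 1  # inserted reference word: no hypothesis token to label
    return labels


# Fragments taken from the example item above.
hyp = "my friend all but is in danger".split()
ref = "my friend albert is in danger".split()
print(label_hypothesis_errors(hyp, ref))
```

In this fragment one of "all" and "but" is a substitution for "albert" and the other is a deletion, so both are labeled 1 regardless of which alignment is chosen.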

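
## Sketch: parsing a record and thresholding confidence

In the example item, `confidence_scores` and `error_labels` are serialized as strings, so a consumer would typically convert them before use. The sketch below parses one JSON line into `(word, confidence, label)` triples and scores a naive detector that flags every word whose confidence falls below a threshold. It is only an illustration of how the three parallel lists fit together, not the RED-ACE model. The helper names, the threshold values and the truncated sample record (the first six words of the example item) are assumptions made for this example.

```python
import json


def parse_record(line):
    """Parse one JSON line into a list of (word, confidence, label) triples."""
    record = json.loads(line)
    words = record["asr_hypothesis"]
    scores = [float(s) for s in record["confidence_scores"]]
    labels = [int(label) for label in record["error_labels"]]
    if not (len(words) == len(scores) == len(labels)):
        raise ValueError("asr_hypothesis, confidence_scores and error_labels must align")
    return list(zip(words, scores, labels))


def threshold_detector_metrics(triples, threshold):
    """Flag words with confidence below `threshold`; return (precision, recall)."""
    tp = fp = fn = 0
    for _, score, label in triples:
        predicted = 1 if score < threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Truncated excerpt of the example item (first six words only), for illustration.
SAMPLE_LINE = json.dumps({
    "id": "test-other/6070/86744/6070-86744-0024",
    "asr_hypothesis": ["my", "dear", "friends", "replied", "Albert", "received"],
    "confidence_scores": ["0.9876290559768677", "0.9875272512435913", "0.6921446323394775",
                          "0.9613730311393738", "0.9413103461265564", "0.6563355922698975"],
    "error_labels": ["0", "0", "1", "0", "0", "1"],
})

triples = parse_record(SAMPLE_LINE)
for threshold in (0.9, 0.95):  # hypothetical thresholds
    print(threshold, threshold_detector_metrics(triples, threshold))
```

Confidence alone is an imperfect signal: in the full example item, "county" is labeled as an error even though its confidence score is 0.9876290559768677.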
Provider: maas
Created: 2025-04-21