google/red_ace_asr_error_detection_and_correction
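RED-ACE Dataset Overview
Introduction
The RED-ACE dataset is intended for training and evaluating ASR error detection or correction models. It is built on the LibriSpeech corpus and contains ASR transcriptions with annotated errors.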
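Dataset Details
- Transcription method: The LibriSpeech corpus was decoded with the default and video models of the Google Cloud Speech-to-Text API.
- Error annotation: Each hypothesis transcript is aligned with its reference transcript by minimum edit distance, and the resulting insertion, deletion and substitution errors are annotated.
- Data format: The data is split into training, development and test sets and stored in JSON Lines format.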
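The error annotation step can be illustrated with a word-level edit-distance alignment. The following is a minimal sketch, not the dataset authors' actual labeling code: the function name `label_errors` is hypothetical, the comparison is assumed to be case-insensitive (the sample record labels "Albert" as correct against "albert"), and a reference word that was deleted is assumed to leave no label, because there is no hypothesis token to attach it to.

```python
def label_errors(reference, hypothesis):
    """Label each hypothesis word 1 (error) or 0 (correct).

    Assumption: substituted and inserted hypothesis words get label 1;
    deleted reference words produce no label of their own.
    """
    ref = [w.lower() for w in reference]
    hyp = [w.lower() for w in hypothesis]
    n, m = len(ref), len(hyp)

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion (reference word dropped)
                dist[i][j - 1] + 1,         # insertion (extra hypothesis word)
            )

    # Walk back through the table to recover one optimal alignment.
    labels = [0] * m
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                labels[j - 1] = cost
                i, j = i - 1, j - 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            labels[j - 1] = 1  # inserted word
            j -= 1
        else:
            i -= 1  # deleted reference word, nothing to label
    return labels
```

When several alignments share the same minimum distance, the labels depend on how ties are broken, so this sketch may not reproduce the released labels exactly.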
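Data Format
Each record contains the following keys:
- "id": The LibriSpeech ID.
- "truth": The LibriSpeech reference (correct) transcript.
- "asr_model": The ASR model used for transcription.
- "librispeech_pool": The original pool (split) in the LibriSpeech data.
- "asr_hypothesis": The transcription hypothesis.
- "confidence_scores": The word-level confidence scores for the transcription hypothesis.
- "error_labels": The error labels (1 for an error, 0 for no error).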
Example Data
```json
{
  "id": "test-other/6070/86744/6070-86744-0024",
  "truth": "my dear franz replied albert when upon receipt of my letter you found the necessity of asking the counts assistance you promptly went to him saying my friend albert de morcerf is in danger help me to deliver him",
  "asr_model": "default",
  "librispeech_pool": "other",
  "asr_hypothesis": ["my", "dear", "friends", "replied", "Albert", "received", "my", "letter", "you", "found", "the", "necessity", "of", "asking", "the", "county", "assistance", "you", "promptly", "went", "to", "him", "saying", "my", "friend", "all", "but", "the", "most", "stuff", "is", "in", "danger", "help", "me", "to", "deliver", "it"],
  "confidence_scores": ["0.9876290559768677", "0.9875272512435913", "0.6921446323394775", "0.9613730311393738", "0.9413103461265564", "0.6563355922698975", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "1.0", "1.0", "1.0", "1.0", "1.0", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.9876290559768677", "0.5291957855224609", "0.5291957855224609"],
  "error_labels": ["0", "0", "1", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "1", "1", "1", "1", "0", "0", "0", "0", "0", "0", "0", "1"]
}
```
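In the sample record, the confidence scores and error labels are stored as strings, so they usually need converting before use. The following is a minimal sketch of reading the JSON Lines files with Python's standard `json` module. The helper names `parse_record`, `read_jsonl` and `flagged_words` are hypothetical, and the file path passed to `read_jsonl` is left to the caller, since the source does not give file names.

```python
import json


def parse_record(line):
    """Parse one JSON line and convert string-encoded fields to numbers."""
    rec = json.loads(line)
    rec["confidence_scores"] = [float(s) for s in rec["confidence_scores"]]
    rec["error_labels"] = [int(x) for x in rec["error_labels"]]
    # The three per-word lists are expected to be aligned one-to-one.
    n = len(rec["asr_hypothesis"])
    if len(rec["confidence_scores"]) != n or len(rec["error_labels"]) != n:
        raise ValueError(f"misaligned word-level fields in record {rec.get('id')}")
    return rec


def read_jsonl(path):
    """Yield parsed records from a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield parse_record(line)


def flagged_words(rec):
    """Return the hypothesis words labeled as errors."""
    return [w for w, lab in zip(rec["asr_hypothesis"], rec["error_labels"]) if lab == 1]
```

Applied to the sample record above, `flagged_words` would return the eight words whose label is "1", such as "friends", "received" and "county".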




