bene-ges/wiki-en-asr-adapt

Name: bene-ges/wiki-en-asr-adapt
Creator: bene-ges
Published: 2023-12-14 10:59:19
License: cc-by-sa-4.0

Source: Hugging Face. Updated 2023-12-14; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/bene-ges/wiki-en-asr-adapt

Resource description:

---
license: cc-by-sa-4.0
language:
- en
size_categories:
- 10M<n<100M
---

This is the dataset presented in my [ASRU-2023 paper](https://arxiv.org/abs/2309.17267). It consists of multiple files:

- Keys2Paragraphs.txt (internal name in scripts: yago_wiki.txt): 4.3 million unique words/phrases (English Wikipedia titles or their parts) occurring in 33.8 million English Wikipedia paragraphs.
- Keys2Corruptions.txt (internal name in scripts: sub_misspells.txt): 26 million phrase pairs in the corrupted phrase inventory, as recognized by different ASR models.
- Keys2Related.txt (internal name in scripts: related_phrases.txt): 62.7 million phrase pairs in the related phrase inventory.
- FalsePositives.txt (internal name in scripts: false_positives.txt): 449 thousand phrase pairs in the false positive phrase inventory.
- NgramMappings.txt (internal name in scripts: replacement_vocab_filt.txt): 5.5 million character n-gram mappings dictionary.
- asr: outputs of g2p+tts+asr using 4 different ASR systems (conformer ctc was used twice); gives pairs of an initial phrase and its recognition result. Does not include .wav files, but these can be reproduced by feeding g2p to tts.
- giza: raw outputs of GIZA++ alignments for each corpus; from these we get NgramMappings.txt and Keys2Corruptions.txt.

This [example code](https://github.com/bene-ges/nemo_compatible/blob/spellmapper_new_false_positive_sampling/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data_from_wiki_en_asr_adapt.sh) shows how to generate training data from this dataset.
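For readers who want to look at one of these files before running the full pipeline, the following is a minimal sketch, not part of the original card. It assumes the five .txt files sit at the repository root under the names listed above, on the `main` branch, and uncompressed; none of that is confirmed by the card. The helper names `file_url` and `peek` are hypothetical. Only the standard library is used, and the URL follows the usual Hugging Face `resolve` pattern.

```python
from urllib.parse import quote
from urllib.request import urlopen

REPO_ID = "bene-ges/wiki-en-asr-adapt"
# Base URL for dataset repositories; "https://hf-mirror.com/datasets" is the
# mirror linked from this page and is assumed to follow the same layout.
BASE_URL = "https://huggingface.co/datasets"

# Published file name -> internal name used in the author's scripts
# (both columns taken from the dataset card).
INTERNAL_NAMES = {
    "Keys2Paragraphs.txt": "yago_wiki.txt",
    "Keys2Corruptions.txt": "sub_misspells.txt",
    "Keys2Related.txt": "related_phrases.txt",
    "FalsePositives.txt": "false_positives.txt",
    "NgramMappings.txt": "replacement_vocab_filt.txt",
}


def file_url(filename, base_url=BASE_URL, revision="main"):
    """Build a direct download URL for one file (assumed to be at the repo root)."""
    return f"{base_url}/{REPO_ID}/resolve/{revision}/{quote(filename)}"


def peek(filename, n_lines=5, base_url=BASE_URL):
    """Stream only the first n_lines of a file instead of downloading all of it."""
    lines = []
    with urlopen(file_url(filename, base_url=base_url)) as resp:
        for _ in range(n_lines):
            raw = resp.readline()
            if not raw:  # end of file reached early
                break
            lines.append(raw.decode("utf-8", errors="replace").rstrip("\n"))
    return lines
```

Calling `peek("FalsePositives.txt")` would then return the first few lines of the smallest inventory, which is a cheap way to check the actual line format before downloading the larger files. If the files turn out to be archived or stored in subdirectories, only the name passed to `file_url` needs to change.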

Provider:

bene-ges

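Summary of original information

Dataset overview

The dataset contains multiple files, as follows:

Keys2Paragraphs.txt (internal name in scripts: yago_wiki.txt)
- 4.3 million unique words/phrases (English Wikipedia titles or their parts) occurring in 33.8 million English Wikipedia paragraphs.
Keys2Corruptions.txt (internal name in scripts: sub_misspells.txt)
- 26 million phrase pairs in the corrupted phrase inventory, as recognized by different ASR models.
Keys2Related.txt (internal name in scripts: related_phrases.txt)
- 62.7 million phrase pairs in the related phrase inventory.
FalsePositives.txt (internal name in scripts: false_positives.txt)
- 449 thousand phrase pairs in the false positive phrase inventory.
NgramMappings.txt (internal name in scripts: replacement_vocab_filt.txt)
- A dictionary of 5.5 million character n-gram mappings.
asr
- Outputs of g2p+tts+asr using 4 different ASR systems (conformer ctc was used twice), giving pairs of an initial phrase and its recognition result. The .wav files are not included, but they can be reproduced by feeding g2p to tts.
giza
- Raw outputs of GIZA++ alignments for each corpus, from which NgramMappings.txt and Keys2Corruptions.txt are derived.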
