PeacefulData/HyPoradise-pilot

Name: PeacefulData/HyPoradise-pilot
Creator: PeacefulData
Published: 2024-04-29 05:43:29
License: apache-2.0 (per the dataset card metadata)

Hugging Face · Updated 2024-04-29 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/PeacefulData/HyPoradise-pilot

Resource description:

---
license: apache-2.0
dataset_info:
  features:
  - name: hypothesis
    sequence: string
  - name: transcription
    dtype: string
  - name: input1
    dtype: string
  - name: hypothesis_concatenated
    dtype: string
  - name: source
    dtype: string
  - name: id
    dtype: string
  - name: dummy_str
    dtype: string
  - name: dummy_list
    sequence: 'null'
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 469026382
    num_examples: 286366
  - name: test
    num_bytes: 24101011
    num_examples: 18237
  download_size: 125245485
  dataset_size: 493127393
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Dataset Name: Pilot dataset for Multi-domain ASR corrections

<p align="center"> <img src="hypilot.jpg" height="100"> </p>

## Description

This dataset is a pilot version of a larger dataset for automatic speech recognition (ASR) corrections across multiple domains. It contains paired hypotheses and corrected transcriptions for various ASR tasks consolidated from [PeacefulData/HyPoradise-v0](https://huggingface.co/datasets/PeacefulData/HyPoradise-v0).

## Structure

### Data Split

The dataset is divided into training and test splits:

- Training Data: 281,082 entries
  - Approximately 6,255,198 tokens for transcriptions
  - Approximately 31,211,083 tokens for concatenated hypotheses
- Test Data: 16,108 entries
  - Approximately 327,750 tokens for transcriptions
  - Approximately 1,629,093 tokens for concatenated hypotheses

### Columns

- `hypothesis`: N-best hypothesis from beam search.
- `transcription`: Corrected ASR transcription.
- `hypothesis_concatenated`: An alternative version of the text output.
- `source`: The source of the text entry, indicating the origin dataset.
- `prompt`: Instructional prompt for correction task.
- `score`: An acoustic model score (not all entries have this).

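Since each entry pairs an N-best `hypothesis` list with a corrected `transcription`, one natural sanity check is the word error rate (WER) of a hypothesis against the transcription. The sketch below is illustrative only: the helper names `word_edit_distance` and `wer` are hypothetical, the sample record is made up rather than taken from the dataset, and it assumes the first list element is the top-ranked hypothesis, which the card does not state.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcription is empty")
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up record shaped like the columns above (not real dataset content).
record = {
    "hypothesis": ["the cat sat on the mat", "the cat sat on a mat"],
    "transcription": "the cat sat on a mat",
}

# Assumption: index 0 is the top-ranked hypothesis.
top1_wer = wer(record["transcription"], record["hypothesis"][0])
# Best achievable WER if the ideal candidate were picked from the N-best list.
oracle_wer = min(wer(record["transcription"], h) for h in record["hypothesis"])
```

Comparing the top-1 WER with the oracle WER over the N-best list gives a rough sense of how much a correction model could gain from reranking alone.
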
### Source Datasets

The dataset combines entries from various sources:

- **Training Sources**:
  - `train_td3`: 50,000 entries
  - `train_other_500`: 50,000 entries
  - `train_cv`: 47,293 entries
  - `train_lrs2`: 42,940 entries
  - `train_wsj_score`: 37,514 entries
  - `train_swbd`: 36,539 entries
  - `train_chime4`: 9,600 entries
  - `train_atis`: 3,964 entries
  - `train_coraal`: 3,232 entries
- **Test Sources**:
  - `test_ls_other`: 2,939 entries
  - `test_ls_clean`: 2,620 entries
  - `test_lrs2`: 2,259 entries
  - `test_swbd`: 2,000 entries
  - `test_cv`: 2,000 entries
  - `test_chime4`: 1,320 entries
  - `test_td3`: 1,155 entries
  - `test_wsj_score`: 836 entries
  - `test_atis`: 809 entries
  - `test_coraal`: 170 entries

## Access

The dataset can be accessed and downloaded through the HuggingFace Datasets library. Use the following command to load the dataset:

```python
from datasets import load_dataset

dataset = load_dataset("PeacefulData/HyPoradise-pilot")
```

## Acknowledgments

This dataset is consolidated from the PeacefulData/HyPoradise-v0 dataset. Thanks to the original creators for making this data available.

### References

```bib
@inproceedings{yang2023generative,
  title={Generative speech recognition error correction with large language models and task-activating prompting},
  author={Yang, Chao-Han Huck and Gu, Yile and Liu, Yi-Chieh and Ghosh, Shalini and Bulyko, Ivan and Stolcke, Andreas},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
```

```bib
@inproceedings{chen2023hyporadise,
  title={HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models},
  author={Chen, Chen and Hu, Yuchen and Yang, Chao-Han Huck and Siniscalchi, Sabato Marco and Chen, Pin-Yu and Chng, Ensiong},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}
```

Provider:

PeacefulData

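Summary of original information

Dataset name: Pilot dataset for multi-domain ASR corrections

Description

This dataset is a pilot version of a larger dataset for automatic speech recognition (ASR) corrections across multiple domains. It contains paired hypotheses and corrected transcriptions consolidated from PeacefulData/HyPoradise-v0.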

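Structure

Data split

The dataset is divided into training and test splits:

Training data: 281,082 entries
- Approximately 6,255,198 tokens for transcriptions
- Approximately 31,211,083 tokens for concatenated hypotheses
Test data: 16,108 entries
- Approximately 327,750 tokens for transcriptions
- Approximately 1,629,093 tokens for concatenated hypotheses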

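Columns

hypothesis: N-best hypotheses from beam search.
transcription: Corrected ASR transcription.
hypothesis_concatenated: An alternative version of the text output.
source: The source of the text entry, indicating the origin dataset.
prompt: Instructional prompt for the correction task.
score: An acoustic model score (not all entries have this).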

Source datasets

The dataset combines entries from various sources:

Training sources:
- train_td3: 50,000 entries
- train_other_500: 50,000 entries
- train_cv: 47,293 entries
- train_lrs2: 42,940 entries
- train_wsj_score: 37,514 entries
- train_swbd: 36,539 entries
- train_chime4: 9,600 entries
- train_atis: 3,964 entries
- train_coraal: 3,232 entries
Test sources:
- test_ls_other: 2,939 entries
- test_ls_clean: 2,620 entries
- test_lrs2: 2,259 entries
- test_swbd: 2,000 entries
- test_cv: 2,000 entries
- test_chime4: 1,320 entries
- test_td3: 1,155 entries
- test_wsj_score: 836 entries
- test_atis: 809 entries
- test_coraal: 170 entries
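As a quick arithmetic check, the per-source counts listed above can be summed and compared with the split sizes stated under the data split. The snippet below simply re-enters the numbers from this page; the variable names are illustrative.

```python
# Per-source entry counts, copied from the lists above.
train_sources = {
    "train_td3": 50_000,
    "train_other_500": 50_000,
    "train_cv": 47_293,
    "train_lrs2": 42_940,
    "train_wsj_score": 37_514,
    "train_swbd": 36_539,
    "train_chime4": 9_600,
    "train_atis": 3_964,
    "train_coraal": 3_232,
}
test_sources = {
    "test_ls_other": 2_939,
    "test_ls_clean": 2_620,
    "test_lrs2": 2_259,
    "test_swbd": 2_000,
    "test_cv": 2_000,
    "test_chime4": 1_320,
    "test_td3": 1_155,
    "test_wsj_score": 836,
    "test_atis": 809,
    "test_coraal": 170,
}

train_total = sum(train_sources.values())
test_total = sum(test_sources.values())
print(train_total, test_total)  # prints: 281082 16108
```

The sums agree with the 281,082 training and 16,108 test entries given in the description. The card's YAML metadata, however, lists `num_examples` of 286,366 and 18,237; the source does not explain this difference, so it may be worth checking the loaded split sizes directly.
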

Access

The dataset can be accessed and downloaded through the HuggingFace Datasets library. Use the following command to load the dataset:

    from datasets import load_dataset
    dataset = load_dataset("PeacefulData/HyPoradise-pilot")
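Once loaded, a common next step is to break a split down by its `source` column. The following is a minimal sketch using plain Python on any iterable of row dictionaries; the helper names `count_by_source` and `select_source` are hypothetical, the toy rows are made up, and it assumes the `source` values match the names listed on this page (for example `test_atis`), which has not been verified against the data.

```python
from collections import Counter


def count_by_source(records):
    """Count rows per value of the `source` column."""
    return Counter(record["source"] for record in records)


def select_source(records, source_name):
    """Keep only the rows that come from one origin dataset."""
    return [record for record in records if record["source"] == source_name]


# Made-up rows standing in for a loaded split (not real dataset content).
toy_rows = [
    {"source": "test_atis", "transcription": "show me flights to boston"},
    {"source": "test_cv", "transcription": "hello world"},
    {"source": "test_atis", "transcription": "what is the fare"},
]

counts = count_by_source(toy_rows)
atis_rows = select_source(toy_rows, "test_atis")

# With the real data, assuming the load succeeded, one could try:
#   counts = count_by_source(dataset["test"])
```

Iterating a split row by row is simple but can be slow on the full training set; reading the `source` column once and passing it to `Counter` would likely be faster.
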
