
PeacefulData/HyPoradise-pilot

Hugging Face, updated 2024-04-29, indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/PeacefulData/HyPoradise-pilot
Resource description:
---
license: apache-2.0
dataset_info:
  features:
  - name: hypothesis
    sequence: string
  - name: transcription
    dtype: string
  - name: input1
    dtype: string
  - name: hypothesis_concatenated
    dtype: string
  - name: source
    dtype: string
  - name: id
    dtype: string
  - name: dummy_str
    dtype: string
  - name: dummy_list
    sequence: 'null'
  - name: prompt
    dtype: string
  splits:
  - name: train
    num_bytes: 469026382
    num_examples: 286366
  - name: test
    num_bytes: 24101011
    num_examples: 18237
  download_size: 125245485
  dataset_size: 493127393
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Dataset Name: Pilot dataset for Multi-domain ASR corrections

<p align="center"> <img src="hypilot.jpg" height="100"> </p>

## Description

This dataset is a pilot version of a larger dataset for automatic speech recognition (ASR) corrections across multiple domains. It contains paired hypotheses and corrected transcriptions for various ASR tasks consolidated from [PeacefulData/HyPoradise-v0](https://huggingface.co/datasets/PeacefulData/HyPoradise-v0).

## Structure

### Data Split

The dataset is divided into training and test splits:

- Training Data: 281,082 entries
  - Approximately 6,255,198 tokens for transcriptions
  - Approximately 31,211,083 tokens for concatenated hypotheses
- Test Data: 16,108 entries
  - Approximately 327,750 tokens for transcriptions
  - Approximately 1,629,093 tokens for concatenated hypotheses

### Columns

- `hypothesis`: N-best hypothesis from beam search.
- `transcription`: Corrected ASR transcription.
- `hypothesis_concatenated`: An alternative version of the text output.
- `source`: The source of the text entry, indicating the origin dataset.
- `prompt`: Instructional prompt for correction task.
- `score`: An acoustic model score (not all entries have this).
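As an illustration of how the `hypothesis` and `transcription` columns relate, the sketch below computes a word error rate for the first hypothesis of a record against its corrected transcription. It assumes the N-best list is ordered best-first, which the card does not state. The helper names and the example record are hypothetical and not taken from the dataset.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def first_hypothesis_wer(record):
    """Word error rate of record["hypothesis"][0] against record["transcription"]."""
    ref = record["transcription"].split()
    hyp = record["hypothesis"][0].split()
    return word_edit_distance(ref, hyp) / max(len(ref), 1)


# Hypothetical record, invented for illustration only.
example = {
    "hypothesis": ["show me flights to boston", "show me flight to boston"],
    "transcription": "show me the flights to boston",
}
print(f"WER of first hypothesis: {first_hypothesis_wer(example):.3f}")
```

A correction model trained on this data would aim to produce output with a lower error rate than the first hypothesis alone.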
### Source Datasets

The dataset combines entries from various sources:

- **Training Sources**:
  - `train_td3`: 50,000 entries
  - `train_other_500`: 50,000 entries
  - `train_cv`: 47,293 entries
  - `train_lrs2`: 42,940 entries
  - `train_wsj_score`: 37,514 entries
  - `train_swbd`: 36,539 entries
  - `train_chime4`: 9,600 entries
  - `train_atis`: 3,964 entries
  - `train_coraal`: 3,232 entries
- **Test Sources**:
  - `test_ls_other`: 2,939 entries
  - `test_ls_clean`: 2,620 entries
  - `test_lrs2`: 2,259 entries
  - `test_swbd`: 2,000 entries
  - `test_cv`: 2,000 entries
  - `test_chime4`: 1,320 entries
  - `test_td3`: 1,155 entries
  - `test_wsj_score`: 836 entries
  - `test_atis`: 809 entries
  - `test_coraal`: 170 entries

## Access

The dataset can be accessed and downloaded through the HuggingFace Datasets library. Use the following command to load the dataset:

```python
from datasets import load_dataset

dataset = load_dataset("PeacefulData/HyPoradise-pilot")
```

## Acknowledgments

This dataset is consolidated from the PeacefulData/HyPoradise-v0 dataset. Thanks to the original creators for making this data available.

### References

```bib
@inproceedings{yang2023generative,
  title={Generative speech recognition error correction with large language models and task-activating prompting},
  author={Yang, Chao-Han Huck and Gu, Yile and Liu, Yi-Chieh and Ghosh, Shalini and Bulyko, Ivan and Stolcke, Andreas},
  booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  pages={1--8},
  year={2023},
  organization={IEEE}
}
```

```bib
@inproceedings{chen2023hyporadise,
  title={HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models},
  author={CHEN, CHEN and Hu, Yuchen and Yang, Chao-Han Huck and Siniscalchi, Sabato Marco and Chen, Pin-Yu and Chng, Ensiong},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}
```
Provider:
PeacefulData
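Summary of original information

Dataset name: Pilot dataset for multi-domain ASR corrections

Description

This dataset is a pilot version of a larger dataset for automatic speech recognition (ASR) corrections across multiple domains. It contains paired hypotheses and corrected transcriptions from PeacefulData/HyPoradise-v0.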

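Structure

Data split

The dataset is divided into training and test splits:

  • Training data: 281,082 entries
    • Approximately 6,255,198 tokens for transcriptions
    • Approximately 31,211,083 tokens for concatenated hypotheses
  • Test data: 16,108 entries
    • Approximately 327,750 tokens for transcriptions
    • Approximately 1,629,093 tokens for concatenated hypotheses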

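Columns

  • hypothesis: N-best hypotheses from beam search.
  • transcription: Corrected ASR transcription.
  • hypothesis_concatenated: An alternative version of the text output.
  • source: The source of the text entry, indicating the origin dataset.
  • prompt: Instructional prompt for the correction task.
  • score: An acoustic model score (not all entries have this).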

Source datasets

The dataset combines entries from various sources:

  • Training sources
    • train_td3: 50,000 entries
    • train_other_500: 50,000 entries
    • train_cv: 47,293 entries
    • train_lrs2: 42,940 entries
    • train_wsj_score: 37,514 entries
    • train_swbd: 36,539 entries
    • train_chime4: 9,600 entries
    • train_atis: 3,964 entries
    • train_coraal: 3,232 entries
  • Test sources
    • test_ls_other: 2,939 entries
    • test_ls_clean: 2,620 entries
    • test_lrs2: 2,259 entries
    • test_swbd: 2,000 entries
    • test_cv: 2,000 entries
    • test_chime4: 1,320 entries
    • test_td3: 1,155 entries
    • test_wsj_score: 836 entries
    • test_atis: 809 entries
    • test_coraal: 170 entries
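The per-source counts listed above can be checked against the split sizes given under the data split. The following sketch simply sums the listed figures; the variable names are illustrative.

```python
# Per-source entry counts, copied from the lists above.
train_sources = {
    "train_td3": 50_000,
    "train_other_500": 50_000,
    "train_cv": 47_293,
    "train_lrs2": 42_940,
    "train_wsj_score": 37_514,
    "train_swbd": 36_539,
    "train_chime4": 9_600,
    "train_atis": 3_964,
    "train_coraal": 3_232,
}
test_sources = {
    "test_ls_other": 2_939,
    "test_ls_clean": 2_620,
    "test_lrs2": 2_259,
    "test_swbd": 2_000,
    "test_cv": 2_000,
    "test_chime4": 1_320,
    "test_td3": 1_155,
    "test_wsj_score": 836,
    "test_atis": 809,
    "test_coraal": 170,
}

train_total = sum(train_sources.values())
test_total = sum(test_sources.values())
print(f"training entries: {train_total:,}")  # 281,082
print(f"test entries: {test_total:,}")       # 16,108
```

Both sums match the split sizes stated in the description (281,082 training entries and 16,108 test entries).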

Access

The dataset can be accessed and downloaded through the HuggingFace Datasets library. Use the following command to load the dataset:

    from datasets import load_dataset

    dataset = load_dataset("PeacefulData/HyPoradise-pilot")
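Beyond loading the whole dataset, a single split can be requested and its records tallied by origin. This is a minimal sketch, assuming the `source` column holds the source names listed earlier. The function names `count_sources` and `load_test_split` are hypothetical helpers, and the download step needs the `datasets` package and network access.

```python
from collections import Counter


def count_sources(records):
    """Tally records by the value of their `source` field."""
    return Counter(record["source"] for record in records)


def load_test_split():
    """Fetch only the test split (downloads the data on first use)."""
    from datasets import load_dataset  # imported here so the tally helper works offline

    return load_dataset("PeacefulData/HyPoradise-pilot", split="test")


# Usage (not run here because it downloads the data):
#   counts = count_sources(load_test_split())
#   for name, n in counts.most_common():
#       print(name, n)
```

Comparing the resulting tallies with the per-source counts listed above is a quick way to confirm that the downloaded copy matches the card.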
