distil-whisper/gigaspeech-l-token-ids

Name: distil-whisper/gigaspeech-l-token-ids
Creator: distil-whisper
Published: 2023-10-11 09:44:39
License: No description provided

Source: Hugging Face. Updated 2023-10-11. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/distil-whisper/gigaspeech-l-token-ids

Resource description:

---
license: other
task_categories:
- automatic-speech-recognition
language:
- en
extra_gated_prompt: |-
  SpeechColab does not own the copyright of the audio files. For researchers and educators who wish to use the audio files for non-commercial research and/or educational purposes, we can provide access through the Hub under certain conditions and terms.

  Terms of Access:

  The "Researcher" has requested permission to use the GigaSpeech database (the "Database") at Tsinghua University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:

  1. Researcher shall use the Database only for non-commercial research and educational purposes.
  2. The SpeechColab team and Tsinghua University make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
  3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the SpeechColab team and Tsinghua University, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted audio files that he or she may create from the Database.
  4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
  5. The SpeechColab team and Tsinghua University reserve the right to terminate Researcher's access to the Database at any time.
  6. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.

  Please also fill out the Google Form https://forms.gle/UuGQAPyscGRrUMLq6 to request access to the GigaSpeech dataset.
extra_gated_fields:
  Name: text
  Email: text
  Organization: text
  Address: text
  I hereby confirm that I have requested access via the Google Form provided above: checkbox
  I accept the terms of access: checkbox
---

# Distil Whisper: GigaSpeech

This is a variant of the [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) dataset, augmented to return the pseudo-labelled Whisper Transcriptions alongside the original dataset elements. The pseudo-labelled transcriptions were generated by labelling the input audio data with the Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2) model with *greedy* sampling. For information on how the original dataset was curated, refer to the original [dataset card](https://huggingface.co/datasets/speechcolab/gigaspeech).

## Standalone Usage

First, install the latest version of the 🤗 Datasets package:

```bash
pip install --upgrade pip
pip install --upgrade datasets[audio]
```

The dataset can be downloaded and pre-processed on disk using the [`load_dataset`](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset) function:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/gigaspeech-l", "l")
# take the first sample of the validation set
sample = dataset["validation"][0]
```

It can also be streamed directly from the Hub using Datasets' [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet). Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/gigaspeech-l", "l", streaming=True)
# take the first sample of the validation set
sample = next(iter(dataset["validation"]))
```

## Distil Whisper Usage

To use this dataset to reproduce a Distil Whisper training run, refer to the instructions on the [Distil Whisper repository](https://github.com/huggingface/distil-whisper#training).

## License

This dataset is licensed under custom terms. To view the custom license for this dataset, refer to the original [dataset card](https://huggingface.co/datasets/speechcolab/gigaspeech).
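Streaming mode, as described under Standalone Usage, yields samples one at a time, so a few examples can be pulled without iterating over the whole split. The following is a minimal sketch, assuming access to the gated dataset has already been granted and the machine is authenticated with the Hub. `take_first` and `preview_validation` are hypothetical helper names, not part of 🤗 Datasets.

```python
from itertools import islice


def take_first(stream, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(stream, n))


def preview_validation(n=3):
    """Stream the validation split and return its first n samples.

    Assumes network access and accepted terms for the gated dataset.
    """
    # Imported lazily so take_first works even without 🤗 Datasets installed.
    from datasets import load_dataset

    dataset = load_dataset("distil-whisper/gigaspeech-l", "l", streaming=True)
    return take_first(dataset["validation"], n)
```

Calling `preview_validation(3)` should return a list of three sample dictionaries. Only those samples are pulled from the stream, which is the main reason to prefer streaming for a quick inspection.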

Provider:

distil-whisper

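Summary of original information

Distil Whisper: GigaSpeech

This is a variant of the GigaSpeech dataset, augmented to return pseudo-labelled Whisper transcriptions alongside the original dataset elements. The pseudo-labelled transcriptions were generated by labelling the input audio with the Whisper large-v2 model using greedy sampling. For information on how the original dataset was curated, refer to the original dataset card.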

Standalone usage

First, install the latest version of the 🤗 Datasets package:

```bash
pip install --upgrade pip
pip install --upgrade datasets[audio]
```

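The dataset can be downloaded and pre-processed on disk using the load_dataset function:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/gigaspeech-l", "l")

# take the first sample of the validation set
sample = dataset["validation"][0]
```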
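The card does not list the column names, so one quick way to see what a loaded sample contains is to map each field to its Python type. This is a minimal sketch: `describe_sample` is a hypothetical helper, and the example dictionary is made up for illustration rather than taken from the dataset.

```python
def describe_sample(sample):
    """Map each field name of a sample dict to the name of its value's type."""
    return {key: type(value).__name__ for key, value in sample.items()}


# Illustrative stand-in for a dataset row; these field names are hypothetical.
fake_sample = {"text": "hello world", "audio": {"sampling_rate": 16000}}
print(describe_sample(fake_sample))
```

Passing the `sample` obtained from the validation split to `describe_sample` would list the fields the dataset actually returns.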

It can also be streamed directly from the Hub using Datasets' streaming mode:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/gigaspeech-l", "l", streaming=True)

# take the first sample of the validation set
sample = next(iter(dataset["validation"]))
```

Distil Whisper usage

To use this dataset to reproduce a Distil Whisper training run, refer to the instructions in the Distil Whisper repository.

License

This dataset is licensed under custom terms. To view the custom license for this dataset, refer to the original dataset card.
