distil-whisper/gigaspeech-l-token-ids
Source: Hugging Face. Updated 2023-10-11; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/distil-whisper/gigaspeech-l-token-ids
Resource description:
---
license: other
task_categories:
- automatic-speech-recognition
language:
- en
extra_gated_prompt: |-
  SpeechColab does not own the copyright of the audio files. For researchers and educators who wish to use the audio files for non-commercial research and/or educational purposes, we can provide access through the Hub under certain conditions and terms.
  Terms of Access:
  The "Researcher" has requested permission to use the GigaSpeech database (the "Database") at Tsinghua University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:
  1. Researcher shall use the Database only for non-commercial research and educational purposes.
  2. The SpeechColab team and Tsinghua University make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
  3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the SpeechColab team and Tsinghua University, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted audio files that he or she may create from the Database.
  4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
  5. The SpeechColab team and Tsinghua University reserve the right to terminate Researcher's access to the Database at any time.
  6. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
  Please also fill out the Google Form https://forms.gle/UuGQAPyscGRrUMLq6 to request access to the GigaSpeech dataset.
extra_gated_fields:
  Name: text
  Email: text
  Organization: text
  Address: text
  I hereby confirm that I have requested access via the Google Form provided above: checkbox
  I accept the terms of access: checkbox
---
# Distil Whisper: GigaSpeech
This is a variant of the [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) dataset, augmented to return the pseudo-labelled Whisper
transcriptions alongside the original dataset elements. The pseudo-labelled transcriptions were generated by
labelling the input audio data with the Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2)
model with *greedy* sampling. For information on how the original dataset was curated, refer to the original
[dataset card](https://huggingface.co/datasets/speechcolab/gigaspeech).
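To make the term *greedy* concrete, here is a minimal toy sketch of greedy decoding: at every step the single highest-scoring token is chosen, with no random sampling and no beam search. It is not the Whisper implementation; `greedy_decode`, `toy_model`, and the score table are hypothetical names and values used only for illustration.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[List[int]], Sequence[float]],
    eos_token_id: int,
    max_new_tokens: int = 16,
) -> List[int]:
    """Pick the highest-scoring token at every step until EOS or the length limit."""
    tokens: List[int] = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        # Argmax over the vocabulary: the defining step of greedy decoding.
        best = max(range(len(scores)), key=lambda i: scores[i])
        tokens.append(best)
        if best == eos_token_id:
            break
    return tokens


# Hypothetical stand-in for a model: one row of scores per decoding step.
TOY_SCORES = [
    [0.1, 0.7, 0.2, 0.0],
    [0.2, 0.1, 0.6, 0.1],
    [0.0, 0.1, 0.1, 0.8],  # token 3 plays the role of end-of-sequence
]


def toy_model(prefix: List[int]) -> Sequence[float]:
    """Return the scores for the next token given the tokens decoded so far."""
    return TOY_SCORES[min(len(prefix), len(TOY_SCORES) - 1)]


print(greedy_decode(toy_model, eos_token_id=3))  # → [1, 2, 3]
```

Because the choice at each step is deterministic, running the same model over the same audio yields the same pseudo-labels every time.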
## Standalone Usage
First, install the latest version of the 🤗 Datasets package:
```bash
pip install --upgrade pip
pip install --upgrade "datasets[audio]"
```
The dataset can be downloaded and pre-processed on disk using the [`load_dataset`](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset)
function:
```python
from datasets import load_dataset
dataset = load_dataset("distil-whisper/gigaspeech-l", "l")
# take the first sample of the validation set
sample = dataset["validation"][0]
```
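Once a sample is loaded, it can help to check which fields it contains before building a pipeline around it. The sketch below is a hypothetical helper, `summarize_sample`, that describes each field of a sample dictionary. It assumes audio is decoded into a dictionary with `array` and `sampling_rate` keys, which may differ between Datasets versions, and the column names in `fake_sample` are placeholders rather than the confirmed schema of this dataset.

```python
from typing import Any, Dict


def summarize_sample(sample: Dict[str, Any]) -> Dict[str, str]:
    """Return a short, printable description of each field in one sample."""
    summary: Dict[str, str] = {}
    for name, value in sample.items():
        if isinstance(value, dict) and "array" in value and "sampling_rate" in value:
            # Assumed audio layout: raw waveform plus its sampling rate.
            seconds = len(value["array"]) / value["sampling_rate"]
            summary[name] = f"audio, {seconds:.2f} s at {value['sampling_rate']} Hz"
        elif isinstance(value, str):
            summary[name] = f"str, {len(value)} chars"
        elif isinstance(value, (list, tuple)):
            summary[name] = f"{type(value).__name__}, {len(value)} items"
        else:
            summary[name] = type(value).__name__
    return summary


# Made-up sample for illustration only; real column names and values may differ.
fake_sample = {
    "audio": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "HELLO WORLD",
    "whisper_transcript": [1, 2, 3],
}

for name, description in summarize_sample(fake_sample).items():
    print(f"{name}: {description}")
```

With a real dataset, the same helper could be called as `summarize_sample(sample)` on the `sample` loaded above.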
It can also be streamed directly from the Hub using Datasets' [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet).
Loading a dataset in streaming mode fetches one sample at a time, rather than downloading the entire
dataset to disk:
```python
from datasets import load_dataset
dataset = load_dataset("distil-whisper/gigaspeech-l", "l", streaming=True)
# take the first sample of the validation set
sample = next(iter(dataset["validation"]))
```
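A streamed split is consumed as an iterator, so pulling a handful of samples is a matter of stopping early. The following sketch uses `itertools.islice` from the standard library; `take` and `fake_stream` are hypothetical names, and the generator only stands in for a streamed split so that the example runs without network access.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take(samples: Iterable[T], n: int) -> List[T]:
    """Collect at most the first n items without exhausting the iterable."""
    return list(islice(samples, n))


def fake_stream() -> Iterator[Dict[str, int]]:
    """Endless stand-in for a streamed split (illustration only)."""
    index = 0
    while True:
        yield {"id": index}
        index += 1


first_three = take(fake_stream(), 3)
print([sample["id"] for sample in first_three])  # → [0, 1, 2]
```

With the streaming dataset from the block above, the equivalent call would be `take(dataset["validation"], 3)`.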
## Distil Whisper Usage
To use this dataset to reproduce a Distil Whisper training run, refer to the instructions on the
[Distil Whisper repository](https://github.com/huggingface/distil-whisper#training).
## License
This dataset is licensed under custom terms. To view the custom license for this dataset, refer to the original [dataset card](https://huggingface.co/datasets/speechcolab/gigaspeech).
Provided by:
distil-whisper