distil-whisper/ami-sdm-timestamped
收藏Hugging Face2023-09-25 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/distil-whisper/ami-sdm-timestamped
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- en
-pretty_name: AMI SDM
---
# Distil Whisper: AMI SDM With Timestamps
This is a variant of the [AMI SDM](https://huggingface.co/datasets/edinburghstr/ami) dataset, augmented to return the pseudo-labelled Whisper
Transcriptions alongside the original dataset elements. The pseudo-labelled transcriptions were generated by
labelling the input audio data with the Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2)
model with *greedy* sampling and timestamp prediction. For information on how the original dataset was curated, refer to the original
[dataset card](https://huggingface.co/datasets/edinburghstr/ami).
## Standalone Usage
First, install the latest version of the 🤗 Datasets package:
```bash
pip install --upgrade pip
pip install --upgrade datasets[audio]
```
The dataset can be downloaded and pre-processed on disk using the [`load_dataset`](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset)
function:
```python
from datasets import load_dataset
dataset = load_dataset("distil-whisper/ami-sdm", "sdm")
# take the first sample of the validation set
sample = dataset["validation"][0]
```
It can also be streamed directly from the Hub using Datasets' [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet).
Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire
dataset to disk:
```python
from datasets import load_dataset
dataset = load_dataset("distil-whisper/ami-sdm", "sdm", streaming=True)
# take the first sample of the validation set
sample = next(iter(dataset["validation"]))
```
## Distil Whisper Usage
To use this dataset to reproduce a Distil Whisper training run, refer to the instructions on the
[Distil Whisper repository](https://github.com/huggingface/distil-whisper#training).
## License
This dataset is licensed under cc-by-4.0.
This is a variant of the AMI SDM dataset, augmented with pseudo-labelled Whisper transcriptions. These transcriptions were generated using the Whisper large-v2 model with greedy sampling and timestamp prediction. The dataset supports automatic speech recognition tasks and includes audio data in English. It can be used in both standalone and streaming modes for various applications, including training models like Distil Whisper.
提供机构:
distil-whisper
原始信息汇总
Distil Whisper: AMI SDM With Timestamps
数据集概述
- 名称: Distil Whisper: AMI SDM With Timestamps
- 类型: 自动语音识别数据集
- 语言: 英语
- 许可证: cc-by-4.0
数据集描述
该数据集是AMI SDM数据集的变体,增加了伪标记的Whisper转录本以及原始数据集元素。伪标记转录本是通过使用Whisper large-v2模型进行贪婪采样和时间戳预测生成的。
使用方法
独立使用
-
安装最新版本的🤗 Datasets包: bash pip install --upgrade pip pip install --upgrade datasets[audio]
-
使用
load_dataset函数下载和预处理数据集: python from datasets import load_datasetdataset = load_dataset("distil-whisper/ami-sdm", "sdm") sample = dataset["validation"][0]
-
也可以通过流模式直接从Hub加载数据集: python from datasets import load_dataset
dataset = load_dataset("distil-whisper/ami-sdm", "sdm", streaming=True) sample = next(iter(dataset["validation"]))
Distil Whisper 使用
如需使用此数据集复现Distil Whisper训练运行,请参考Distil Whisper仓库中的说明。
许可证
该数据集遵循cc-by-4.0许可证。



