
distil-whisper/ami-sdm-timestamped

Hugging Face, updated 2023-09-25, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/distil-whisper/ami-sdm-timestamped
Resource description:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- en
pretty_name: AMI SDM
---

# Distil Whisper: AMI SDM With Timestamps

This is a variant of the [AMI SDM](https://huggingface.co/datasets/edinburghcstr/ami) dataset, augmented to return the pseudo-labelled Whisper transcriptions alongside the original dataset elements. The pseudo-labelled transcriptions were generated by labelling the input audio data with the Whisper [large-v2](https://huggingface.co/openai/whisper-large-v2) model with *greedy* sampling and timestamp prediction. For information on how the original dataset was curated, refer to the original [dataset card](https://huggingface.co/datasets/edinburghcstr/ami).

## Standalone Usage

First, install the latest version of the 🤗 Datasets package:

```bash
pip install --upgrade pip
pip install --upgrade datasets[audio]
```

The dataset can be downloaded and pre-processed on disk using the [`load_dataset`](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset) function:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/ami-sdm", "sdm")
# take the first sample of the validation set
sample = dataset["validation"][0]
```

It can also be streamed directly from the Hub using Datasets' [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet). Streaming loads individual samples one at a time rather than downloading the entire dataset to disk:

```python
from datasets import load_dataset

dataset = load_dataset("distil-whisper/ami-sdm", "sdm", streaming=True)
# take the first sample of the validation set
sample = next(iter(dataset["validation"]))
```

## Distil Whisper Usage

To use this dataset to reproduce a Distil Whisper training run, refer to the instructions on the [Distil Whisper repository](https://github.com/huggingface/distil-whisper#training).

## License

This dataset is licensed under cc-by-4.0.

This is a variant of the AMI SDM dataset, augmented with pseudo-labelled Whisper transcriptions. These transcriptions were generated using the Whisper large-v2 model with greedy sampling and timestamp prediction. The dataset supports automatic speech recognition tasks and includes audio data in English. It can be used in both standalone and streaming modes for various applications, including training models like Distil Whisper.
Provider:
distil-whisper
Summary of the original information

Distil Whisper: AMI SDM With Timestamps

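Dataset overview

  • Name: Distil Whisper: AMI SDM With Timestamps
  • Type: automatic speech recognition dataset
  • Language: English
  • License: cc-by-4.0

Dataset description

This dataset is a variant of the AMI SDM dataset that returns pseudo-labelled Whisper transcriptions alongside the original dataset elements. The pseudo-labelled transcriptions were generated with the Whisper large-v2 model using greedy sampling and timestamp prediction.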
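Because the pseudo-labels were produced with timestamp prediction, a transcription can be split into timed segments. The sketch below assumes the transcription is a plain string with Whisper-style timestamp tokens such as `<|0.00|>` embedded in it; the helper name `split_timestamped_text` and the example string are hypothetical, and the column that holds the transcription in this dataset (possibly `whisper_transcript`) is an assumption that should be checked against the loaded sample.

```python
import re
from typing import List, Tuple

# Whisper renders predicted timestamps as tokens such as <|0.00|> or <|12.34|>.
_TIMESTAMP = re.compile(r"<\|(\d+(?:\.\d+)?)\|>")


def split_timestamped_text(text: str) -> List[Tuple[float, float, str]]:
    """Split a timestamped transcription into (start, end, text) segments."""
    segments = []
    matches = list(_TIMESTAMP.finditer(text))
    # Pair each timestamp token with the one that follows it; the text
    # between the two tokens is the content of that segment.
    for opening, closing in zip(matches, matches[1:]):
        chunk = text[opening.end():closing.start()].strip()
        if chunk:  # back-to-back tokens with no text mark a segment boundary
            segments.append(
                (float(opening.group(1)), float(closing.group(1)), chunk)
            )
    return segments


# Hypothetical transcription, not taken from the dataset.
example = "<|0.00|> Okay, so let's get started.<|2.40|><|2.40|> Right.<|3.10|>"
for start, end, chunk in split_timestamped_text(example):
    print(f"{start:6.2f} -> {end:6.2f}  {chunk}")
```

With a real sample, the same function could be applied to whichever field holds the timestamped transcription once its name has been confirmed.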

Usage

Standalone usage

  1. Install the latest version of the 🤗 Datasets package:

    pip install --upgrade pip
    pip install --upgrade datasets[audio]

  2. Download and pre-process the dataset with the load_dataset function:

    from datasets import load_dataset

    dataset = load_dataset("distil-whisper/ami-sdm", "sdm")
    # take the first sample of the validation set
    sample = dataset["validation"][0]

  3. Alternatively, stream the dataset directly from the Hub:

    from datasets import load_dataset

    dataset = load_dataset("distil-whisper/ami-sdm", "sdm", streaming=True)
    # take the first sample of the validation set
    sample = next(iter(dataset["validation"]))
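In streaming mode a split behaves like a lazy iterable, so a handful of samples can be inspected without reading the rest. The following is a minimal sketch of that pattern: `peek` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in generator so the example runs without network access.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def peek(samples: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n samples of an iterable without reading the rest."""
    return list(islice(samples, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streamed split: yields hypothetical samples lazily."""
    for i in range(1_000_000):
        yield {"id": i, "text": f"utterance {i}"}


first = peek(fake_stream(), n=3)
print([sample["id"] for sample in first])
```

With the real dataset loaded using `streaming=True`, `peek(dataset["validation"], n=3)` should work the same way, since a streamed split is iterable; the `take` method of 🤗 Datasets' `IterableDataset` is a built-in alternative.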

Distil Whisper usage

To use this dataset to reproduce a Distil Whisper training run, refer to the instructions in the Distil Whisper repository.

License

This dataset is licensed under cc-by-4.0.
