asr-alignment
Source: ModelScope community. Updated 2025-05-05; indexed 2025-03-15.
Download link: https://modelscope.cn/datasets/pengzhendong/asr-alignment
# Speech Recognition Alignment Dataset
This dataset is a variant of several widely used ASR datasets: LibriSpeech, MuST-C, TED-LIUM, VoxPopuli, Common Voice, and GigaSpeech. Unlike the originals, it includes:
- Precise alignment between audio and text.
- Punctuated, case-sensitive text.
- Identification of named entities in the text.
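
To illustrate what word-level alignment makes possible, here is a minimal sketch that cuts per-word audio segments out of a waveform using start and end times. The `WordAlignment` structure, its field names, and the 16 kHz sampling rate are assumptions made for this example only; this card does not document the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class WordAlignment:
    """Hypothetical alignment record; not the dataset's documented schema."""
    word: str
    start: float  # start time in seconds
    end: float    # end time in seconds


def slice_samples(audio, sampling_rate, start, end):
    """Return the audio samples that fall between start and end (in seconds)."""
    first = int(round(start * sampling_rate))
    last = int(round(end * sampling_rate))
    return audio[first:last]


sampling_rate = 16000                 # assumed sampling rate
audio = [0.0] * (sampling_rate * 2)   # two seconds of silence as a placeholder

words = [
    WordAlignment("Hello", 0.25, 0.5),
    WordAlignment("world", 0.5, 1.0),
]

# Map each word to the slice of the waveform it is aligned with.
segments = {w.word: slice_samples(audio, sampling_rate, w.start, w.end) for w in words}

for w in words:
    print(w.word, len(segments[w.word]), "samples")
```

With real data, the placeholder waveform and the hand-written alignment list would be replaced by whatever fields a loaded sample actually provides.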
## Usage
First, install the latest version of the 🤗 Datasets package:
```bash
pip install --upgrade pip
pip install --upgrade "datasets[audio]"
```
The dataset can be downloaded and pre-processed on disk using the [`load_dataset`](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/loading_methods#datasets.load_dataset)
function:
```python
from datasets import load_dataset
# Available subsets: 'libris', 'mustc', 'tedlium', 'voxpopuli', 'commonvoice', 'gigaspeech'
dataset = load_dataset("nguyenvulebinh/asr-alignment", "libris")
# take the first sample of the train split
sample = dataset["train"][0]
```
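
This card does not list the column names, so it can help to inspect a sample before relying on any field. The sketch below summarizes the type of each field in a dictionary-like sample. The helper `describe_sample` and the stand-in sample are hypothetical and only illustrate the idea; the real fields may differ.

```python
def describe_sample(sample):
    """Map each field name to a short description of its value's type."""
    summary = {}
    for key, value in sample.items():
        if isinstance(value, (list, tuple)):
            summary[key] = f"{type(value).__name__}[{len(value)}]"
        elif isinstance(value, dict):
            summary[key] = f"dict({', '.join(sorted(value))})"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in sample with made-up fields; with the real dataset,
# pass dataset["train"][0] instead.
example = {"id": "utt-0", "text": "Hello world.", "words": ["Hello", "world."]}

summary = describe_sample(example)
print(summary)
```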
It can also be streamed directly from the Hub using Datasets' [streaming mode](https://huggingface.co/blog/audio-datasets#streaming-mode-the-silver-bullet).
Streaming mode loads one sample at a time rather than downloading the entire
dataset to disk:
```python
from datasets import load_dataset
dataset = load_dataset("nguyenvulebinh/asr-alignment", "libris", streaming=True)
# take the first sample of the train split
sample = next(iter(dataset["train"]))
```
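
A streamed split is consumed as an iterator, so the first few samples can be taken without reading the rest. The sketch below shows the pattern with `itertools.islice` on a stand-in generator; `fake_stream` and `head` are hypothetical names used only for this example. The 🤗 Datasets library also provides `IterableDataset.take` for the same purpose.

```python
from itertools import islice


def head(stream, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(stream, n))


def fake_stream():
    """Stand-in for a streamed split; yields made-up samples lazily."""
    for i in range(1000):
        yield {"id": i}


# With the real dataset, pass dataset["train"] instead of fake_stream().
first_three = head(fake_stream(), 3)
print(first_three)
```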
## Citation
If you use this data, please consider citing the ICASSP 2024 paper "Synthetic Conversations Improve Multi-Talker ASR":
```bibtex
@INPROCEEDINGS{synthetic-multi-asr-nguyen,
author={Nguyen, Thai-Binh and Waibel, Alexander},
booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={SYNTHETIC CONVERSATIONS IMPROVE MULTI-TALKER ASR},
year={2024},
volume={},
number={},
}
```
## License
This dataset is licensed under the terms of the original datasets it is derived from.
Provider: maas
Created: 2025-03-12



