mythicinfinity/librispeech-pc
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mythicinfinity/librispeech-pc
Resource description:
---
pretty_name: 'LibriSpeech-PC'
annotations_creators:
- machine-generated
language_creators:
- crowdsourced
- expert-generated
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
source_datasets:
- extended
task_categories:
- automatic-speech-recognition
dataset_info:
- config_name: 'clean'
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: text_raw
    dtype: string
  - name: text_normalized
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: 'train.100'
    num_bytes: 5419059024
  - name: 'train.360'
    num_bytes: 21129046690
  - name: 'validation'
    num_bytes: 311726621
  - name: 'test'
    num_bytes: 319785733
- config_name: 'other'
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: text_raw
    dtype: string
  - name: text_normalized
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: 'train.500'
    num_bytes: 27664027828
  - name: 'validation'
    num_bytes: 292740028
  - name: 'test'
    num_bytes: 317438639
- config_name: 'all'
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: text_raw
    dtype: string
  - name: text_normalized
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: 'train.clean.100'
    num_bytes: 5419059024
  - name: 'train.clean.360'
    num_bytes: 21129046690
  - name: 'train.other.500'
    num_bytes: 27664027828
  - name: 'validation.clean'
    num_bytes: 311726621
  - name: 'validation.other'
    num_bytes: 292740028
  - name: 'test.clean'
    num_bytes: 319785733
  - name: 'test.other'
    num_bytes: 317438639
configs:
- config_name: 'clean'
  data_files:
  - split: 'test'
    path: 'clean/test/*.parquet'
  - split: 'train.100'
    path: 'clean/train.100/*.parquet'
  - split: 'train.360'
    path: 'clean/train.360/*.parquet'
  - split: 'validation'
    path: 'clean/validation/*.parquet'
- config_name: 'other'
  data_files:
  - split: 'test'
    path: 'other/test/*.parquet'
  - split: 'train.500'
    path: 'other/train.500/*.parquet'
  - split: 'validation'
    path: 'other/validation/*.parquet'
- config_name: 'all'
  default: true
  data_files:
  - split: 'test.clean'
    path: 'all/test.clean/*.parquet'
  - split: 'test.other'
    path: 'all/test.other/*.parquet'
  - split: 'train.clean.100'
    path: 'all/train.clean.100/*.parquet'
  - split: 'train.clean.360'
    path: 'all/train.clean.360/*.parquet'
  - split: 'train.other.500'
    path: 'all/train.other.500/*.parquet'
  - split: 'validation.clean'
    path: 'all/validation.clean/*.parquet'
  - split: 'validation.other'
    path: 'all/validation.other/*.parquet'
---
# Dataset Card for LibriSpeech-PC
## Dataset Description
- **Homepage:** https://www.openslr.org/145/
- **Source Audio:** https://www.openslr.org/12
- **Repository:** https://huggingface.co/datasets/openslr/librispeech_asr
- **Language:** English
- **License:** CC BY 4.0
### Dataset Summary
LibriSpeech-PC is a Parquet-backed merge of the `openslr/librispeech_asr` audio metadata with the SLR145 punctuation/capitalization manifests. It preserves the original LibriSpeech config/split layout (`clean`, `other`, `all`) and adds punctuation and casing targets.
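
To illustrate how a punctuated, cased target relates to a normalized transcript, here is a minimal sketch. It assumes the normalized form is uppercase with punctuation removed (the upstream LibriSpeech convention, not stated in this card). `rough_normalize` is a hypothetical helper, not part of the dataset or of any library, and it will not reproduce the real `text_normalized` column exactly (for example, it does not spell out digits).

```python
import re

# Anything that is not an uppercase letter, apostrophe, or space is treated
# as punctuation. This character set is an assumption made for illustration.
_NON_TRANSCRIPT_CHARS = re.compile(r"[^A-Z' ]+")


def rough_normalize(text: str) -> str:
    """Approximate a normalized transcript from a punctuated, cased one."""
    upper = text.upper()
    # Replace punctuation with spaces so hyphenated words split cleanly.
    cleaned = _NON_TRANSCRIPT_CHARS.sub(" ", upper)
    # Collapse runs of whitespace left behind by the substitution.
    return " ".join(cleaned.split())


def roughly_matches(text: str, text_normalized: str) -> bool:
    """Sanity check that two transcripts agree up to casing and punctuation."""
    return rough_normalize(text) == rough_normalize(text_normalized)
```

A check like `roughly_matches(row["text"], row["text_normalized"])` can flag rows where the two transcripts diverge by more than casing and punctuation; treat mismatches as candidates for inspection rather than errors.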
## Dataset Structure
### Data Fields
- `file`: path to the original LibriSpeech audio file.
- `audio`: `Audio` feature with a 16 kHz sampling rate.
- `text`: punctuated and cased transcript from LibriSpeech-PC manifests.
- `text_raw`: raw transcript from LibriSpeech-PC manifests.
- `text_normalized`: original normalized LibriSpeech ASR transcript.
- `speaker_id`: speaker identifier.
- `chapter_id`: chapter identifier.
- `id`: utterance identifier.
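
The identifier fields above can be cross-checked against each other. The sketch below assumes `id` follows the upstream LibriSpeech pattern `<speaker>-<chapter>-<utterance>` (for example `1272-128104-0000`); this card does not state the format, so verify it on real rows first. `UtteranceKey`, `parse_utterance_id`, and `ids_are_consistent` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class UtteranceKey(NamedTuple):
    speaker_id: int
    chapter_id: int
    utterance_index: int


def parse_utterance_id(utt_id: str) -> UtteranceKey:
    """Split an id such as '1272-128104-0000' into its numeric parts.

    Assumes the '<speaker>-<chapter>-<utterance>' layout described above.
    """
    parts = utt_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected utterance id format: {utt_id!r}")
    speaker, chapter, index = (int(part) for part in parts)
    return UtteranceKey(speaker, chapter, index)


def ids_are_consistent(row: dict) -> bool:
    """Check that `id` agrees with the row's `speaker_id` and `chapter_id`."""
    key = parse_utterance_id(row["id"])
    return key.speaker_id == row["speaker_id"] and key.chapter_id == row["chapter_id"]
```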
### Data Splits
Split names and configs mirror `openslr/librispeech_asr`. Some rows present in that dataset may be absent here because the SLR145 manifests drop samples during punctuation/capitalization restoration.
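
The split naming can be sketched as follows. The split lists are copied from the YAML header of this card. `all_split_name` and `load_split` are hypothetical helpers, and the repository id `mythicinfinity/librispeech-pc` is taken from the page title. `load_split` relies on the third-party `datasets` library and needs network access, so it is only defined here, not called.

```python
# Per-config split names, as listed in the card's YAML header.
CONFIG_SPLITS = {
    "clean": ["train.100", "train.360", "validation", "test"],
    "other": ["train.500", "validation", "test"],
}


def all_split_name(config: str, split: str) -> str:
    """Map a split of `clean` or `other` to its name in the `all` config."""
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config!r}/{split!r}")
    # 'train.100' -> ('train', '.', '100'); 'test' -> ('test', '', '')
    head, _, tail = split.partition(".")
    return f"{head}.{config}.{tail}" if tail else f"{head}.{config}"


def load_split(config: str, split: str, streaming: bool = True):
    """Load one split; streaming avoids downloading the full Parquet set."""
    from datasets import load_dataset  # third-party; imported lazily

    return load_dataset(
        "mythicinfinity/librispeech-pc", config, split=split, streaming=streaming
    )
```

For example, `load_split("clean", "validation")` and `load_split("all", all_split_name("clean", "validation"))` should refer to the same data, assuming the `all` config is a straight union of the other two.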
## Additional Information
### Citation Information
```bibtex
@article{meister2023librispeechpc,
title={LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models},
author={Meister, A. and Novikov, M. and Karpov, N. and Bakhturina, E. and Lavrukhin, V. and Ginsburg, B.},
journal={arXiv preprint arXiv:2310.02943},
year={2023}
}
@inproceedings{panayotov2015librispeech,
title={LibriSpeech: An ASR corpus based on public domain audio books},
author={Panayotov, V. and Chen, G. and Povey, D. and Khudanpur, S.},
booktitle={ICASSP},
year={2015},
doi={10.1109/ICASSP.2015.7178964}
}
```
### Source Links
- LibriSpeech-PC (SLR145): https://www.openslr.org/145/
- LibriSpeech (SLR12): https://www.openslr.org/12
- Hugging Face LibriSpeech ASR parquet source: https://huggingface.co/datasets/openslr/librispeech_asr
Provider:
mythicinfinity



