mythicinfinity/librispeech-pc

Name: mythicinfinity/librispeech-pc
Creator: mythicinfinity
Published: 2026-03-24 00:03:28
License: no description provided

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/mythicinfinity/librispeech-pc


Resource overview:

---
pretty_name: 'LibriSpeech-PC'
annotations_creators:
- machine-generated
language_creators:
- crowdsourced
- expert-generated
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
source_datasets:
- extended
task_categories:
- automatic-speech-recognition
dataset_info:
- config_name: 'clean'
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: text_raw
    dtype: string
  - name: text_normalized
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: 'train.100'
    num_bytes: 5419059024
  - name: 'train.360'
    num_bytes: 21129046690
  - name: 'validation'
    num_bytes: 311726621
  - name: 'test'
    num_bytes: 319785733
- config_name: 'other'
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: text_raw
    dtype: string
  - name: text_normalized
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: 'train.500'
    num_bytes: 27664027828
  - name: 'validation'
    num_bytes: 292740028
  - name: 'test'
    num_bytes: 317438639
- config_name: 'all'
  features:
  - name: file
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: text_raw
    dtype: string
  - name: text_normalized
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: chapter_id
    dtype: int64
  - name: id
    dtype: string
  splits:
  - name: 'train.clean.100'
    num_bytes: 5419059024
  - name: 'train.clean.360'
    num_bytes: 21129046690
  - name: 'train.other.500'
    num_bytes: 27664027828
  - name: 'validation.clean'
    num_bytes: 311726621
  - name: 'validation.other'
    num_bytes: 292740028
  - name: 'test.clean'
    num_bytes: 319785733
  - name: 'test.other'
    num_bytes: 317438639
configs:
- config_name: 'clean'
  data_files:
  - split: 'test'
    path: 'clean/test/*.parquet'
  - split: 'train.100'
    path: 'clean/train.100/*.parquet'
  - split: 'train.360'
    path: 'clean/train.360/*.parquet'
  - split: 'validation'
    path: 'clean/validation/*.parquet'
- config_name: 'other'
  data_files:
  - split: 'test'
    path: 'other/test/*.parquet'
  - split: 'train.500'
    path: 'other/train.500/*.parquet'
  - split: 'validation'
    path: 'other/validation/*.parquet'
- config_name: 'all'
  default: true
  data_files:
  - split: 'test.clean'
    path: 'all/test.clean/*.parquet'
  - split: 'test.other'
    path: 'all/test.other/*.parquet'
  - split: 'train.clean.100'
    path: 'all/train.clean.100/*.parquet'
  - split: 'train.clean.360'
    path: 'all/train.clean.360/*.parquet'
  - split: 'train.other.500'
    path: 'all/train.other.500/*.parquet'
  - split: 'validation.clean'
    path: 'all/validation.clean/*.parquet'
  - split: 'validation.other'
    path: 'all/validation.other/*.parquet'
---

# Dataset Card for LibriSpeech-PC

## Dataset Description

- **Homepage:** https://www.openslr.org/145/
- **Source Audio:** https://www.openslr.org/12
- **Repository:** https://huggingface.co/datasets/openslr/librispeech_asr
- **Language:** English
- **License:** CC BY 4.0

### Dataset Summary

LibriSpeech-PC is a parquet-backed merge of `openslr/librispeech_asr` audio metadata with SLR145 punctuation/capitalization manifests. It preserves the original LibriSpeech config/split layout (`clean`, `other`, `all`) and adds punctuation/casing targets.

## Dataset Structure

### Data Fields

- `file`: path to the original LibriSpeech audio file.
- `audio`: `Audio` feature with 16 kHz sampling rate.
- `text`: punctuated + cased transcript from LibriSpeech-PC manifests.
- `text_raw`: raw transcript from LibriSpeech-PC manifests.
- `text_normalized`: original normalized LibriSpeech ASR transcript.
- `speaker_id`: speaker identifier.
- `chapter_id`: chapter identifier.
- `id`: utterance identifier.

### Data Splits

Split names and configs mirror `openslr/librispeech_asr`. Some rows may be absent because the SLR145 manifests drop samples during punctuation/capitalization restoration.
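
The config and split layout described above can be sketched in Python. This is a minimal illustration, not an official loader: the split names and path patterns are copied from the card's YAML front matter, the repository id is taken from this page's dataset name, and the helper names `parquet_glob` and `load_split` are hypothetical. `load_split` assumes the Hugging Face `datasets` library is installed and that network access is available.

```python
"""Sketch: addressing LibriSpeech-PC configs and splits as listed in the card."""

# Repository id as shown on this page (assumed to be loadable from the Hub).
REPO_ID = "mythicinfinity/librispeech-pc"

# Config -> split names, copied from the card's YAML front matter.
CONFIG_SPLITS = {
    "clean": ["train.100", "train.360", "validation", "test"],
    "other": ["train.500", "validation", "test"],
    "all": [
        "train.clean.100",
        "train.clean.360",
        "train.other.500",
        "validation.clean",
        "validation.other",
        "test.clean",
        "test.other",
    ],
}


def parquet_glob(config: str, split: str) -> str:
    """Return the parquet path pattern the card declares for a config/split pair.

    Hypothetical helper: raises ValueError for names the card does not list.
    """
    if split not in CONFIG_SPLITS.get(config, []):
        raise ValueError(f"unknown config/split: {config!r}/{split!r}")
    return f"{config}/{split}/*.parquet"


def load_split(config: str, split: str, streaming: bool = True):
    """Load one split with the `datasets` library (hypothetical wrapper).

    Streaming is the default here because the training splits are tens of
    gigabytes according to the card's `num_bytes` values.
    """
    parquet_glob(config, split)  # validate the names before touching the network
    from datasets import load_dataset  # imported lazily; assumed to be installed

    return load_dataset(REPO_ID, config, split=split, streaming=streaming)
```

For example, `load_split("clean", "validation")` should yield rows whose `text` field holds the punctuated and cased transcript and whose `text_normalized` field holds the original LibriSpeech transcript. Because some rows may be absent, joining against `openslr/librispeech_asr` is safer by the `id` field than by row position.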
## Additional Information

### Citation Information

```bibtex
@article{meister2023librispeechpc,
  title={LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of end-to-end ASR Models},
  author={Meister, A. and Novikov, M. and Karpov, N. and Bakhturina, E. and Lavrukhin, V. and Ginsburg, B.},
  journal={arXiv preprint arXiv:2310.02943},
  year={2023}
}

@inproceedings{panayotov2015librispeech,
  title={LibriSpeech: An ASR corpus based on public domain audio books},
  author={Panayotov, V. and Chen, G. and Povey, D. and Khudanpur, S.},
  booktitle={ICASSP},
  year={2015},
  doi={10.1109/ICASSP.2015.7178964}
}
```

### Source Links

- LibriSpeech-PC (SLR145): https://www.openslr.org/145/
- LibriSpeech (SLR12): https://www.openslr.org/12
- Hugging Face LibriSpeech ASR parquet source: https://huggingface.co/datasets/openslr/librispeech_asr

Provider:

mythicinfinity
