ggfox00000/stt-voxpopuli-test-en

Name: ggfox00000/stt-voxpopuli-test-en
Creator: ggfox00000
Published: 2026-04-28 13:54:15
License: not listed (the dataset card metadata declares cc-by-4.0)

Source: Hugging Face. Updated: 2026-04-28. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/ggfox00000/stt-voxpopuli-test-en

Description:

---
license: cc-by-4.0
language:
- en
task_categories:
- automatic-speech-recognition
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: audio_id
    dtype: string
  - name: language
    dtype: int64
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: raw_text
    dtype: string
  - name: normalized_text
    dtype: string
  - name: gender
    dtype: string
  - name: speaker_id
    dtype: string
  - name: is_gold_transcript
    dtype: bool
  - name: accent
    dtype: string
  splits:
  - name: test
    num_examples: 1842
---

# VoxPopuli EN: test split

Byte-exact mirror of the `test` split of `facebook/voxpopuli`, config `en`.

- 1842 parliamentary utterances (European Parliament, 2009–2020)
- 16 kHz mono WAV, embedded in the dataset
- WER reference: `normalized_text` (preferred over `raw_text`)
- `is_gold_transcript` field: `True` means the transcription was validated by a human
- Source: Wang, Riviere et al. *VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation*, ACL 2021.

## Usage

```python
from datasets import load_dataset

# Load the single "test" split of the mirror.
ds = load_dataset("ggfox00000/stt-voxpopuli-test-en", split="test")

# Print the WER reference text and the audio sampling rate of the first utterance.
print(ds[0]["normalized_text"], ds[0]["audio"]["sampling_rate"])
```
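Since `normalized_text` is the recommended WER reference, a scoring step might look like the sketch below. It is a minimal, dependency-free illustration rather than the card author's evaluation code: the helper names `word_edit_distance` and `corpus_wer` are hypothetical, words are split on whitespace only, and no extra text normalization is applied to the hypotheses.

```python
from typing import Iterable, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences (hypothetical helper)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words.

    Each pair is (reference, hypothesis); splitting on whitespace is an
    assumption of this sketch.
    """
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        hyp = hypothesis.split()
        errors += word_edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words
```

With the dataset loaded, the pairs could be built by zipping `ds["normalized_text"]` with a list of model transcriptions in the same order. Restricting the rows to those where `is_gold_transcript` is `True` first would limit scoring to human-validated references.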

Provider:

ggfox00000
