NRK-KIHUB/nrk-norwegian-speech-sample-v1

Name: NRK-KIHUB/nrk-norwegian-speech-sample-v1
Creator: NRK-KIHUB
Published: 2026-03-25 09:37:57
License: No description available

Hugging Face | Updated 2026-03-25 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/NRK-KIHUB/nrk-norwegian-speech-sample-v1


Description:

# NRK Norwegian Speech Dataset (Sample)

## Dataset Description

> **Note**: This is a sample dataset containing a subset of chunks for demonstration and preview purposes.
> The full dataset is available privately.

This dataset contains Norwegian speech data from NRK TV sports broadcasts, processed for automatic speech recognition (ASR) evaluation and research.

### Dataset Statistics

- **Total chunks**: 123
- **Episodes**: 41
- **Total duration**: 0.25 hours
- **Chunk types**: subtitle_aligned
- **Transcription sources**: Speechmatics, NRK Subtitles, Gemini
- **Sample rate**: 16,000 Hz
- **Language**: Norwegian (Bokmål)

### Splits

- **test**: 123 chunks

### Episodes

- DKMR98031126
- DKOV98031226
- DKRO98031226
- DKTL98031126
- DKTR98031226
- ISPO10501025
- ISPO40102125
- ISPO40201425
- MSPO30303825
- MSPO30651525
- MSPO30654225
- MSPO55380126
- MSPO55380226
- MSPO55380326
- MSPO55380426
- MSPO55380526
- MSPO55380626
- MSPO55380726
- MSPO55380826
- MSPO55380926
- MSPO55381026
- MSPO55381126
- MSPO55381226
- MSPO55381326
- MSPO55381426
- MSPO55381526
- MSPO55381626
- MSPO55381726
- MSPO55381826
- MSPO55381926
- MSPO55382026
- MSPO55382126
- MSPO55382226
- MSPO55382326
- MSPO55382426
- MSPO55382526
- MSPO55382626
- MUHU02001123
- NNFA19101525
- NNFA19101925
- NNFA51101425

## Dataset Structure

Each data point contains:

- `id`: Unique chunk identifier
- `audio`: Audio data (WAV format, 16 kHz)
- `text`: Original transcription text
- `text_normalized`: Normalized text (lowercase, standardized)
- `duration_seconds`: Audio duration
- `chunk_type`: Type of chunk (subtitle_aligned or fixed_duration)
- `episode_id`: Source episode identifier
- `program_name`: NRK program name
- `start_time`: Start time in source video (seconds)
- `end_time`: End time in source video (seconds)

## Source

- **Original videos**: NRK TV (https://tv.nrk.no)
- **Transcriptions**: Speechmatics, NRK Subtitles, Gemini
- **Processing**: Automated pipeline with yt-dlp, Speechmatics API, and custom chunking
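
The fields listed under Dataset Structure can be sketched as a small typed record, which makes basic sanity checks easy to write. This is only an illustration: the `Chunk` class, the helper names, and the example values (including the ID format, program name, and text) are made up, and the check that `duration_seconds` matches `end_time - start_time` is an assumption that the dataset card does not state.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Metadata fields from the dataset card (the audio payload is omitted)."""
    id: str
    text: str
    text_normalized: str
    duration_seconds: float
    chunk_type: str
    episode_id: str
    program_name: str
    start_time: float
    end_time: float


def timing_is_consistent(chunk: Chunk, tolerance: float = 0.05) -> bool:
    """Assumed check: duration should equal end_time - start_time within a tolerance."""
    span = chunk.end_time - chunk.start_time
    return abs(span - chunk.duration_seconds) <= tolerance


def total_hours(chunks: list[Chunk]) -> float:
    """Sum chunk durations and convert seconds to hours."""
    return sum(c.duration_seconds for c in chunks) / 3600.0


# Hypothetical record for illustration; none of these values come from the dataset.
example = Chunk(
    id="DKMR98031126_0001",            # assumed ID format
    text="Og der er han i mål!",       # made-up transcription
    text_normalized="og der er han i mål",
    duration_seconds=7.5,
    chunk_type="subtitle_aligned",
    episode_id="DKMR98031126",
    program_name="Example program",    # placeholder name
    start_time=12.0,
    end_time=19.5,
)

print(timing_is_consistent(example))
print(f"{total_hours([example]):.4f} hours")
```

Applied to all records of the `test` split, `total_hours` gives a quick way to compare against the total duration reported in the statistics.
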
## Usage

```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset("NRK-KIHUB/nrk-norwegian-speech-sample-v1")

# Access the first example in the test split
example = dataset["test"][0]
print(f"Text: {example['text']}")
print(f"Duration: {example['duration_seconds']:.2f}s")

# Play audio (in Jupyter/Colab)
from IPython.display import Audio
Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])
```

## License

**TBD** - Please verify NRK terms of use before distribution.

## Citation

If you use this dataset, please cite:

```
@misc{nrk_norwegian_speech_sample,
  title={NRK Norwegian Speech Dataset (Sample)},
  author={Your Organization},
  year={2026},
  url={https://huggingface.co/datasets/YOUR_ORG/nrk-norwegian-speech-sample}
}
```

## Contact

For questions or issues, please contact: [your.email@example.com]
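
Since the dataset is described as being prepared for ASR evaluation, a word error rate (WER) computation against `text_normalized` is one plausible way to use it. The dataset card does not name an evaluation metric, so the choice of WER is an assumption, and the function name and example strings below are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up strings; in practice the reference would be a chunk's text_normalized
# and the hypothesis the output of the ASR system under test, normalized the same way.
reference = "og der er han i mål"
hypothesis = "og der var han mål"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Scoring on the normalized text rather than the raw `text` field avoids counting casing differences as errors, provided the hypothesis is normalized in the same way.
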

Provider:

NRK-KIHUB
