
NRK-KIHUB/nrk-norwegian-speech-sample-v1

Source: Hugging Face. Updated 2026-03-25. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NRK-KIHUB/nrk-norwegian-speech-sample-v1
Description:
# NRK Norwegian Speech Dataset (Sample)

## Dataset Description

> **Note**: This is a sample dataset containing a subset of chunks for demonstration and preview purposes.
> The full dataset is available privately.

This dataset contains Norwegian speech data from NRK TV sports broadcasts, processed for automatic speech recognition (ASR) evaluation and research.

### Dataset Statistics

- **Total chunks**: 123
- **Episodes**: 41
- **Total duration**: 0.25 hours
- **Chunk types**: subtitle_aligned
- **Transcription sources**: Speechmatics, NRK Subtitles, Gemini
- **Sample rate**: 16,000 Hz
- **Language**: Norwegian (Bokmål)

### Splits

- **test**: 123 chunks

### Episodes

- DKMR98031126
- DKOV98031226
- DKRO98031226
- DKTL98031126
- DKTR98031226
- ISPO10501025
- ISPO40102125
- ISPO40201425
- MSPO30303825
- MSPO30651525
- MSPO30654225
- MSPO55380126
- MSPO55380226
- MSPO55380326
- MSPO55380426
- MSPO55380526
- MSPO55380626
- MSPO55380726
- MSPO55380826
- MSPO55380926
- MSPO55381026
- MSPO55381126
- MSPO55381226
- MSPO55381326
- MSPO55381426
- MSPO55381526
- MSPO55381626
- MSPO55381726
- MSPO55381826
- MSPO55381926
- MSPO55382026
- MSPO55382126
- MSPO55382226
- MSPO55382326
- MSPO55382426
- MSPO55382526
- MSPO55382626
- MUHU02001123
- NNFA19101525
- NNFA19101925
- NNFA51101425

## Dataset Structure

Each data point contains:

- `id`: Unique chunk identifier
- `audio`: Audio data (WAV format, 16 kHz)
- `text`: Original transcription text
- `text_normalized`: Normalized text (lowercase, standardized)
- `duration_seconds`: Audio duration
- `chunk_type`: Type of chunk (subtitle_aligned or fixed_duration)
- `episode_id`: Source episode identifier
- `program_name`: NRK program name
- `start_time`: Start time in source video (seconds)
- `end_time`: End time in source video (seconds)

## Source

- **Original videos**: NRK TV (https://tv.nrk.no)
- **Transcriptions**: Speechmatics, NRK Subtitles, Gemini
- **Processing**: Automated pipeline with yt-dlp, Speechmatics API, and custom chunking
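Since the card describes the data as prepared for ASR evaluation and lists a `text_normalized` field, the following is a minimal sketch of how a transcript might be normalized and scored with word error rate (WER). The card does not define what "standardized" means, so the rules in `normalize` (lowercasing, punctuation removal, whitespace collapsing) are assumptions, and `normalize`, `word_error_rate`, and the example record are hypothetical names and data, not part of the dataset or its pipeline.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware in Python 3, so æ, ø and å are kept as letters.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical record shaped like the fields above (not taken from the dataset).
record = {"text": "Han går først i mål!", "duration_seconds": 2.4}
reference = normalize(record["text"])
hypothesis = normalize("han går i mål")  # stand-in for an ASR system's output
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")
```

In practice the dataset's own `text_normalized` field would likely serve as the reference, with the same normalization applied to the system output so that both sides are compared on equal terms.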
## Usage

```python
from datasets import load_dataset

# Load the dataset (repository ID as listed on this page)
dataset = load_dataset("NRK-KIHUB/nrk-norwegian-speech-sample-v1")

# Access the first example of the only split, "test"
example = dataset["test"][0]
print(f"Text: {example['text']}")
print(f"Duration: {example['duration_seconds']:.2f}s")

# Play audio (in Jupyter/Colab)
from IPython.display import Audio
Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])
```

## License

**TBD** - Please verify NRK terms of use before distribution.

## Citation

If you use this dataset, please cite:

```
@misc{nrk_norwegian_speech_sample,
  title={NRK Norwegian Speech Dataset (Sample)},
  author={Your Organization},
  year={2026},
  url={https://huggingface.co/datasets/YOUR_ORG/nrk-norwegian-speech-sample}
}
```

## Contact

For questions or issues, please contact: [your.email@example.com]
Provider:
NRK-KIHUB