NRK-KIHUB/nrk-norwegian-speech-sample-v1
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/NRK-KIHUB/nrk-norwegian-speech-sample-v1
Resource description:
# NRK Norwegian Speech Dataset (Sample)
## Dataset Description
> **Note**: This is a sample dataset containing a subset of chunks for demonstration and preview purposes.
> The full dataset is available privately.
This dataset contains Norwegian speech data from NRK TV sports broadcasts, processed for automatic speech recognition (ASR) evaluation and research.
### Dataset Statistics
- **Total chunks**: 123
- **Episodes**: 41
- **Total duration**: 0.25 hours
- **Chunk types**: subtitle_aligned
- **Transcription sources**: Speechmatics, NRK Subtitles, Gemini
- **Sample rate**: 16,000 Hz
- **Language**: Norwegian (Bokmål)
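As a quick sanity check on the figures above, the following sketch derives the average chunk length and the average number of chunks per episode. The variable names are illustrative, and the inputs are the rounded values from the list, so the results are approximate.

```python
# Values copied from the statistics list above.
# The total duration is a rounded figure (0.25 hours), so derived values are approximate.
total_chunks = 123
episodes = 41
total_hours = 0.25
sample_rate_hz = 16_000

total_seconds = total_hours * 3600
avg_chunk_seconds = total_seconds / total_chunks
chunks_per_episode = total_chunks / episodes
total_samples = int(total_seconds * sample_rate_hz)

print(f"Total audio: {total_seconds:.0f} s")
print(f"Average chunk: {avg_chunk_seconds:.2f} s")
print(f"Chunks per episode: {chunks_per_episode:.1f}")
print(f"Samples at 16 kHz: {total_samples:,}")
```

With these inputs the sample works out to roughly 7 seconds per chunk and 3 chunks per episode on average.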
### Splits
- **test**: 123 chunks
### Episodes
- DKMR98031126
- DKOV98031226
- DKRO98031226
- DKTL98031126
- DKTR98031226
- ISPO10501025
- ISPO40102125
- ISPO40201425
- MSPO30303825
- MSPO30651525
- MSPO30654225
- MSPO55380126
- MSPO55380226
- MSPO55380326
- MSPO55380426
- MSPO55380526
- MSPO55380626
- MSPO55380726
- MSPO55380826
- MSPO55380926
- MSPO55381026
- MSPO55381126
- MSPO55381226
- MSPO55381326
- MSPO55381426
- MSPO55381526
- MSPO55381626
- MSPO55381726
- MSPO55381826
- MSPO55381926
- MSPO55382026
- MSPO55382126
- MSPO55382226
- MSPO55382326
- MSPO55382426
- MSPO55382526
- MSPO55382626
- MUHU02001123
- NNFA19101525
- NNFA19101925
- NNFA51101425
## Dataset Structure
Each data point contains:
- `id`: Unique chunk identifier
- `audio`: Audio data (WAV format, 16 kHz)
- `text`: Original transcription text
- `text_normalized`: Normalized text (lowercase, standardized)
- `duration_seconds`: Audio duration
- `chunk_type`: Type of chunk (`subtitle_aligned` or `fixed_duration`; this sample contains only `subtitle_aligned`)
- `episode_id`: Source episode identifier
- `program_name`: NRK program name
- `start_time`: Start time in source video (seconds)
- `end_time`: End time in source video (seconds)
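The field list above can be turned into a lightweight record check. This is a minimal sketch, not part of the dataset tooling: the Python types, the `check_record` helper, the tolerance, and the sample record are all assumptions made for illustration, as is the expectation that `duration_seconds` equals `end_time - start_time`.

```python
from typing import Any

# Field names come from the list above; the Python types are assumptions.
EXPECTED_FIELDS = {
    "id": str,
    "text": str,
    "text_normalized": str,
    "duration_seconds": (int, float),
    "chunk_type": str,
    "episode_id": str,
    "program_name": str,
    "start_time": (int, float),
    "end_time": (int, float),
}
ALLOWED_CHUNK_TYPES = {"subtitle_aligned", "fixed_duration"}


def check_record(record: dict[str, Any], tolerance: float = 0.05) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"unexpected type for {name}")
    if "audio" not in record:
        problems.append("missing field: audio")
    if problems:
        return problems  # skip value checks when the shape is already wrong

    if record["chunk_type"] not in ALLOWED_CHUNK_TYPES:
        problems.append(f"unknown chunk_type: {record['chunk_type']}")
    span = record["end_time"] - record["start_time"]
    if span <= 0:
        problems.append("end_time must be greater than start_time")
    elif abs(span - record["duration_seconds"]) > tolerance:
        # Assumption: duration is expected to match the source time span.
        problems.append("duration_seconds does not match end_time - start_time")
    return problems


# Hypothetical record; the values are invented and only the keys follow the card.
sample = {
    "id": "example-0001",
    "audio": {"array": [0.0] * 16_000, "sampling_rate": 16_000},
    "text": "Eksempel.",
    "text_normalized": "eksempel",
    "duration_seconds": 1.0,
    "chunk_type": "subtitle_aligned",
    "episode_id": "MSPO55380126",
    "program_name": "Example program",
    "start_time": 12.0,
    "end_time": 13.0,
}
print(check_record(sample))
```

A check like this can be run over the `test` split before evaluation to catch records with missing fields or inconsistent timings.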
## Source
- **Original videos**: NRK TV (https://tv.nrk.no)
- **Transcriptions**: Speechmatics, NRK Subtitles, Gemini
- **Processing**: Automated pipeline with yt-dlp, Speechmatics API, and custom chunking
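The card does not describe the custom chunking step in detail. As a rough illustration only, the sketch below shows how a chunk's `start_time` and `end_time` could map to sample indices at 16 kHz. The `slice_chunk` function is hypothetical and is not the pipeline's actual implementation.

```python
SAMPLE_RATE_HZ = 16_000  # matches the sample rate stated for this dataset


def slice_chunk(samples, start_time, end_time, sample_rate=SAMPLE_RATE_HZ):
    """Return the samples between start_time and end_time (both in seconds)."""
    if end_time <= start_time:
        raise ValueError("end_time must be greater than start_time")
    # Convert seconds to sample indices and clamp to the available audio.
    start_index = max(0, round(start_time * sample_rate))
    end_index = min(len(samples), round(end_time * sample_rate))
    return samples[start_index:end_index]


# Two seconds of silence standing in for decoded episode audio.
episode_audio = [0.0] * (2 * SAMPLE_RATE_HZ)
chunk = slice_chunk(episode_audio, start_time=0.5, end_time=1.25)
print(len(chunk), len(chunk) / SAMPLE_RATE_HZ)
```

The same index arithmetic applies to a NumPy array, since slicing works the same way on it.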
## Usage
```python
from datasets import load_dataset
# Load dataset
dataset = load_dataset("NRK-KIHUB/nrk-norwegian-speech-sample-v1")
# Access first example
example = dataset["test"][0]
print(f"Text: {example['text']}")
print(f"Duration: {example['duration_seconds']:.2f}s")
# Play audio (in Jupyter/Colab)
from IPython.display import Audio
Audio(example['audio']['array'], rate=example['audio']['sampling_rate'])
```
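Since the dataset is intended for ASR evaluation, a typical next step is to score a model's output against the reference text with word error rate (WER). The sketch below is a plain-Python illustration. The `normalize` rules (lowercase, punctuation removed, whitespace collapsed) are an assumption and may differ from how `text_normalized` was produced, and the two sentences are invented examples rather than dataset content.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so æ, ø, å are kept
    return " ".join(text.split())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# Invented sentences for illustration; not taken from the dataset.
reference = "Han går i mål, og det er gull!"
hypothesis = "han går i mål og det blir gull"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In practice, `reference` would come from `example['text']` or `example['text_normalized']`, and `hypothesis` from the ASR system under test. For larger evaluations, an established package is likely a better choice than this hand-rolled version.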
## License
**TBD** - Please verify NRK terms of use before distribution.
## Citation
If you use this dataset, please cite:
```
@misc{nrk_norwegian_speech_sample,
title={NRK Norwegian Speech Dataset (Sample)},
author={Your Organization},
year={2026},
url={https://huggingface.co/datasets/NRK-KIHUB/nrk-norwegian-speech-sample-v1}
}
```
## Contact
For questions or issues, please contact: [your.email@example.com]
Provider:
NRK-KIHUB



