gam30/Nepali-asr-train-val
Source: Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/gam30/Nepali-asr-train-val
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: duration
    dtype: float64
  splits:
  - name: train
    num_bytes: 1808020899
    num_examples: 15820
  - name: val
    num_bytes: 436296937
    num_examples: 3955
  download_size: 2238681396
  dataset_size: 2244317836
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: val
    path: data/val-*
---
# Nepali ASR Train and Validation Set (Clean & Noisy)
This dataset contains approximately **19.41 hours** of Nepali speech sourced from OpenSLR, divided into train and validation splits in an 80:20 ratio. To improve model robustness in real-world conditions, 40% of the samples in each split are augmented with background noise (crowd, traffic, construction, and wind); the remaining 60% are clean audio.
## Dataset Overview
| Property | Value |
|----------|-------|
| **Language** | Nepali |
| **Source** | OpenSLR |
| **Total Samples** | 19,775 |
| **Noise Type** | Synthetic environmental noise (crowd, traffic, construction, wind) |
| **Noise Coverage** | 40% augmented, 60% clean |
| **Audio Format** | WAV |
| **Sample Rate** | 16,000 Hz (16 kHz) |
## Dataset Features
Each sample in the dataset contains:
- **`audio`** (Audio): The audio waveform data
  - Sample Rate: **16,000 Hz** (16 kHz)
  - Channels: 1 (Mono)
- **`text`** (String): Full Nepali transcription of the speech
- **`duration`** (Float): Audio duration in seconds
## Noise Characteristics
40% of the audio samples contain synthetic environmental noise mixed with the clean Nepali speech:
- **Noise Sources**:
  - 🏢 Crowd noise (background conversations, ambient chatter)
  - 🚗 Traffic noise (vehicle engines, horns, road sounds)
  - 🏗️ Construction noise (machinery, tools, equipment)
  - 💨 Wind noise (outdoor wind, air movements)
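
The card does not say how the noise was mixed in or at what level, so the following is only a minimal sketch of one common approach: adding a noise clip to a speech clip at a chosen signal-to-noise ratio. The function name `mix_at_snr`, the 10 dB target, and the synthetic signals are illustrative assumptions, not the author's pipeline.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the dataset's tooling.
    """
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        return speech.copy()  # nothing to scale against

    # Solve speech_power / (gain**2 * noise_power) = 10 ** (snr_db / 10) for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


# Synthetic stand-ins: a 1-second 220 Hz tone as "speech", white noise as "noise".
sr = 16_000
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)
noise = np.random.default_rng(0).normal(size=sr // 2)
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values give a louder noise floor. The same function could be applied to clean clips from this dataset to build additional noisy evaluation conditions.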
## Loading the Dataset
### Using the Hugging Face `datasets` library
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("gam30/Nepali-asr-train-val")
# Access samples from the train split
sample = dataset['train'][0]
print(f"Transcription: {sample['text']}")
print(f"Duration: {sample['duration']}s")
# Iterate through the train split
for sample in dataset['train']:
    text = sample['text']
    duration = sample['duration']
    print(f"{duration}s - {text[:50]}...")
```
### With Audio Feature
```python
from datasets import load_dataset, Audio
dataset = load_dataset("gam30/Nepali-asr-train-val")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
# Access audio data
sample = dataset['train'][0]
print(f"Array shape: {sample['audio']['array'].shape}")
print(f"Sampling rate: {sample['audio']['sampling_rate']} Hz") # Will be 16000
```
### Working with Audio Files
```python
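import io

import librosa
from datasets import Audio, load_dataset

dataset = load_dataset("gam30/Nepali-asr-train-val")
# Keep the raw encoded audio (bytes and/or path) instead of decoded arrays
train = dataset['train'].cast_column("audio", Audio(decode=False))

# Process the first 5 samples; slicing with [:5] would return a dict of
# columns rather than a list of rows, so use select() instead
for sample in train.select(range(5)):
    audio = sample['audio']
    # Assumption: audio stored inside the Parquet shards arrives as bytes;
    # fall back to the file path when no bytes are present
    source = io.BytesIO(audio['bytes']) if audio['bytes'] else audio['path']
    y, sr = librosa.load(source, sr=None)
    print(f"File: {audio['path']}")
    print(f"Sampling rate: {sr} Hz")
    print(f"Duration: {len(y) / sr:.2f}s")
    print(f"Text: {sample['text'][:60]}...")
    print()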
```
## Use Cases
This dataset is suitable for:
1. **Robust ASR Model Training** - Training models on noisy speech
2. **Noise Robustness Testing** - Evaluating ASR systems under noisy conditions
3. **Domain Adaptation** - Fine-tuning pre-trained models on Nepali
4. **Speech Enhancement Research** - Testing denoising techniques
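
For the evaluation use cases, a common metric is word error rate (WER): the word-level edit distance between the reference transcription and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration that assumes whitespace tokenization is adequate for the `text` field. The function name `word_error_rate` is hypothetical, and an established package would normally be preferable in practice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
wer = word_error_rate("म घर जान्छु", "म घर जान्छ")
print(f"WER: {wer:.2f}")  # prints "WER: 0.33"
```

Averaging this score over the `val` split would give a single robustness figure for a model. Note that the listed features (`audio`, `text`, `duration`) include no clean/noisy flag, so separate scores for the clean and noisy subsets would require additional metadata.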
## Dataset Statistics
- **Total Samples**: 19,775
- **Total Audio Duration**: ~19.41 hours
- **Train Split**: 15,820 samples (~15.59 hours)
- **Validation Split**: 3,955 samples (~3.82 hours)
- **Sample Rate**: 16,000 Hz (16 kHz Mono) - standard for ASR tasks
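
As a quick consistency check on the figures above, the arithmetic can be reproduced directly. The sketch below uses only the numbers quoted in this card. The approximate noisy-sample count is derived from the stated 40% share and is an estimate, since the card gives no exact count.

```python
# Figures quoted in the statistics above.
train_samples, val_samples = 15_820, 3_955
train_hours, val_hours = 15.59, 3.82

total_samples = train_samples + val_samples
total_hours = train_hours + val_hours

train_fraction = train_samples / total_samples
mean_clip_seconds = total_hours * 3600 / total_samples

# Estimate only: 40% of all samples, per the stated noise coverage.
approx_noisy = round(0.40 * total_samples)

print(f"total samples:    {total_samples}")              # 19775
print(f"total hours:      {total_hours:.2f}")            # 19.41
print(f"train fraction:   {train_fraction:.2%}")         # 80.00%
print(f"mean clip length: {mean_clip_seconds:.2f} s")    # 3.53 s
print(f"approx. noisy:    {approx_noisy}")               # 7910
```

The split sizes match the 80:20 ratio exactly, and the average clip is roughly 3.5 seconds long.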
## Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{nepali_asr_noisy_2024,
  title={Nepali ASR Train and Validation noisy set},
  author={sangam},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/gam30/Nepali-asr-train-val}
}
```
## Quality Assurance
- ✓ All transcriptions in UTF-8 Unicode format
- ✓ Duration metadata computed and validated
- ✓ Audio verified at 16,000 Hz mono
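
Readers who want to re-run checks like these on their own copy could use something along the lines of the sketch below. It assumes a decoded sample shaped like the ones in the loading examples (an `audio` entry with `array` and `sampling_rate`, plus `text` and `duration`). The helper name `check_sample` and the 0.05-second tolerance are assumptions, not the author's validation procedure.

```python
import numpy as np


def check_sample(sample: dict, expected_sr: int = 16_000, tol_s: float = 0.05) -> list[str]:
    """Return a list of problems found in one sample (empty list means OK).

    Hypothetical helper; the tolerance is an arbitrary illustrative choice.
    """
    problems = []
    audio = sample["audio"]
    array = np.asarray(audio["array"])
    sr = audio["sampling_rate"]

    if sr != expected_sr:
        problems.append(f"sampling rate is {sr}, expected {expected_sr}")
    if array.ndim != 1:
        problems.append(f"expected mono (1-D) audio, got shape {array.shape}")

    # Compare the stored duration with the one implied by the waveform length.
    measured = array.shape[-1] / sr
    if abs(measured - sample["duration"]) > tol_s:
        problems.append(
            f"duration field {sample['duration']:.2f}s vs measured {measured:.2f}s"
        )

    if not sample["text"].strip():
        problems.append("empty transcription")
    return problems


# Synthetic example: 2 seconds of silence with matching metadata.
good = {
    "audio": {"array": np.zeros(32_000), "sampling_rate": 16_000},
    "text": "नमस्ते",
    "duration": 2.0,
}
print(check_sample(good))  # prints []
```

Applied across a split, any non-empty result points to a sample worth inspecting by hand.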
## Dataset Structure
```text
gam30/Nepali-asr-train-val
├── train/
│   ├── audio (Audio)
│   ├── text (String)
│   └── duration (Float)
└── val/
    ├── audio (Audio)
    ├── text (String)
    └── duration (Float)
```
## Support & Issues
For questions or issues with the dataset:
1. Check the Hugging Face community discussions
2. Open an issue on the dataset repository
---
**Dataset ID**: `gam30/Nepali-asr-train-val`
**Last Updated**: 2026
Provider:
gam30



