humair025/munch_urdu_preview
Source: Hugging Face; updated 2025-12-06; indexed 2025-12-20
Download link: https://hf-mirror.com/datasets/humair025/munch_urdu_preview
Resource description:
---
license: cc-by-4.0
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ur
tags:
- Urdu
- TTS
- ASR
- speech-synthesis
- preview
- audio
size_categories:
- 10K<n<100K
pretty_name: Munch Preview
---
# 🎧 Munch Preview Dataset
[Munch v1](https://huggingface.co/datasets/humair025/Munch)
[Munch-1 v2](https://huggingface.co/datasets/humair025/munch-1)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
## 📖 Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Dataset Creation](#dataset-creation)
- [Usage](#usage)
- [Considerations](#considerations)
- [Citation](#citation)
- [Contact](#contact)
---
## 📋 Dataset Description
### Overview
**Munch Preview** is a curated preview dataset of **high-quality Urdu text-to-speech samples** drawn from both versions of the Munch dataset family. This lightweight version lets researchers, developers, and practitioners explore and prototype with Urdu TTS data without downloading the full multi-terabyte datasets.
### Purpose
This preview dataset serves multiple purposes:
- **🎯 Quick Exploration**: Rapidly assess the quality and characteristics of Munch datasets
- **🧪 Prototyping**: Test TTS/ASR pipelines before committing to full dataset downloads
- **📚 Educational**: Learn about Urdu speech synthesis with manageable data sizes
- **🔬 Algorithm Development**: Develop and validate algorithms on representative samples
- **📊 Comparative Analysis**: Compare v1 and v2 dataset characteristics side-by-side
### Key Features
✅ **Two Dataset Versions**: Samples from both Munch v1 and Munch-1 v2
✅ **Balanced Sampling**: Stratified random sampling across all 13 voices
✅ **WAV Format**: High-quality audio playable directly in HuggingFace viewer
✅ **Fast Download**: ~4.36 GB vs 4+ TB for full datasets
✅ **Production Quality**: Same preprocessing and quality as full datasets
✅ **Metadata Rich**: Complete transcripts, timestamps, and error tracking
### Languages
- **Primary**: Urdu (ur)
- **Script**: Arabic script (Nastaliq)
---
## 📊 Dataset Structure
### Data Instances
Each sample in the dataset contains:
```python
{
    'id': 123456,
    'text': 'یہ ایک نمونہ متن ہے',
    'transcript': 'یہ ایک نمونہ متن ہے',
    'voice': 'ash',
    'audio': {
        'array': array([...]),  # Audio waveform
        'path': None,
        'sampling_rate': 22050
    },
    'timestamp': '2025-12-01T10:30:45',
    'error': None,
    'original_parquet': 'train-00123.parquet',
    'dataset_version': 'munch_v1'
}
```
### Data Splits
| Split | Samples | Size | Source | Description |
|-------|---------|------|--------|-------------|
| **munch_v1** | ~4.9k | ~1.68 GB | [Munch v1](https://huggingface.co/datasets/humair025/Munch) | Samples from original Munch dataset (4.17M total) |
| **munch_v2** | ~4k | ~2.68 GB | [Munch-1 v2](https://huggingface.co/datasets/humair025/munch-1) | Samples from improved Munch-1 dataset (3.86M total) |
| **Total** | ~8k | ~4.36 GB | - | Combined preview dataset |
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `id` | `int` | Unique paragraph identifier from source dataset |
| `text` | `string` | Original Urdu text (input to TTS system) |
| `transcript` | `string` | Transcription of generated audio (may differ from input) |
| `voice` | `string` | Voice identifier (13 options per dataset) |
| `audio` | `Audio` | Audio data with waveform array and metadata |
| `timestamp` | `string` | ISO 8601 timestamp of audio generation |
| `error` | `string` | Error message if generation failed (usually None) |
| `original_parquet` | `string` | Source parquet file from full dataset |
| `dataset_version` | `string` | Version identifier: "munch_v1" or "munch_v2" |
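
As a minimal sketch of how the fields above might be checked before a record enters a pipeline, the helper below validates one sample against the documented field names and basic types. `validate_sample`, `EXPECTED_FIELDS`, and `VALID_VERSIONS` are hypothetical names for illustration and are not part of the dataset or the `datasets` library. The `audio` field is only checked for presence, because its concrete Python type can vary with the installed `datasets` version. The example record is synthetic.

```python
# Hypothetical helper: checks one sample against the documented fields.
EXPECTED_FIELDS = {
    "id": int,
    "text": str,
    "transcript": str,
    "voice": str,
    "audio": object,  # presence only; the concrete type is not assumed
    "timestamp": str,
    "error": (str, type(None)),
    "original_parquet": str,
    "dataset_version": str,
}
VALID_VERSIONS = {"munch_v1", "munch_v2"}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if none)."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in sample:
            problems.append(f"missing field: {field}")
        elif not isinstance(sample[field], expected_type):
            problems.append(f"unexpected type for field: {field}")
    if sample.get("dataset_version") not in VALID_VERSIONS:
        problems.append("dataset_version is not 'munch_v1' or 'munch_v2'")
    return problems


# Synthetic record shaped like the Data Instances example
example = {
    "id": 123456,
    "text": "یہ ایک نمونہ متن ہے",
    "transcript": "یہ ایک نمونہ متن ہے",
    "voice": "ash",
    "audio": {"array": [0.0, 0.1], "path": None, "sampling_rate": 22050},
    "timestamp": "2025-12-01T10:30:45",
    "error": None,
    "original_parquet": "train-00123.parquet",
    "dataset_version": "munch_v1",
}
print(validate_sample(example))  # []
```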
### Audio Specifications
- **Format**: WAV (Waveform Audio File Format)
- **Sample Rate**: 22,050 Hz
- **Channels**: Mono (1 channel)
- **Bit Depth**: 16-bit signed integer PCM
- **Encoding**: Linear PCM
- **Average Duration**: X seconds per sample
- **Total Duration**: ~X hours (combined)
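
The format parameters above fix the storage cost of uncompressed audio, which can help when budgeting disk space. The sketch below works through that arithmetic for linear PCM; `pcm_bytes` is a hypothetical helper name, and the result excludes the small WAV header.

```python
def pcm_bytes(duration_s, sample_rate=22050, channels=1, bit_depth=16):
    """Size in bytes of uncompressed linear PCM audio (WAV header excluded)."""
    bytes_per_sample = bit_depth // 8
    return int(duration_s * sample_rate * channels * bytes_per_sample)


# One second of 22,050 Hz mono 16-bit audio
print(pcm_bytes(1))          # 44100
# One minute, expressed in megabytes
print(pcm_bytes(60) / 1e6)   # 2.646
```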
### Voice Distribution
Each split contains samples from **13 different voices** with approximately balanced distribution (~770 samples per voice):
**Available Voices**:
- `alloy` - Neutral, clear voice
- `echo` - Resonant, deep voice
- `fable` - Storytelling voice
- `onyx` - Strong, authoritative voice
- `nova` - Bright, energetic voice
- `shimmer` - Soft, gentle voice
- `coral` - Warm, friendly voice
- `verse` - Poetic, expressive voice
- `ballad` - Melodic, smooth voice
- `ash` - Natural, conversational voice
- `sage` - Wise, measured voice
- `amuch` - Custom voice variant
- `dan` - Custom voice variant
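
To illustrate what stratified random sampling by voice means, here is a minimal sketch that draws the same number of records from each voice group. This is not the script used to build the preview; `stratified_sample`, the `per_voice` parameter, and the synthetic records are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_voice, seed=42):
    """Draw up to `per_voice` records at random from each voice group."""
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    by_voice = defaultdict(list)
    for record in records:
        by_voice[record["voice"]].append(record)
    sampled = []
    for voice in sorted(by_voice):  # fixed order for determinism
        group = by_voice[voice]
        sampled.extend(rng.sample(group, min(per_voice, len(group))))
    return sampled


# Synthetic metadata: 10 records for each of three voices
records = [
    {"id": i, "voice": voice}
    for i, voice in enumerate(["ash", "nova", "sage"] * 10)
]
subset = stratified_sample(records, per_voice=4)
print(len(subset))  # 12
```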
---
## 🔨 Dataset Creation
### Source Data
#### Initial Data Collection
This preview dataset is derived from two large-scale Urdu TTS datasets:
1. **Munch v1** (humair025/Munch)
   - Total Size: 1.27 TB
   - Total Samples: 4,167,500
   - Collection Period: 2025
2. **Munch-1 v2** (humair025/munch-1)
   - Total Size: 3.28 TB
   - Total Samples: 3,856,500
   - Collection Period: 2025
#### Data Pipeline
The source datasets were created using:
- Urdu text corpus from various domains (literature, news, social media, technical)
- State-of-the-art neural TTS synthesis
- Multiple voice profiles for diversity
- Quality validation and error tracking
### Annotations
#### Annotation Process
- **Automatic**: All audio was generated using text-to-speech systems
- **Transcription**: Generated transcripts may differ slightly from input text due to TTS normalization
- **Error Tracking**: Samples with generation errors are flagged in the `error` field
- **No Human Annotation**: This is a synthetic dataset with automatic metadata
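
Because transcripts may drift from the input text and failed generations are flagged in `error`, a quick triage pass can be useful. The sketch below is one possible approach, not a method from the dataset authors: `review_sample` is a hypothetical helper, the 0.9 similarity threshold is an arbitrary assumption, and the samples are synthetic. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def review_sample(sample, min_similarity=0.9):
    """Classify a sample as 'error', 'mismatch', or 'ok'."""
    if sample.get("error"):
        return "error"
    similarity = SequenceMatcher(
        None, sample["text"], sample["transcript"]
    ).ratio()
    return "ok" if similarity >= min_similarity else "mismatch"


# Synthetic samples covering the three outcomes
samples = [
    {"text": "یہ ایک نمونہ متن ہے", "transcript": "یہ ایک نمونہ متن ہے", "error": None},
    {"text": "یہ ایک نمونہ متن ہے", "transcript": "hello", "error": None},
    {"text": "یہ ایک نمونہ متن ہے", "transcript": "", "error": "timeout"},
]
results = [review_sample(s) for s in samples]
print(results)  # ['ok', 'mismatch', 'error']
```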
### Personal and Sensitive Information
- **No Personal Information**: All text and audio are synthetically generated
- **No Speaker Identification**: Voices are synthetic and do not correspond to real individuals
- **No Biometric Data**: Audio is generated, not recorded from human speakers
---
## 💻 Usage
### Loading the Dataset
#### Basic Loading
```python
from datasets import load_dataset
# Load complete dataset (both splits)
dataset = load_dataset("humair025/munch_preview")
print(f"Munch v1: {len(dataset['munch_v1']):,} samples")
print(f"Munch v2: {len(dataset['munch_v2']):,} samples")
```
#### Load Specific Split
```python
# Load only v1
v1_dataset = load_dataset("humair025/munch_preview", split="munch_v1")
# Load only v2
v2_dataset = load_dataset("humair025/munch_preview", split="munch_v2")
```
#### Streaming Mode
```python
# For even lower memory usage
dataset = load_dataset("humair025/munch_preview", streaming=True)
for sample in dataset['munch_v1']:
    print(sample['text'])
    break
```
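
When streaming, it is often convenient to look at only the first few samples. A minimal sketch using `itertools.islice`, which works on any iterable, is shown below. To stay runnable offline it uses `fake_stream`, a hypothetical stand-in generator rather than the real streaming split, and `head` is an illustrative helper name.

```python
from itertools import islice


def fake_stream():
    """Stand-in for a streaming split: yields sample-like dicts lazily."""
    i = 0
    while True:
        yield {"id": i, "text": f"sample {i}", "voice": "ash"}
        i += 1


def head(stream, n):
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(stream, n))


first_three = head(fake_stream(), 3)
print([s["id"] for s in first_three])  # [0, 1, 2]
```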
### Basic Usage Examples
#### 1. Audio Playback (Jupyter/Colab)
```python
import IPython.display as ipd
# Play first sample from v1
sample = dataset['munch_v1'][0]
print(f"Text: {sample['text']}")
print(f"Voice: {sample['voice']}")
ipd.display(ipd.Audio(
    sample['audio']['array'],
    rate=sample['audio']['sampling_rate']
))
```
#### 2. Export to WAV Files
```python
import os
import soundfile as sf
# Export first 10 samples
os.makedirs("audio_samples", exist_ok=True)
# Slicing a Dataset with [:10] returns a dict of columns, not samples,
# so select() is used to iterate over the first 10 rows.
for i, sample in enumerate(dataset['munch_v1'].select(range(10))):
    filename = f"audio_samples/v1_sample_{i:03d}_{sample['voice']}.wav"
    sf.write(
        filename,
        sample['audio']['array'],
        sample['audio']['sampling_rate']
    )
    print(f"Saved: {filename}")
```
#### 3. Filter by Voice
```python
# Get all samples from a specific voice.
# input_columns passes only the 'voice' value to the predicate.
ash_samples = dataset['munch_v1'].filter(
    lambda voice: voice == 'ash',
    input_columns='voice'
)
print(f"Found {len(ash_samples)} samples with 'ash' voice")
```
#### 4. Analyze Text Statistics
```python
import pandas as pd
# Convert to DataFrames for analysis; drop the audio column first
# so the waveforms are not copied into pandas
df_v1 = dataset['munch_v1'].remove_columns('audio').to_pandas()
df_v2 = dataset['munch_v2'].remove_columns('audio').to_pandas()
print("Text Length Statistics (v1):")
print(df_v1['text'].str.len().describe())
print("\nText Length Statistics (v2):")
print(df_v2['text'].str.len().describe())
```
#### 5. Audio Duration Analysis
```python
import numpy as np
# Calculate durations in seconds (number of samples / sampling rate)
v1_durations = [
    len(sample['audio']['array']) / sample['audio']['sampling_rate']
    for sample in dataset['munch_v1']
]
v2_durations = [
    len(sample['audio']['array']) / sample['audio']['sampling_rate']
    for sample in dataset['munch_v2']
]
print(f"V1 average duration: {np.mean(v1_durations):.2f}s")
print(f"V2 average duration: {np.mean(v2_durations):.2f}s")
print(f"V1 total duration: {sum(v1_durations)/3600:.2f} hours")
print(f"V2 total duration: {sum(v2_durations)/3600:.2f} hours")
```
### Advanced Usage
#### 1. Compare Dataset Versions
```python
from collections import Counter
# Voice distribution comparison (reads only the 'voice' column)
v1_voices = Counter(dataset['munch_v1']['voice'])
v2_voices = Counter(dataset['munch_v2']['voice'])
print("Voice Distribution Comparison:")
print(f"{'Voice':<10} {'V1 Count':<10} {'V2 Count':<10}")
print("-" * 30)
# Include voices that appear in either split; Counter returns 0 if absent
for voice in sorted(set(v1_voices) | set(v2_voices)):
    print(f"{voice:<10} {v1_voices[voice]:<10} {v2_voices[voice]:<10}")
```
#### 2. Train/Validation Split
```python
from datasets import DatasetDict
# Split v1 into train/validation (80/20)
v1_split = dataset['munch_v1'].train_test_split(test_size=0.2, seed=42)
train_val_dataset = DatasetDict({
    'train': v1_split['train'],
    'validation': v1_split['test']
})
print(f"Train: {len(train_val_dataset['train'])} samples")
print(f"Validation: {len(train_val_dataset['validation'])} samples")
```
#### 3. Combine Both Versions
```python
from datasets import concatenate_datasets
# Combine v1 and v2 for training
combined = concatenate_datasets([
    dataset['munch_v1'],
    dataset['munch_v2']
])
print(f"Combined dataset: {len(combined):,} samples")
# Shuffle for training
combined_shuffled = combined.shuffle(seed=42)
```
#### 4. Feature Extraction
```python
import librosa
import numpy as np
# Extract MFCC features from audio
def extract_features(sample):
    audio_array = np.asarray(sample['audio']['array'], dtype=float)
    sr = sample['audio']['sampling_rate']
    # Extract 13 MFCCs per frame; the result has shape (13, n_frames)
    mfccs = librosa.feature.mfcc(
        y=audio_array,
        sr=sr,
        n_mfcc=13
    )
    # Summarize each coefficient over time
    return {
        'mfcc_mean': mfccs.mean(axis=1),
        'mfcc_std': mfccs.std(axis=1)
    }
# Apply to first 100 samples; select() yields rows, whereas [:100] yields columns
features = [extract_features(s) for s in dataset['munch_v1'].select(range(100))]
```
#### 5. Create Evaluation Set
```python
# Create a balanced evaluation set: up to 50 samples for each of five voices
eval_samples = []
for voice in ['alloy', 'echo', 'fable', 'onyx', 'nova']:
    voice_samples = [
        s for s in dataset['munch_v1']
        if s['voice'] == voice
    ][:50]
    eval_samples.extend(voice_samples)
print(f"Evaluation set: {len(eval_samples)} samples")
```
---
## 🔍 Considerations for Using the Data
### Social Impact
#### Positive Impacts
- **Language Preservation**: Supports Urdu language technology development
- **Accessibility**: Enables text-to-speech applications for Urdu speakers
- **Research Enablement**: Provides researchers with quality Urdu audio data
- **Educational**: Facilitates Urdu language learning applications
- **Low Barrier**: Small size enables experimentation without major compute resources
#### Potential Concerns
- **Synthetic Bias**: Audio is synthetic and may not represent natural Urdu speech patterns
- **Voice Diversity**: Limited to 13 voice profiles, which may not represent the full spectrum of Urdu speakers
- **Domain Coverage**: Text sources may not cover all Urdu dialects or specialized domains
- **Quality Variance**: As a preview, sampling may not perfectly represent full dataset quality distribution
### Discussion of Biases
#### Known Biases
1. **Synthetic Speech Bias**: All audio is TTS-generated, not natural speech
   - May contain artifacts specific to TTS systems
   - Prosody and intonation may differ from human speech
2. **Voice Selection**: 13 voices may not represent:
   - Full range of Urdu accents (Pakistani, Indian variations)
   - Age diversity (child, elderly speakers)
   - Regional dialects
3. **Text Domain**: Source text may be biased toward:
   - Formal/written Urdu vs. colloquial speech
   - Certain topics or domains
   - Modern vocabulary vs. classical Urdu
4. **Sampling Bias**: Preview sampling may:
   - Over/under-represent certain characteristics
   - Not capture edge cases present in full dataset
### Limitations
#### Technical Limitations
- **Preview Size**: Only ~0.25% of full datasets
- **Voice Coverage**: 13 voices may be insufficient for some applications
- **Quality Variance**: Random sampling may include quality outliers
#### Use Case Limitations
**Suitable For:**
- ✅ Quick prototyping and testing
- ✅ Algorithm development
- ✅ Educational purposes
- ✅ Pipeline validation
- ✅ Quality assessment
**Not Suitable For:**
- ❌ Production model training (use full datasets)
- ❌ Comprehensive benchmarking
- ❌ Statistical significance testing
- ❌ Fine-grained quality analysis
#### Recommendations
1. **For Research**: Use full datasets (Munch v1 or v2) for final experiments
2. **For Production**: Validate on full dataset before deployment
3. **For Training**: Treat this as a dev/test set and use the full dataset for training
4. **For Evaluation**: Supplement with natural Urdu speech data
### Privacy and Ethics
- **No Privacy Concerns**: Fully synthetic data with no personal information
- **No Consent Required**: No human speakers involved
- **Ethical Considerations**:
  - Synthetic voices should be clearly labeled as such in applications
  - Consider potential misuse for deepfakes or impersonation
  - Respect Urdu language and culture in applications
---
## 📈 Dataset Statistics
### Overall Statistics
| Metric | Value |
|--------|-------|
| Total Samples | ~20,000 |
| Total Size | ~1.2 GB |
| Audio Duration | ~16-28 hours |
| Languages | 1 (Urdu) |
| Voices | 13 per split |
| Sample Rate | 22,050 Hz |
| Bit Depth | 16-bit |
| Average Sample Duration | 3-5 seconds |
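
The audio duration range in the table follows from multiplying the approximate sample count by the average sample duration. The short sketch below reproduces that arithmetic using the table's own estimates; `total_hours` is a hypothetical helper name.

```python
def total_hours(num_samples, avg_seconds):
    """Estimate total audio duration in hours."""
    return num_samples * avg_seconds / 3600


# Table estimates: ~20,000 samples at 3 to 5 seconds each
low = total_hours(20_000, 3)
high = total_hours(20_000, 5)
print(f"{low:.1f} to {high:.1f} hours")  # 16.7 to 27.8 hours
```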
### Per-Split Statistics
| Metric | Munch v1 | Munch v2 |
|--------|----------|----------|
| Samples | ~10,000 | ~10,000 |
| Size | ~600 MB | ~600 MB |
| Source Dataset Size | 1.27 TB | 3.28 TB |
| Source Total Samples | 4,167,500 | 3,856,500 |
| Sampling Rate | ~0.24% | ~0.26% |
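
The sampling rate row is the ratio of preview samples to source samples, expressed as a percentage. A minimal sketch of that calculation with the approximate figures from the table (`sampling_percent` is a hypothetical helper name):

```python
def sampling_percent(preview_samples, source_samples):
    """Share of the source dataset included in the preview, in percent."""
    return 100 * preview_samples / source_samples


v1_rate = sampling_percent(10_000, 4_167_500)
v2_rate = sampling_percent(10_000, 3_856_500)
print(f"v1: {v1_rate:.2f}%  v2: {v2_rate:.2f}%")  # v1: 0.24%  v2: 0.26%
```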
### Text Statistics (Estimated)
| Metric | Range |
|--------|-------|
| Characters per Sample | 20-200 |
| Words per Sample | 5-40 |
| Average Text Length | ~50-80 characters |
---
## 🔗 Related Datasets
### Full Datasets
| Dataset | Size | Samples | Link |
|---------|------|---------|------|
| **Munch v1** | 1.27 TB | 4.17M | [humair025/Munch](https://huggingface.co/datasets/humair025/Munch) |
| **Munch-1 v2** | 3.28 TB | 3.86M | [humair025/munch-1](https://huggingface.co/datasets/humair025/munch-1) |
### Index Datasets (Metadata Only)
| Dataset | Size | Purpose | Link |
|---------|------|---------|------|
| **Munch v1 Index** | ~1 GB | Fast exploration without audio | [humair025/hashed_data](https://huggingface.co/datasets/humair025/hashed_data) |
| **Munch v2 Index** | ~1 GB | Fast exploration without audio | [humair025/hashed_data_munch_1](https://huggingface.co/datasets/humair025/hashed_data_munch_1) |
### Upgrade Path
```
Preview Dataset (4.36 GB)
↓
Test on preview, validate approach
↓
Index Datasets (~1 GB each)
↓
Explore metadata, plan subsets
↓
Full Datasets (1.27 TB / 3.28 TB)
↓
Production training
```
---
## 📜 Licensing Information
### License
This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY-4.0)**.
#### You are free to:
- **Share**: Copy and redistribute the material in any medium or format
- **Adapt**: Remix, transform, and build upon the material for any purpose, even commercially
#### Under the following terms:
- **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made
[Full License Text](https://creativecommons.org/licenses/by/4.0/legalcode)
### Citation Requirements
If you use this dataset in your research or applications, please cite:
```bibtex
@dataset{munch_preview_2025,
  title={Munch Preview: Quick Start Urdu Text-to-Speech Dataset},
  author={Humair Munir},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/humair025/munch_preview}},
  note={Preview subsets of Munch v1 and Munch-1 v2 datasets}
}
```
For the full datasets, also cite:
```bibtex
@dataset{munch_v1_2025,
  title={Munch: Large-Scale Urdu Text-to-Speech Dataset},
  author={Humair Munir},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/humair025/Munch}}
}
@dataset{munch_v2_2025,
  title={Munch-1: Large-Scale Urdu Text-to-Speech Dataset},
  author={Humair Munir},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/humair025/munch-1}}
}
```
---
## 👥 Dataset Curators
**Created by**: Humair Munir
**Organization**: Independent
**Contact**: Available through HuggingFace dataset page
---
## 🙏 Acknowledgements
This preview dataset is made possible by:
- The creation of the full Munch and Munch-1 datasets
- HuggingFace for dataset hosting infrastructure
- Open-source TTS technology developers
---
## 📞 Contact & Support
### Questions or Issues?
- **Dataset Issues**: Use the [Discussions tab](https://huggingface.co/datasets/humair025/munch_preview/discussions)
- **Feature Requests**: Open an issue in Discussions
- **Bug Reports**: Report in Discussions with detailed information
### Additional Resources
- **Documentation**: This README
- **Full Datasets**: See Related Datasets section above
- **Community**: Join discussions on HuggingFace
---
## 📅 Changelog
### Version 1.0 (December 2025)
- Initial release
- ~4k samples from Munch v1
- ~4k samples from Munch-1 v2
- WAV format audio at 22,050 Hz
- Complete metadata and documentation
---
## ⚡ Quick Reference
### At a Glance
```python
# Installation (run in a shell, not in Python):
#   pip install datasets soundfile
# Load
from datasets import load_dataset
ds = load_dataset("humair025/munch_preview")
# Explore
print(ds)
print(ds['munch_v1'][0])
# Play audio (Jupyter)
import IPython.display as ipd
sample = ds['munch_v1'][0]
ipd.display(ipd.Audio(sample['audio']['array'], rate=sample['audio']['sampling_rate']))
```
### Key URLs
- **This Dataset**: https://huggingface.co/datasets/humair025/munch_preview
- **Munch v1 Full**: https://huggingface.co/datasets/humair025/Munch
- **Munch v2 Full**: https://huggingface.co/datasets/humair025/munch-1
- **License**: https://creativecommons.org/licenses/by/4.0/
---
*Last Updated: December 2025*
Provider: humair025



