humair025/munch_urdu_preview

Name: humair025/munch_urdu_preview
Creator: humair025
Published: 2025-12-06 14:16:03
License: No description available

Hugging Face | Updated 2025-12-06 | Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/humair025/munch_urdu_preview


Description:

---
license: cc-by-4.0
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ur
tags:
- Urdu
- TTS
- ASR
- speech-synthesis
- preview
- audio
size_categories:
- 10K<n<100K
pretty_name: Munch Preview
---

# 🎧 Munch Preview Dataset

[![Munch v1](https://img.shields.io/badge/🤗%20Full%20Dataset%20v1-1.27TB-blue)](https://huggingface.co/datasets/humair025/Munch)
[![Munch v2](https://img.shields.io/badge/🤗%20Full%20Dataset%20v2-3.28TB-blue)](https://huggingface.co/datasets/humair025/munch-1)
[![License](https://img.shields.io/badge/License-CC--BY--4.0-green)](https://creativecommons.org/licenses/by/4.0/)
[![Size](https://img.shields.io/badge/Size-~4.36GB-brightgreen)]()

## 📖 Table of Contents

- [Dataset Description](#dataset-description)
- [Dataset Structure](#dataset-structure)
- [Dataset Creation](#dataset-creation)
- [Usage](#usage)
- [Considerations](#considerations)
- [Citation](#citation)
- [Contact](#contact)

---

## 📋 Dataset Description

### Overview

**Munch Preview** is a curated preview dataset containing **high-quality Urdu text-to-speech samples** from both versions of the Munch dataset family. This lightweight version lets researchers, developers, and practitioners explore and prototype with Urdu TTS data without downloading the full multi-terabyte datasets.

### Purpose

This preview dataset serves multiple purposes:

- **🎯 Quick Exploration**: Rapidly assess the quality and characteristics of the Munch datasets
- **🧪 Prototyping**: Test TTS/ASR pipelines before committing to full dataset downloads
- **📚 Educational**: Learn about Urdu speech synthesis with manageable data sizes
- **🔬 Algorithm Development**: Develop and validate algorithms on representative samples
- **📊 Comparative Analysis**: Compare v1 and v2 dataset characteristics side by side

### Key Features

- ✅ **Two Dataset Versions**: Samples from both Munch v1 and Munch-1 v2
- ✅ **Balanced Sampling**: Stratified random sampling across all 13 voices
- ✅ **WAV Format**: High-quality audio playable directly in the HuggingFace viewer
- ✅ **Fast Download**: ~4.36 GB vs 4+ TB for the full datasets
- ✅ **Production Quality**: Same preprocessing and quality as the full datasets
- ✅ **Metadata Rich**: Complete transcripts, timestamps, and error tracking

### Languages

- **Primary**: Urdu (ur)
- **Script**: Arabic script (Nastaliq)

---

## 📊 Dataset Structure

### Data Instances

Each sample in the dataset contains:

```python
{
    'id': 123456,
    'text': 'یہ ایک نمونہ متن ہے',
    'transcript': 'یہ ایک نمونہ متن ہے',
    'voice': 'ash',
    'audio': {
        'array': array([...]),  # Audio waveform
        'path': None,
        'sampling_rate': 22050
    },
    'timestamp': '2025-12-01T10:30:45',
    'error': None,
    'original_parquet': 'train-00123.parquet',
    'dataset_version': 'munch_v1'
}
```

### Data Splits

| Split | Samples | Size | Source | Description |
|-------|---------|------|--------|-------------|
| **munch_v1** | ~4.9k | ~1.68 GB | [Munch v1](https://huggingface.co/datasets/humair025/Munch) | Samples from original Munch dataset (4.17M total) |
| **munch_v2** | ~4k | ~2.68 GB | [Munch-1 v2](https://huggingface.co/datasets/humair025/munch-1) | Samples from improved Munch-1 dataset (3.86M total) |
| **Total** | ~8k | ~4.36 GB | - | Combined preview dataset |

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | `int` | Unique paragraph identifier from source dataset |
| `text` | `string` | Original Urdu text (input to TTS system) |
| `transcript` | `string` | Transcription of generated audio (may differ from input) |
| `voice` | `string` | Voice identifier (13 options per dataset) |
| `audio` | `Audio` | Audio data with waveform array and metadata |
| `timestamp` | `string` | ISO 8601 timestamp of audio generation |
| `error` | `string` | Error message if generation failed (usually None) |
| `original_parquet` | `string` | Source parquet file from full dataset |
| `dataset_version` | `string` | Version identifier: "munch_v1" or "munch_v2" |

### Audio Specifications

- **Format**: WAV (Waveform Audio File Format)
- **Sample Rate**: 22,050 Hz
- **Channels**: Mono (1 channel)
- **Bit Depth**: 16-bit signed integer PCM
- **Encoding**: Linear PCM
- **Average Duration**: X seconds per sample
- **Total Duration**: ~X hours (combined)

### Voice Distribution

Each split contains samples from **13 different voices** with an approximately balanced distribution (~770 samples per voice):

**Available Voices**:

- `alloy` - Neutral, clear voice
- `echo` - Resonant, deep voice
- `fable` - Storytelling voice
- `onyx` - Strong, authoritative voice
- `nova` - Bright, energetic voice
- `shimmer` - Soft, gentle voice
- `coral` - Warm, friendly voice
- `verse` - Poetic, expressive voice
- `ballad` - Melodic, smooth voice
- `ash` - Natural, conversational voice
- `sage` - Wise, measured voice
- `amuch` - Custom voice variant
- `dan` - Custom voice variant

---

## 🔨 Dataset Creation

### Source Data

#### Initial Data Collection

This preview dataset is derived from two large-scale Urdu TTS datasets:

1. **Munch v1** (humair025/Munch)
   - Total Size: 1.27 TB
   - Total Samples: 4,167,500
   - Collection Period: 2025
2. **Munch-1 v2** (humair025/munch-1)
   - Total Size: 3.28 TB
   - Total Samples: 3,856,500
   - Collection Period: 2025

#### Data Pipeline

The source datasets were created using:

- Urdu text corpus from various domains (literature, news, social media, technical)
- State-of-the-art neural TTS synthesis
- Multiple voice profiles for diversity
- Quality validation and error tracking

### Annotations

#### Annotation Process

- **Automatic**: All audio was generated using text-to-speech systems
- **Transcription**: Generated transcripts may differ slightly from the input text due to TTS normalization
- **Error Tracking**: Samples with generation errors are flagged in the `error` field
- **No Human Annotation**: This is a synthetic dataset with automatic metadata

### Personal and Sensitive Information

- **No Personal Information**: All text and audio are synthetically generated
- **No Speaker Identification**: Voices are synthetic and do not correspond to real individuals
- **No Biometric Data**: Audio is generated, not recorded from human speakers

---

## 💻 Usage

### Loading the Dataset

#### Basic Loading

```python
from datasets import load_dataset

# Load complete dataset (both splits)
dataset = load_dataset("humair025/munch_preview")

print(f"Munch v1: {len(dataset['munch_v1']):,} samples")
print(f"Munch v2: {len(dataset['munch_v2']):,} samples")
```

#### Load Specific Split

```python
from datasets import load_dataset

# Load only v1
v1_dataset = load_dataset("humair025/munch_preview", split="munch_v1")

# Load only v2
v2_dataset = load_dataset("humair025/munch_preview", split="munch_v2")
```

#### Streaming Mode

```python
from datasets import load_dataset

# For even lower memory usage: samples are fetched lazily while iterating
dataset = load_dataset("humair025/munch_preview", streaming=True)

for sample in dataset['munch_v1']:
    print(sample['text'])
    break
```

### Basic Usage Examples

#### 1. Audio Playback (Jupyter/Colab)

```python
import IPython.display as ipd

# Play first sample from v1
sample = dataset['munch_v1'][0]
print(f"Text: {sample['text']}")
print(f"Voice: {sample['voice']}")

ipd.display(ipd.Audio(
    sample['audio']['array'],
    rate=sample['audio']['sampling_rate']
))
```

#### 2. Export to WAV Files

```python
import os

import soundfile as sf

# Export first 10 samples.
# Slicing a Dataset (ds[:10]) returns a dict of columns, not rows,
# so select() is used here to iterate over the first 10 rows.
os.makedirs("audio_samples", exist_ok=True)

for i, sample in enumerate(dataset['munch_v1'].select(range(10))):
    filename = f"audio_samples/v1_sample_{i:03d}_{sample['voice']}.wav"
    sf.write(
        filename,
        sample['audio']['array'],
        sample['audio']['sampling_rate']
    )
    print(f"Saved: {filename}")
```

#### 3. Filter by Voice

```python
# Get all samples from a specific voice.
# Passing only the 'voice' column to the predicate avoids decoding audio while filtering.
ash_samples = dataset['munch_v1'].filter(
    lambda voice: voice == 'ash',
    input_columns=['voice']
)

print(f"Found {len(ash_samples)} samples with 'ash' voice")
```

#### 4. Analyze Text Statistics

```python
import pandas as pd

# Convert to DataFrame for analysis (drop the audio column so waveforms are not loaded)
df_v1 = dataset['munch_v1'].remove_columns(['audio']).to_pandas()
df_v2 = dataset['munch_v2'].remove_columns(['audio']).to_pandas()

print("Text Length Statistics (v1):")
print(df_v1['text'].str.len().describe())

print("\nText Length Statistics (v2):")
print(df_v2['text'].str.len().describe())
```

#### 5. Audio Duration Analysis

```python
import numpy as np

# Calculate durations in seconds (number of samples / sampling rate)
v1_durations = [
    len(sample['audio']['array']) / sample['audio']['sampling_rate']
    for sample in dataset['munch_v1']
]
v2_durations = [
    len(sample['audio']['array']) / sample['audio']['sampling_rate']
    for sample in dataset['munch_v2']
]

print(f"V1 average duration: {np.mean(v1_durations):.2f}s")
print(f"V2 average duration: {np.mean(v2_durations):.2f}s")
print(f"V1 total duration: {sum(v1_durations)/3600:.2f} hours")
print(f"V2 total duration: {sum(v2_durations)/3600:.2f} hours")
```

### Advanced Usage

#### 1. Compare Dataset Versions

```python
from collections import Counter

# Voice distribution comparison (reads only the 'voice' column)
v1_voices = Counter(dataset['munch_v1']['voice'])
v2_voices = Counter(dataset['munch_v2']['voice'])

print("Voice Distribution Comparison:")
print(f"{'Voice':<10} {'V1 Count':<10} {'V2 Count':<10}")
print("-" * 30)
for voice in sorted(v1_voices.keys()):
    print(f"{voice:<10} {v1_voices[voice]:<10} {v2_voices[voice]:<10}")
```

#### 2. Train/Validation Split

```python
from datasets import DatasetDict

# Split v1 into train/validation (80/20)
v1_split = dataset['munch_v1'].train_test_split(test_size=0.2, seed=42)

train_val_dataset = DatasetDict({
    'train': v1_split['train'],
    'validation': v1_split['test']
})

print(f"Train: {len(train_val_dataset['train'])} samples")
print(f"Validation: {len(train_val_dataset['validation'])} samples")
```

#### 3. Combine Both Versions

```python
from datasets import concatenate_datasets

# Combine v1 and v2 for training
combined = concatenate_datasets([
    dataset['munch_v1'],
    dataset['munch_v2']
])

print(f"Combined dataset: {len(combined):,} samples")

# Shuffle for training
combined_shuffled = combined.shuffle(seed=42)
```

#### 4. Feature Extraction

```python
import librosa
import numpy as np

# Extract MFCC features from audio
def extract_features(sample):
    audio_array = np.asarray(sample['audio']['array'], dtype=np.float32)
    sr = sample['audio']['sampling_rate']

    # Extract MFCCs
    mfccs = librosa.feature.mfcc(
        y=audio_array,
        sr=sr,
        n_mfcc=13
    )

    return {
        'mfcc_mean': mfccs.mean(axis=1),
        'mfcc_std': mfccs.std(axis=1)
    }

# Apply to first 100 samples (select() yields rows; slicing would yield columns)
features = [extract_features(s) for s in dataset['munch_v1'].select(range(100))]
```

#### 5. Create Evaluation Set

```python
# Create balanced evaluation set (50 samples per voice, for five of the voices)
eval_samples = []
for voice in ['alloy', 'echo', 'fable', 'onyx', 'nova']:
    voice_samples = [
        s for s in dataset['munch_v1']
        if s['voice'] == voice
    ][:50]
    eval_samples.extend(voice_samples)

print(f"Evaluation set: {len(eval_samples)} samples")
```

---

## 🔍 Considerations for Using the Data

### Social Impact

#### Positive Impacts

- **Language Preservation**: Supports Urdu language technology development
- **Accessibility**: Enables text-to-speech applications for Urdu speakers
- **Research Enablement**: Provides researchers with quality Urdu audio data
- **Educational**: Facilitates Urdu language learning applications
- **Low Barrier**: Small size enables experimentation without major compute resources

#### Potential Concerns

- **Synthetic Bias**: Audio is synthetic and may not represent natural Urdu speech patterns
- **Voice Diversity**: Limited to 13 voice profiles, which may not represent the full spectrum of Urdu speakers
- **Domain Coverage**: Text sources may not cover all Urdu dialects or specialized domains
- **Quality Variance**: As a preview, sampling may not perfectly represent the full dataset quality distribution

### Discussion of Biases

#### Known Biases

1. **Synthetic Speech Bias**: All audio is TTS-generated, not natural speech
   - May contain artifacts specific to TTS systems
   - Prosody and intonation may differ from human speech
2. **Voice Selection**: 13 voices may not represent:
   - Full range of Urdu accents (Pakistani, Indian variations)
   - Age diversity (child, elderly speakers)
   - Regional dialects
3. **Text Domain**: Source text may be biased toward:
   - Formal/written Urdu vs. colloquial speech
   - Certain topics or domains
   - Modern vocabulary vs. classical Urdu
4. **Sampling Bias**: Preview sampling may:
   - Over/under-represent certain characteristics
   - Not capture edge cases present in the full dataset

### Limitations

#### Technical Limitations

- **Preview Size**: Only ~0.25% of the full datasets
- **Voice Coverage**: 13 voices may be insufficient for some applications
- **Quality Variance**: Random sampling may include quality outliers

#### Use Case Limitations

**Suitable For:**

- ✅ Quick prototyping and testing
- ✅ Algorithm development
- ✅ Educational purposes
- ✅ Pipeline validation
- ✅ Quality assessment

**Not Suitable For:**

- ❌ Production model training (use full datasets)
- ❌ Comprehensive benchmarking
- ❌ Statistical significance testing
- ❌ Fine-grained quality analysis

#### Recommendations

1. **For Research**: Use the full datasets (Munch v1 or v2) for final experiments
2. **For Production**: Validate on the full dataset before deployment
3. **For Training**: Treat this as a dev/test set and use the full dataset for training
4. **For Evaluation**: Supplement with natural Urdu speech data

### Privacy and Ethics

- **No Privacy Concerns**: Fully synthetic data with no personal information
- **No Consent Required**: No human speakers involved
- **Ethical Considerations**:
  - Synthetic voices should be clearly labeled as such in applications
  - Consider potential misuse for deepfakes or impersonation
  - Respect Urdu language and culture in applications

---

## 📈 Dataset Statistics

### Overall Statistics

| Metric | Value |
|--------|-------|
| Total Samples | ~20,000 |
| Total Size | ~1.2 GB |
| Audio Duration | ~16-28 hours |
| Languages | 1 (Urdu) |
| Voices | 13 per split |
| Sample Rate | 22,050 Hz |
| Bit Depth | 16-bit |
| Average Sample Duration | 3-5 seconds |

### Per-Split Statistics

| Metric | Munch v1 | Munch v2 |
|--------|----------|----------|
| Samples | ~10,000 | ~10,000 |
| Size | ~600 MB | ~600 MB |
| Source Dataset Size | 1.27 TB | 3.28 TB |
| Source Total Samples | 4,167,500 | 3,856,500 |
| Sampling Rate | ~0.24% | ~0.26% |
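
The "Sampling Rate" row in the Per-Split Statistics table is the preview split size divided by the source dataset's total sample count. The sketch below reproduces that arithmetic from the figures in the table, together with the approximate per-voice count quoted under Voice Distribution. The function name `sampling_fraction` and the `SPLITS` dictionary are illustrative names introduced here, not part of the dataset or of any library.

```python
def sampling_fraction(preview_samples: int, source_samples: int) -> float:
    """Return the share of the source dataset covered by a preview split."""
    if source_samples <= 0:
        raise ValueError("source_samples must be positive")
    return preview_samples / source_samples


# Figures copied from the Per-Split Statistics table (approximate values).
SPLITS = {
    "munch_v1": {"preview": 10_000, "source": 4_167_500},
    "munch_v2": {"preview": 10_000, "source": 3_856_500},
}
NUM_VOICES = 13  # voices per split, as listed under Voice Distribution

for name, counts in SPLITS.items():
    fraction = sampling_fraction(counts["preview"], counts["source"])
    per_voice = counts["preview"] / NUM_VOICES
    print(f"{name}: {fraction:.2%} of source, ~{per_voice:.0f} samples per voice")
```

The same function can be applied to the smaller split counts listed in the Data Splits table (~4.9k and ~4k), which yield correspondingly smaller fractions.
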

### Text Statistics (Estimated)

| Metric | Range |
|--------|-------|
| Characters per Sample | 20-200 |
| Words per Sample | 5-40 |
| Average Text Length | ~50-80 characters |

---

## 🔗 Related Datasets

### Full Datasets

| Dataset | Size | Samples | Link |
|---------|------|---------|------|
| **Munch v1** | 1.27 TB | 4.17M | [humair025/Munch](https://huggingface.co/datasets/humair025/Munch) |
| **Munch-1 v2** | 3.28 TB | 3.86M | [humair025/munch-1](https://huggingface.co/datasets/humair025/munch-1) |

### Index Datasets (Metadata Only)

| Dataset | Size | Purpose | Link |
|---------|------|---------|------|
| **Munch v1 Index** | ~1 GB | Fast exploration without audio | [humair025/hashed_data](https://huggingface.co/datasets/humair025/hashed_data) |
| **Munch v2 Index** | ~1 GB | Fast exploration without audio | [humair025/hashed_data_munch_1](https://huggingface.co/datasets/humair025/hashed_data_munch_1) |

### Upgrade Path

```
Preview Dataset (4.36 GB)
    ↓
Test on preview, validate approach
    ↓
Index Datasets (~1 GB each)
    ↓
Explore metadata, plan subsets
    ↓
Full Datasets (1.27 TB / 3.28 TB)
    ↓
Production training
```

---

## 📜 Licensing Information

### License

This dataset is released under **Creative Commons Attribution 4.0 International (CC-BY-4.0)**.

#### You are free to:

- **Share**: Copy and redistribute the material in any medium or format
- **Adapt**: Remix, transform, and build upon the material for any purpose, even commercially

#### Under the following terms:

- **Attribution**: You must give appropriate credit, provide a link to the license, and indicate if changes were made

[Full License Text](https://creativecommons.org/licenses/by/4.0/legalcode)

### Citation Requirements

If you use this dataset in your research or applications, please cite:

```bibtex
@dataset{munch_preview_2025,
  title={Munch Preview: Quick Start Urdu Text-to-Speech Dataset},
  author={Humair Munir},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/humair025/munch_preview}},
  note={Preview subsets of Munch v1 and Munch-1 v2 datasets}
}
```

For the full datasets, also cite:

```bibtex
@dataset{munch_v1_2025,
  title={Munch: Large-Scale Urdu Text-to-Speech Dataset},
  author={Humair Munir},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/humair025/Munch}}
}

@dataset{munch_v2_2025,
  title={Munch-1: Large-Scale Urdu Text-to-Speech Dataset},
  author={Humair Munir},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/humair025/munch-1}}
}
```

---

## 👥 Dataset Curators

**Created by**: Humair Munir (humair025)

**Organization**: Independent

**Contact**: Available through the HuggingFace dataset page

---

## 🙏 Acknowledgements

This preview dataset is made possible by:

- The creation of the full Munch and Munch-1 datasets
- HuggingFace for dataset hosting infrastructure
- Open-source TTS technology developers

---

## 📞 Contact & Support

### Questions or Issues?

- **Dataset Issues**: Use the [Discussions tab](https://huggingface.co/datasets/humair025/munch_preview/discussions)
- **Feature Requests**: Open an issue in Discussions
- **Bug Reports**: Report in Discussions with detailed information

### Additional Resources

- **Documentation**: This README
- **Full Datasets**: See the Related Datasets section above
- **Community**: Join discussions on HuggingFace

---

## 📅 Changelog

### Version 1.0 (December 2025)

- Initial release
- ~4k samples from Munch v1
- ~4k samples from Munch-1 v2
- WAV format audio at 22,050 Hz
- Complete metadata and documentation

---

## ⚡ Quick Reference

### At a Glance

```python
# Installation (run in a shell, not in Python):
#   pip install datasets soundfile

# Load
from datasets import load_dataset
ds = load_dataset("humair025/munch_preview")

# Explore
print(ds)
print(ds['munch_v1'][0])

# Play audio (Jupyter)
import IPython.display as ipd
sample = ds['munch_v1'][0]
ipd.display(ipd.Audio(sample['audio']['array'], rate=sample['audio']['sampling_rate']))
```

### Key URLs

- **This Dataset**: https://huggingface.co/datasets/humair025/munch_preview
- **Munch v1 Full**: https://huggingface.co/datasets/humair025/Munch
- **Munch v2 Full**: https://huggingface.co/datasets/humair025/munch-1
- **License**: https://creativecommons.org/licenses/by/4.0/

---

*Last Updated: December 2025*

Provider:

humair025
