RonitMehta260704/kh-en-dataset
Source: Hugging Face | Updated: 2026-03-30 | Indexed: 2026-04-12
Download link: https://hf-mirror.com/datasets/RonitMehta260704/kh-en-dataset
---
license: cc-by-4.0
language:
- kha
- en
task_categories:
- automatic-speech-recognition
- text-to-speech
- translation
tags:
- khasi
- low-resource
- northeast-india
- bhashini
- ASR
- TTS
- speech-translation
- machine-translation
size_categories:
- 10K<n<100K
configs:
- config_name: asr_khasi
  data_files:
  - split: train
    path: asr_khasi/train-*.parquet
  - split: validation
    path: asr_khasi/validation-*.parquet
  default: true
- config_name: asr_english
  data_files:
  - split: train
    path: asr_english/train-*.parquet
  - split: validation
    path: asr_english/validation-*.parquet
- config_name: speech_translation
  data_files:
  - split: train
    path: speech_translation/train-*.parquet
  - split: validation
    path: speech_translation/validation-*.parquet
---
# KYNMAW — Khasi Bhashini Multi-Task Dataset
**Team:** ITerative Bytes | **Hackathon:** Bhashini Khasi Language Model Training
**Roles:** Ronit Satish Mehta (MLOps), Vipul Mhatre (ASR/TTS), Varun Vyas (NLP/MT),
Ankush Pandey (Annotation), Saniyaa Shetty (AI Architect)
---
## Dataset Description
Multi-task audio + text dataset for **Khasi** (`kha`), a low-resource Austroasiatic
language of Meghalaya, Northeast India. Sourced from **AIR Shillong** (All India Radio)
broadcasts via Prasar Bharati's public archive.
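Built to support all 4 Bhashini model training tasks (ASR, TTS, machine translation, speech translation) through the configs below; TTS reuses the audio-transcript pairs of `asr_khasi`: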
| Config | Task | Input | Output |
|---|---|---|---|
| `asr_khasi` | Speech Recognition | Khasi audio (16kHz WAV) | Khasi transcript |
| `asr_english` | Speech Recognition | English audio (16kHz WAV) | English transcript |
| `speech_translation` | Speech Translation | English audio | Khasi text |
| `machine_translation` | Machine Translation | English text | Khasi text |
---
## Usage
```python
from datasets import load_dataset

# Tasks 1 & 2: ASR / TTS (Khasi)
ds_asr = load_dataset("RonitMehta260704/kh-en-dataset", "asr_khasi")
print(ds_asr["train"][0])
# {'audio': {'array': [...], 'sampling_rate': 16000},
#  'duration_sec': 8.3, 'transcript_khasi': '', ...}

# Task 3: Machine Translation
# Note: `machine_translation` is not declared under `configs` in the card
# metadata, so this call may fail until that config is published.
ds_mt = load_dataset("RonitMehta260704/kh-en-dataset", "machine_translation")
print(ds_mt["train"][0])
# {'transcript_english': '...', 'transcript_khasi': '...', 'date_key': '...'}

# Task 4: Speech Translation
ds_st = load_dataset("RonitMehta260704/kh-en-dataset", "speech_translation")
print(ds_st["train"][0])
# {'audio_english': {...}, 'audio_khasi': {...}, 'translation_en2kha': ''}
```
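Because transcripts may be empty, it can help to check how much of a split is annotated before training. The following is a minimal sketch, not part of the published tooling: the helper name `annotation_summary` is hypothetical, and the field names `transcript_khasi` and `duration_sec` are taken from the sample row shown above and are assumed to hold for the whole split.

```python
def annotation_summary(rows, text_field="transcript_khasi",
                       duration_field="duration_sec"):
    """Count rows, non-empty transcripts, and total audio hours.

    `rows` is any iterable of dict-like records, for example a loaded
    split with the audio column removed to avoid decoding every clip.
    """
    total = 0
    annotated = 0
    seconds = 0.0
    for row in rows:
        total += 1
        # Missing or None durations are treated as zero seconds.
        seconds += float(row.get(duration_field) or 0.0)
        # A transcript counts as annotated only if it has non-blank text.
        if (row.get(text_field) or "").strip():
            annotated += 1
    return {"rows": total, "annotated": annotated, "hours": seconds / 3600.0}


if __name__ == "__main__":
    # Two made-up rows standing in for real dataset records.
    demo = [
        {"duration_sec": 1800.0, "transcript_khasi": ""},
        {"duration_sec": 1800.0, "transcript_khasi": "placeholder text"},
    ]
    print(annotation_summary(demo))
```

With a real split, passing `ds_asr["train"].remove_columns("audio")` instead of the demo list should avoid decoding audio while iterating, assuming the audio column is named `audio` as in the sample row.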
---
## Data Collection Methodology (KYNMAW Pipeline)
```
AIR Shillong Archive (250 pages)
│
▼
[Scrape] English 0830 + Khasi 0745 bulletins per date
│
▼
[Process] Noise Removal → Silence Trimming → 16kHz Resampling
│
▼
[Segment] 2–20 second clips via silence detection (ffmpeg silencedetect)
│
▼
[Pair] English ↔ Khasi matched by broadcast date
│
▼
[Upload] 4 Bhashini-compatible HF dataset configs
```
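The `[Segment]` step can be sketched as follows. This is an illustration under stated assumptions rather than the team's actual code: it assumes the stderr log of an `ffmpeg -af silencedetect=...` run has already been captured as text, the function names are hypothetical, and the policy for stretches outside the 2–20 second range (drop shorter ones, cut longer ones into fixed 20-second chunks) is a guess at one reasonable choice.

```python
import re

# silencedetect logs lines such as:
#   [silencedetect @ 0x...] silence_start: 3.512
#   [silencedetect @ 0x...] silence_end: 4.256 | silence_duration: 0.744
SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log_text):
    """Return (start, end) pairs of silent intervals found in the log text."""
    silences = []
    start = None
    for line in log_text.splitlines():
        match = SILENCE_START.search(line)
        if match:
            # ffmpeg can report a slightly negative start at the file head.
            start = max(0.0, float(match.group(1)))
            continue
        match = SILENCE_END.search(line)
        if match and start is not None:
            silences.append((start, float(match.group(1))))
            start = None
    return silences


def speech_segments(silences, total_sec, min_sec=2.0, max_sec=20.0):
    """Convert silent intervals into speech clips of min_sec..max_sec seconds."""
    # Speech is whatever lies between consecutive silent intervals.
    stretches = []
    cursor = 0.0
    for s_start, s_end in silences:
        if s_start > cursor:
            stretches.append((cursor, s_start))
        cursor = max(cursor, s_end)
    if total_sec > cursor:
        stretches.append((cursor, total_sec))

    clips = []
    for start, end in stretches:
        # Assumed policy: cut long stretches into max_sec chunks.
        while end - start > max_sec:
            clips.append((start, start + max_sec))
            start += max_sec
        # Assumed policy: drop remainders shorter than min_sec.
        if end - start >= min_sec:
            clips.append((start, end))
    return clips
```

Each resulting `(start, end)` pair could then be passed to ffmpeg's `-ss` and `-to` options to cut the clip. Splitting at fixed 20-second boundaries may cut through words, so a real pipeline might instead re-run detection with a lower silence threshold on long stretches.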
---
## Annotation Status
Transcripts are **empty by default**. Fill via:
- Native Khasi speaker annotation (recommended)
- Whisper pseudo-labelling (English only, as bootstrap)
- Bhashini ULCA crowdsourcing portal
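Whichever route is chosen, the filling step has the same shape: find rows with an empty transcript, obtain text for them, and record where the text came from. A minimal sketch follows, assuming rows are plain dicts; the helper name `fill_empty_transcripts`, the `is_pseudo_label` marker column, and the default field name `transcript_english` (borrowed from the machine-translation sample row) are all illustrative assumptions, and `transcribe` stands for any user-supplied callable, for instance a wrapper around a Whisper model for the English audio.

```python
def fill_empty_transcripts(rows, transcribe, field="transcript_english"):
    """Return copies of `rows` with empty `field` values filled by `transcribe`.

    `transcribe` is any callable that takes a row and returns text.
    Rows that already carry a transcript are left untouched.
    """
    labelled = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        if (row.get(field) or "").strip():
            # Existing (presumably human) transcript: keep it as is.
            row["is_pseudo_label"] = False
        else:
            row[field] = transcribe(row)
            # Hypothetical marker so machine output can be filtered later.
            row["is_pseudo_label"] = True
        labelled.append(row)
    return labelled


if __name__ == "__main__":
    # A stub stands in for a real speech recognizer here.
    demo = [{"transcript_english": ""}, {"transcript_english": "already done"}]
    print(fill_empty_transcripts(demo, lambda row: "machine guess"))
```

Keeping a marker column makes it possible to train on pseudo-labels first and later replace them as native-speaker annotations arrive.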
---
## Compatible Models (KYNMAW Stack)
| Task | Model |
|---|---|
| ASR | wav2vec 2.0 / XLS-R |
| TTS | FastSpeech 2 + HiFi-GAN |
| MT | IndicTrans2 / mBART fine-tune |
| ST | Seq2Seq with cross-lingual encoder |
---
## Citation
```bibtex
@dataset{kynmaw_khasi_bhashini_2025,
author = {ITerative Bytes},
title = {KYNMAW Khasi Bhashini Multi-Task Dataset},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/RonitMehta260704/kh-en-dataset}
}
```
## License
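CC BY 4.0. The authors state the dataset is intended for non-commercial linguistic research only, although CC BY 4.0 itself does not prohibit commercial use.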
Source: Prasar Bharati / All India Radio (public broadcaster).
Provider: RonitMehta260704



