aman-hf/indic_asr
Source: Hugging Face. Updated 2026-03-06. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/aman-hf/indic_asr
Description:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- hi2
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*/*.parquet"
- config_name: hi2
  data_files:
  - split: train
    path: "data/hi2/*.parquet"
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: duration
    dtype: float64
  - name: language
    dtype: string
  - name: source
    dtype: string
---
# Indic ASR Unified Dataset
A unified collection of Indian-language ASR datasets for pretraining.
## Stats
- **Total hours:** 10,278
- **Total samples:** 4,732,705
- **Languages:** 1
- **Audio:** 16 kHz mono (mixed flac/mp3/wav)
## Languages
| Language | Hours | Samples |
|----------|-------|---------|
| hi2 | 10,278 | 4,732,705 |
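As a quick sanity check on the figures above, the average clip length follows from dividing total audio time by the sample count. This is only a rough estimate, since the hour total in the table is rounded.

```python
# Figures copied from the stats table above.
total_hours = 10_278
total_samples = 4_732_705

# Convert hours to seconds, then average over all samples.
avg_seconds = total_hours * 3600 / total_samples
print(f"Average clip length: {avg_seconds:.2f} s")
```

The result is just under 8 seconds per sample, which is typical of utterance-level ASR segments.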
## Usage
```python
from datasets import load_dataset
# Load all languages (streaming)
ds = load_dataset("aman-hf/indic_asr", streaming=True, split="train")
# Load a specific language (the config name declared in the card metadata is "hi2")
ds_hi = load_dataset("aman-hf/indic_asr", "hi2", streaming=True, split="train")
```
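A streaming dataset is consumed by iteration, so a small helper for peeking at the first few records can be handy. The sketch below is illustrative only: `preview` is a hypothetical helper, and `fake_stream` is made-up stand-in data shaped like the card's schema so the snippet runs offline. With the real dataset, the streaming `ds` object from the snippet above could be passed in instead.

```python
from itertools import islice

def preview(samples, n=3):
    """Return (text, duration, language, source) tuples for the first n samples.

    `samples` can be any iterable of dict-like records, for example a
    streaming dataset; only the first n records are consumed.
    """
    return [
        (s["text"], s["duration"], s["language"], s["source"])
        for s in islice(samples, n)
    ]

# Hypothetical stand-in records (not real data from the dataset).
fake_stream = [
    {"text": "example one", "duration": 4.2, "language": "hi2", "source": "corpus_a"},
    {"text": "example two", "duration": 9.5, "language": "hi2", "source": "corpus_b"},
    {"text": "example three", "duration": 7.0, "language": "hi2", "source": "corpus_a"},
]

for row in preview(fake_stream, n=2):
    print(row)
```

Using `islice` avoids materialising the whole stream, which matters for a collection of this size.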
## Schema
- `audio`: Audio bytes (16 kHz mono)
- `text`: Transcription text
- `duration`: Duration in seconds
- `language`: ISO language code
- `source`: Original dataset name
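
The `duration` and `source` fields make it straightforward to filter clips by length and tally hours per originating corpus. The following is a minimal sketch under stated assumptions: `hours_by_source` is a hypothetical helper, the 1 to 30 second bounds are arbitrary illustration values, and the records are invented stand-ins rather than real dataset rows.

```python
from collections import defaultdict

def hours_by_source(samples, min_s=1.0, max_s=30.0):
    """Sum audio hours per `source`, keeping clips with min_s <= duration <= max_s."""
    totals = defaultdict(float)
    for s in samples:
        if min_s <= s["duration"] <= max_s:
            totals[s["source"]] += s["duration"] / 3600.0
    return dict(totals)

# Hypothetical records following the schema above (audio and text omitted for brevity).
records = [
    {"duration": 10.0, "language": "hi2", "source": "corpus_a"},
    {"duration": 20.0, "language": "hi2", "source": "corpus_a"},
    {"duration": 6.0, "language": "hi2", "source": "corpus_b"},
    {"duration": 45.0, "language": "hi2", "source": "corpus_a"},  # too long, skipped
    {"duration": 0.5, "language": "hi2", "source": "corpus_b"},   # too short, skipped
]

for source, hours in sorted(hours_by_source(records).items()):
    print(f"{source}: {hours * 3600:.1f} s kept")
```

Because the function only reads `duration` and `source`, it never touches the `audio` field of the records it is given.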
Provider:
aman-hf



