
Devvrat024/Rural_Women_Bhojpuri

Hugging Face | Updated: 2026-04-02 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Devvrat024/Rural_Women_Bhojpuri
Resource description:
---
language:
- bho
- hi
license: cc-by-sa-4.0
task_categories:
- automatic-speech-recognition
pretty_name: Rural Bhojpuri ASR Dataset
dataset_info:
  features:
  - name: age_group
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: district
    dtype: string
  - name: duration
    dtype: float64
  - name: job_type
    dtype: string
  - name: lang
    dtype: string
  - name: language
    dtype: string
  - name: prompt_text
    dtype: string
  - name: qualification
    dtype: string
  - name: scenario
    dtype: string
  - name: speaker_id
    dtype: string
  - name: state
    dtype: string
  - name: task_name
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: benchmark
    num_bytes: 400115473
    num_examples: 444
  - name: train_real
    num_bytes: 460882675
    num_examples: 400
  - name: train_synthetic
    num_bytes: 39573449568
    num_examples: 77967
  download_size: 34075320626
  dataset_size: 40434447716
configs:
- config_name: default
  data_files:
  - split: benchmark
    path: data/benchmark-*
  - split: train_real
    path: data/train_real-*
  - split: train_synthetic
    path: data/train_synthetic-*
---

# Rural Bhojpuri ASR Dataset

## Dataset Description

This dataset is curated to foster the development of inclusive Automatic Speech Recognition (ASR) systems, with a special focus on the underrepresented voices of rural Bhojpuri women. It contains audio clips in both Bhojpuri and Hindi, collected from real-world and synthetic sources, designed to train and evaluate ASR models that can accurately recognize diverse speech patterns.

This work is part of the research presented in the paper "Recognizing Every Voice: Towards Inclusive ASR for Rural Bhojpuri Women."

## How to Use

The dataset can be easily loaded using the Hugging Face `datasets` library.
```python
from datasets import load_dataset

# Load the dataset (repository ID as given in the download link above)
dataset = load_dataset("Devvrat024/Rural_Women_Bhojpuri")

# Access a specific split
train_real_split = dataset["train_real"]

# Print the first example
print(train_real_split[0])

# The audio will be automatically decoded and resampled to 16kHz
# Example: {'audio': {'path': '...', 'array': array([-0.00024414, -0.00048828, ...], dtype=float32), 'sampling_rate': 16000}, 'text': '...', ...}
```

## Citation

If you use this dataset in your research, please cite the following paper:

```
@misc{joshi2025recognizingvoiceinclusiveasr,
  title={Recognizing Every Voice: Towards Inclusive ASR for Rural Bhojpuri Women},
  author={Sakshi Joshi and Eldho Ittan George and Tahir Javed and Kaushal Bhogale and Nikhil Narasimhan and Mitesh M. Khapra},
  year={2025},
  eprint={2506.09653},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2506.09653},
}
```
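## Split Sizes

The split metadata in the card header shows that the three splits differ greatly in size, which matters when deciding what to download. The sketch below works through that arithmetic using only the `num_bytes` and `num_examples` values listed in the card. The helper names `summarize_splits` and `stream_split` are hypothetical and not part of the dataset or of any library; `stream_split` assumes the `datasets` library is installed and relies on the `streaming=True` option of `load_dataset` to iterate a split without downloading it in full.

```python
# Split metadata copied from the `dataset_info.splits` entries in the card header.
SPLITS = {
    "benchmark": {"num_bytes": 400_115_473, "num_examples": 444},
    "train_real": {"num_bytes": 460_882_675, "num_examples": 400},
    "train_synthetic": {"num_bytes": 39_573_449_568, "num_examples": 77_967},
}


def summarize_splits(splits):
    """Return the total byte count plus per-split share and mean example size."""
    total_bytes = sum(s["num_bytes"] for s in splits.values())
    summary = {}
    for name, s in splits.items():
        summary[name] = {
            "share_of_bytes": s["num_bytes"] / total_bytes,
            "mean_bytes_per_example": s["num_bytes"] / s["num_examples"],
        }
    return total_bytes, summary


def stream_split(repo_id, split):
    """Hypothetical helper: open one split lazily instead of downloading it."""
    # Imported lazily so the size arithmetic above runs without `datasets`.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


total_bytes, summary = summarize_splits(SPLITS)
for name, stats in summary.items():
    print(
        f"{name}: {stats['share_of_bytes']:.1%} of bytes, "
        f"{stats['mean_bytes_per_example'] / 1e6:.2f} MB per example"
    )
```

Because `train_synthetic` accounts for nearly all of the stored bytes, one reasonable approach is to load `benchmark` and `train_real` normally and to call something like `stream_split("Devvrat024/Rural_Women_Bhojpuri", "train_synthetic")` when only a sample of the synthetic data is needed.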
Provider:
Devvrat024