BengVoice: A Stratified Dataset of Code-Mixed Bengali-English Voice Commands for Intent Classification in Conversational AI Systems

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/sr99ryf4ns


Description:

BengVoice is a curated benchmark collection of 1,200 Bengali voice assistant utterances for intent classification research in conversational AI systems. It addresses a gap in Natural Language Understanding resources for Bengali, one of the world's most widely spoken languages with over 230 million speakers, yet significantly underrepresented in publicly available language technology datasets.

The dataset comprises utterances across 10 fundamental voice assistant intent categories: weather queries, time queries, alarm setting, news requests, music playback, phone calls, messaging, translation, calculations, and general knowledge questions. Each intent category contains exactly 120 samples, so the classes are perfectly balanced, and all 1,200 utterances are unique.

A distinguishing feature is authentic code-mixing behaviour, that is, the natural integration of English words within Bengali speech. Analysis shows that 290 samples (24.2%) contain code-mixed content, and the rate varies by domain: technical domains like alarm setting show 71.7% code-mixing, while traditional domains show minimal mixing (0.8%). This reflects the natural speech patterns of urban Bengali speakers in Bangladesh. The utterances also reference Bangladeshi locations (Dhaka, Chittagong, Sylhet), local media (Prothom Alo, Kaler Kantho), and other cultural elements specific to Bangladesh, so that they represent real-world usage by Bengali-speaking populations.

For robust evaluation, the dataset provides stratified 5-fold cross-validation splits. Each fold contains exactly 240 samples, 24 per intent, maintaining perfect balance. This stratification enables fair model comparison and supports multiple evaluation methodologies, including traditional machine learning, deep learning, retrieval-augmented generation (RAG), and few-shot prompting.
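
The balance property of the stratified folds can be illustrated with a short sketch. This is not the procedure the dataset authors used (the description does not say how the folds were built), and the dataset ships its own fold labels, which should be preferred. The function names, the placeholder intent names, and the round-robin assignment are all assumptions made for illustration; the code only shows why 10 intents of 120 samples each yield 5 folds of 240 samples with 24 per intent.

```python
from collections import Counter, defaultdict


def assign_stratified_folds(intents, n_folds=5):
    """Return one fold index per sample, spreading each intent evenly.

    Hypothetical helper: a plain round-robin within each intent,
    not the dataset authors' actual split procedure.
    """
    seen = defaultdict(int)  # samples placed so far, per intent
    folds = []
    for intent in intents:
        folds.append(seen[intent] % n_folds)  # cycle through the folds
        seen[intent] += 1
    return folds


def fold_intent_counts(intents, folds):
    """Count samples for every (fold, intent) pair to check balance."""
    return Counter(zip(folds, intents))


# Synthetic labels that mirror the stated shape: 10 intents x 120 samples.
# The names are placeholders, not the dataset's real label strings.
intent_names = [f"intent_{i}" for i in range(10)]
intents = [name for name in intent_names for _ in range(120)]

folds = assign_stratified_folds(intents)
counts = fold_intent_counts(intents, folds)
fold_sizes = Counter(folds)

print(sorted(fold_sizes.items()))  # size of each fold
print(set(counts.values()))        # samples per (fold, intent) pair
```

The same `fold_intent_counts` check could be run against the fold labels distributed with the dataset to confirm the stated 24-per-intent balance before training.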
Baseline validation experiments using TF-IDF vectorization with character-level n-grams and Logistic Regression achieved a mean accuracy of 93.92% (±0.50%) across 5-fold cross-validation, with fold accuracies ranging from 93.33% to 94.58%. Per-intent accuracy ranged from 81.67% (news requests) to 100% (translation), establishing clear benchmarks and validating dataset quality.

The dataset is provided in multiple formats: complete datasets in JSON and CSV (with and without fold labels), plus individual fold files for pre-separated evaluation. No proprietary software is required.

This resource supports Bengali voice assistant development, intent classification benchmarking, code-mixing investigation, cross-lingual transfer learning, multilingual NLU systems, and low-resource language processing. It is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license for maximum research impact.
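
A baseline of the kind described above (character-level TF-IDF features with Logistic Regression, evaluated by holding out each predefined fold once) might be sketched with scikit-learn as follows. The n-gram range, `max_iter`, and all other hyperparameters are assumptions, since the description does not state them, and the four utterances are made-up stand-ins rather than dataset samples. The function name `cross_validate_folds` is hypothetical. The sketch is not expected to reproduce the reported 93.92%.

```python
from statistics import mean

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def cross_validate_folds(texts, intents, folds):
    """Hold out each predefined fold once and return the fold accuracies."""
    accuracies = []
    for held_out in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != held_out]
        test = [i for i, f in enumerate(folds) if f == held_out]
        model = make_pipeline(
            # Character n-grams; the (1, 3) range is an assumed setting.
            TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
            LogisticRegression(max_iter=1000),
        )
        model.fit([texts[i] for i in train], [intents[i] for i in train])
        accuracies.append(
            model.score([texts[i] for i in test], [intents[i] for i in test])
        )
    return accuracies


# Toy stand-in data (NOT from BengVoice): 2 intents, 2 folds.
texts = [
    "আজকের আবহাওয়া কেমন",         # weather
    "এখন কয়টা বাজে",               # time
    "আগামীকাল আবহাওয়া কেমন হবে",  # weather
    "এখন সময় কত",                  # time
]
intents = ["weather", "time", "weather", "time"]
folds = [0, 0, 1, 1]

accuracies = cross_validate_folds(texts, intents, folds)
print("fold accuracies:", accuracies)
print("mean accuracy:", mean(accuracies))
```

With the real data, `texts`, `intents`, and `folds` would come from the CSV or JSON file that includes fold labels; the column names in those files are not given in the description and would need to be checked first.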

Created:

2026-02-26
