BRWDS: A Multipurpose Dataset For Bangla Regional Word Detection

Name: BRWDS: A Multipurpose Dataset For Bangla Regional Word Detection
Creator: doi.org
License: No description available

Indexed from doi.org on 2025-01-21

Download link:

http://doi.org/10.17632/6pd2c48m66.3

Description:

The BRWDS (Bangla Regional Word Dataset) is a collection of commonly used Bengali words that highlights the linguistic diversity across the 8 divisions of Bangladesh: Dhaka, Chittagong, Mymensingh, Sylhet, Rajshahi, Khulna, Barishal, and Rangpur. It addresses the challenges posed by regional accents and variations in Bengali, which can create barriers to communication. In total, it includes 347 Bengali words that are frequently used in daily conversation across these regions. Although Bengali is spoken in every division, each region has its own accent, which leads to the variations in pronunciation and word usage captured in this dataset.

To create the dataset, 12 native speakers from the 8 divisions, as well as one additional district, contributed word samples. The dataset has been reviewed and evaluated by 9 authentic speakers from each division to ensure its accuracy and proper representation of the regional language variations. The data is stored in XLSX format, in the file Bangla RDS.xlsv. Sheet 1, the region-wise sheet, holds the data as collected. A second sheet, named "categorize data", holds the evaluated data, categorized and organized according to the common chaste (standard) words.

The dataset has several potential applications. It can support systems that automatically detect regional variations in Bengali text, enabling better localization and understanding of regional dialects. It can help reduce communication barriers caused by accent differences within Bangladesh by offering a more standardized understanding of regional variations. It can also be used to translate regional words into standard Bengali (Chaste Bengali), making it easier for people to understand each other. More broadly, it supports research into linguistic diversity and provides a foundation for future work in speech and text processing.

Looking forward, the dataset can be enriched with voice data, which would support more advanced research in areas such as speech recognition, accent detection, and machine translation for regional language variants.
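The detection and translation uses described above can be illustrated with a minimal sketch. It assumes each row of the spreadsheet can be reduced to a (region, regional word, standard word) triple; that layout, the function names, and the placeholder words below are all hypothetical and are not taken from the dataset itself. With the real file, the rows would first have to be read from the workbook (for example with a spreadsheet library) and mapped onto this shape.

```python
# Placeholder rows standing in for the spreadsheet contents.
# ASSUMPTION: the (region, regional_word, standard_word) layout and the
# words themselves are hypothetical, not taken from the BRWDS file.
ROWS = [
    ("Sylhet", "regional_word_a", "standard_word_1"),
    ("Chittagong", "regional_word_b", "standard_word_1"),
    ("Barishal", "regional_word_c", "standard_word_2"),
]


def build_lookup(rows):
    """Map each regional word to its standard word and the regions using it.

    If a regional word appears more than once, the first standard word
    seen is kept and the regions are accumulated.
    """
    lookup = {}
    for region, regional, standard in rows:
        entry = lookup.setdefault(
            regional, {"standard": standard, "regions": set()}
        )
        entry["regions"].add(region)
    return lookup


def annotate(tokens, lookup):
    """Detection: return (token, standard word or None, regions) per token."""
    result = []
    for token in tokens:
        hit = lookup.get(token)
        if hit is None:
            result.append((token, None, []))
        else:
            result.append((token, hit["standard"], sorted(hit["regions"])))
    return result


def normalize(tokens, lookup):
    """Translation: replace known regional words with their standard form."""
    return [lookup[t]["standard"] if t in lookup else t for t in tokens]


if __name__ == "__main__":
    table = build_lookup(ROWS)
    sentence = ["some_word", "regional_word_a"]
    print(annotate(sentence, table))
    print(normalize(sentence, table))
```

An exact-match dictionary like this only covers words that appear in the table verbatim; handling spelling variants or inflected forms would need additional processing that the source does not describe.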


Provider:

doi.org
