Kasaragod Dialect Translation Dataset: Mapping Regional Dialect Words to Standard Malayalam and English
Source: DataCite Commons | Updated: 2026-03-12 | Indexed: 2026-05-03
Download link:
https://ieee-dataport.org/documents/kasaragod-dialect-translation-dataset-mapping-regional-dialect-words-standard-malayalam
Description:
This dataset presents a structured trilingual lexical resource for the Kasaragod Malayalam dialect, a regional dialect spoken in the Kasaragod district of Kerala, India. Kasaragod is widely recognized as the “Land of Seven Languages,” where several languages coexist and interact in everyday communication. These languages include Malayalam, Kannada, Tulu, Konkani, Beary, Urdu, and Marathi. The continuous interaction among these languages has led to the evolution of the Kasaragod Malayalam dialect, which shows distinctive phonetic, lexical, and semantic variations compared with Standard Malayalam.

The dataset contains 1,577 lexical entries, where each entry maps a Kasaragod dialect word to its Standard Malayalam equivalent and its English translation. The dataset was compiled from multiple sources to ensure linguistic accuracy and cultural authenticity. A major source of dialect vocabulary is the regional dictionary Ponjar Nattubasha Nighandu by Ambikhasuthan Mangad, which documents several traditional Kasaragod dialect expressions. Additional data were collected through field interactions with native speakers, linguistic validation by local experts, and analysis of dialect usage in digital media and regional literature.

This dataset can support research in Natural Language Processing (NLP), low-resource language processing, dialect translation, multilingual lexical analysis, and linguistic documentation. It is particularly useful for developing machine learning models for dialect translation, character-level language modeling, and multilingual language technologies. Furthermore, the dataset contributes to the preservation and digital documentation of the Kasaragod dialect and supports the development of inclusive language technologies for linguistically diverse communities.

By providing one of the first structured computational resources for the Kasaragod dialect, this dataset aims to facilitate research on dialectal variation, multilingual linguistic ecosystems, and low-resource language technologies.
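The three-way mapping described above (dialect word, Standard Malayalam equivalent, English translation) could be loaded into a simple lookup table. The following is a minimal sketch only: the CSV layout, the column names `dialect`, `standard_malayalam`, and `english`, and the placeholder rows are all assumptions made for illustration, not the published file format or real entries from the dataset.

```python
import csv
import io

# Hypothetical layout: the real file's format and column names may differ.
# The rows below are placeholders, not actual dataset entries.
SAMPLE_CSV = """dialect,standard_malayalam,english
dialect_word_a,standard_word_a,meaning a
dialect_word_b,standard_word_b,meaning b
"""


def load_lexicon(handle):
    """Read trilingual rows into a dict keyed by the dialect word."""
    lexicon = {}
    for row in csv.DictReader(handle):
        key = row["dialect"].strip()
        lexicon[key] = (
            row["standard_malayalam"].strip(),
            row["english"].strip(),
        )
    return lexicon


def translate(lexicon, word):
    """Return (standard_malayalam, english) for a dialect word, or None if absent."""
    return lexicon.get(word.strip())


lexicon = load_lexicon(io.StringIO(SAMPLE_CSV))
print(translate(lexicon, "dialect_word_a"))
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by a file handle opened with `encoding="utf-8"` and `newline=""`, since the entries contain Malayalam script.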
Provider:
IEEE DataPort
Created:
2026-03-12



