SeLLaI Mal-Eng: Sentence Level Language Identification of Malayalam-English Code-Mixed Text

Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-27.
Download link:
https://data.mendeley.com/datasets/5p4zbpy8wz
Description:
SeLLaI Mal-Eng is a curated and annotated dataset for sentence-level language identification in Malayalam-English code-mixed text. It comprises 22,400 sentences, all written in the English (Latin) alphabet. The dataset file has two columns: sentence and language. The language column takes one of three classes: Manglish, Code-Mixed, or English. Malayalam sentences written in the English alphabet are annotated as Manglish; such sentences may consist solely of Malayalam words or of a combination of Malayalam and English words. Sentences containing hybrid words, in which Malayalam suffixes are attached to English words or parts of English words to aid comprehension for Malayalam speakers, are annotated as Code-Mixed. Sentences that are in English and easily recognizable by English speakers are annotated as English.
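The two-column layout described above can be read and checked with a few lines of Python. This is only a minimal sketch: the description does not state the file format or the exact header spelling, so the CSV format and the lowercase headers `sentence` and `language` are assumptions, and the rows below are invented placeholders rather than sentences from the dataset.

```python
import csv
import io
from collections import Counter

# The three class labels named in the dataset description.
LABELS = {"Manglish", "Code-Mixed", "English"}


def load_rows(handle):
    """Read (sentence, language) pairs and reject any unexpected label."""
    rows = []
    for record in csv.DictReader(handle):
        # Assumed header names: "sentence" and "language".
        label = record["language"].strip()
        if label not in LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        rows.append((record["sentence"].strip(), label))
    return rows


# Placeholder rows invented for illustration; NOT taken from the dataset.
SAMPLE = """sentence,language
placeholder sentence one,Manglish
placeholder sentence two,Code-Mixed
placeholder sentence three,English
placeholder sentence four,English
"""

rows = load_rows(io.StringIO(SAMPLE))

# Count how many sentences fall into each class.
counts = Counter(label for _, label in rows)
print(counts)
```

To run this against the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, adjusting the header names if the downloaded file spells them differently.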
Created:
2024-01-23