WoLLaI Mal-Eng: Word Level Language Identification of Malayalam-English Code-Mixed Text

Source: DataCite Commons. Updated: 2025-04-01. Indexed: 2025-04-16.
Download link:
https://data.mendeley.com/datasets/tzrcrrwz4n
Description:
WoLLaI Mal-Eng is a curated, annotated dataset for word-level language identification in Malayalam-English code-mixed text. It contains 12,402 tokenized sentences. The dataset file has three columns: sentence#, words, and language. Each word is annotated with one of four language classes: Mal, Eng, Mix, or Othr. Words that belong to the Malayalam language and are recognized by Malayalam speakers are annotated as Mal. Words that belong to the English language and are easily recognized by English speakers are annotated as Eng. Words formed by attaching Malayalam suffixes to English words or parts of English words, which makes them easier for Malayalam speakers to understand, are annotated as Mix. Remaining tokens, such as numbers, abbreviations, and named entities, are annotated as Othr.
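
The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not the dataset authors' tooling: it assumes the file is a CSV whose header uses the column names sentence#, words, and language exactly as written in the description, and the actual file format and header spelling may differ. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter, defaultdict

# The four language classes named in the dataset description.
VALID_TAGS = {"Mal", "Eng", "Mix", "Othr"}

# Invented toy rows (NOT from the dataset), using the assumed CSV header.
SAMPLE = """sentence#,words,language
1,ente,Mal
1,phone,Eng
1,caril,Mix
2,2024,Othr
2,phone,Eng
"""


def load_sentences(handle):
    """Group (word, tag) pairs by sentence id, rejecting unknown tags."""
    sentences = defaultdict(list)
    for row in csv.DictReader(handle):
        tag = row["language"].strip()
        if tag not in VALID_TAGS:
            raise ValueError(f"unexpected language tag: {tag!r}")
        sentences[row["sentence#"].strip()].append((row["words"].strip(), tag))
    return dict(sentences)


def tag_counts(sentences):
    """Count how many tokens carry each language tag."""
    return Counter(tag for tokens in sentences.values() for _, tag in tokens)


sentences = load_sentences(io.StringIO(SAMPLE))
counts = tag_counts(sentences)
print(len(sentences), dict(counts))
```

To use it on the real file, replace the in-memory sample with an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded file and the header names have been checked against it.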
Provider:
Mendeley Data
Created:
2024-01-15