Tanushreeeeee/COMI-LINGUA

Source: Hugging Face. Updated 2025-12-12; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Tanushreeeeee/COMI-LINGUA
Description:
COMI-LINGUA is a high-quality Hindi-English code-mixed dataset, manually annotated by three annotators. It serves as a benchmark for multilingual NLP models by covering multiple foundational tasks such as Language Identification (LID), Matrix Language Identification (MLI), Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), and Machine Translation (MT). The dataset is curated by the Lingo Research Group at IIT Gandhinagar, funded by SERB, and licensed under cc-by-4.0. It contains bilingual content in Hindi and English.
Provided by:
Tanushreeeeee