Tanushreeeeee/COMI-LINGUA
Source: Hugging Face. Updated 2025-12-12. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Tanushreeeeee/COMI-LINGUA
Description:
COMI-LINGUA is a high-quality Hindi-English code-mixed dataset, manually annotated by three annotators. It is intended as a benchmark for multilingual NLP models and covers five foundational tasks: Language Identification (LID), Matrix Language Identification (MLI), Part-of-Speech (POS) Tagging, Named Entity Recognition (NER), and Machine Translation (MT). The dataset is curated by the Lingo Research Group at IIT Gandhinagar, funded by SERB, and released under the CC BY 4.0 license (cc-by-4.0). It contains bilingual content in Hindi and English.
Provider:
Tanushreeeeee



