Tigrinya Diverse Genre Corpus (TiDG) for Text Categorization
Source: IEEE; indexed 2026-04-17
Download link:
https://ieee-dataport.org/documents/tigrinya-diverse-genre-corpus-tidg-text-categorization-0
Description:
The Tigrinya Diverse Genre Corpus (TiDG) is a labeled multiclass text corpus for Tigrinya, a low-resource Semitic language. It contains 8,067 documents distributed across seven classes, and each document is provided both in cleaned Tigrinya script and in a SERA-transliterated variant. The labeled data is split into training, evaluation, and test sets to support reliable assessment of machine learning models. TiDG supports applications such as automatic chunking of classifiable texts, topic-based document recommendation, intent detection in user-generated text, and focus-based information extraction. Label encoder code is included as a simple aid for converting class labels into numeric representations during model development. By providing a well-annotated benchmark corpus for Tigrinya, TiDG fills a significant gap in resources for minority-language research and supports the development and evaluation of text classification models. It enables new solutions, unbiased comparison of machine learning methods, and greater representation of minority languages in multilingual NLP. In this way, TiDG contributes to the development of more inclusive language tools that promote linguistic diversity and digital access.
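The label-encoding step mentioned in the description can be illustrated with a minimal sketch. This is not the code shipped with TiDG: the function names are hypothetical, and the class names below are placeholders, since the listing does not name the seven classes.

```python
from typing import Dict, List


def fit_label_encoder(labels: List[str]) -> Dict[str, int]:
    """Map each distinct class label to an integer id, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def encode(labels: List[str], mapping: Dict[str, int]) -> List[int]:
    """Convert string labels to their integer ids."""
    return [mapping[label] for label in labels]


def decode(ids: List[int], mapping: Dict[str, int]) -> List[str]:
    """Convert integer ids back to the original string labels."""
    inverse = {idx: label for label, idx in mapping.items()}
    return [inverse[i] for i in ids]


# Placeholder class names (assumption): the real TiDG class names are not
# given in this listing.
sample_labels = ["sports", "politics", "sports", "health"]

mapping = fit_label_encoder(sample_labels)
encoded = encode(sample_labels, mapping)
decoded = decode(encoded, mapping)
```

The mapping should be fitted on the training labels only and then reused unchanged for the evaluation and test sets, so that each class keeps the same numeric id across all three splits.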
Provided by:
Huansheng Ning; Doreen Sebastian Sarwatt; Daniel Tesfai Gebretatios; Jianguo Ding



