TLMD: Tigrinya Language Modeling Dataset
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5139093
Description:
A monolingual dataset built for Tigrinya language modeling. To the best of our knowledge, this is the largest dataset for Tigrinya of its kind. The data was collected from various sources across the web including news, blogs, and books. The largest portion of the data, ~75%, comes from over 2150 issues of the Haddas Ertra newspaper and other magazines published by www.shabait.com.
Data Statistics:
Total size: ~0.5GB
Around 40 million tokens
Over 2 million lines
367 unique characters
Train split: 98%, 1.97 million lines
Validation split: 2%, 43k lines
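Figures like the ones above can be recomputed from the raw text. The following is a minimal sketch, not the authors' script: it assumes one sentence or passage per line, counts tokens by splitting on whitespace, and counts every distinct character except the line terminator. The source does not state how its token or character counts were obtained, so results may differ slightly.

```python
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines, whitespace-delimited tokens, and distinct characters.

    Assumptions (not confirmed by the dataset description):
    tokens are whitespace-separated, and the trailing newline
    is not counted as a character.
    """
    n_lines = 0
    n_tokens = 0
    chars = set()
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_tokens += len(text.split())
        chars.update(text)
    return {"lines": n_lines, "tokens": n_tokens, "unique_chars": len(chars)}


if __name__ == "__main__":
    # Tiny in-memory sample; for the real data, pass an open file handle
    # (opened with encoding="utf-8") instead of this list.
    sample = ["ሰላም ዓለም\n", "ሰላም\n"]
    print(corpus_stats(sample))
```

Because the function accepts any iterable of strings, a file object can be streamed through it line by line without loading the whole ~0.5GB corpus into memory.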
We applied a lightweight cleanup to the data:
- Removal of Tigrinya text with legacy and non-standard encoding systems
- Normalization of punctuation and special characters
- Removal of redundant white spaces and empty lines
- Rejoining or fixing broken sentences when possible
- Removal of foreign words
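Two of the steps listed above, punctuation normalization and removal of redundant white space and empty lines, could be sketched as follows. This is an illustration only: the authors do not publish their rules, so the `PUNCT_MAP` entries (for example, rewriting two Ethiopic wordspace marks as an Ethiopic full stop) and the function names are assumptions.

```python
import re
from typing import Iterable, Iterator

# Hypothetical normalization table; the dataset's actual rules are not published.
PUNCT_MAP = {
    "\u1361\u1361": "\u1362",  # two Ethiopic wordspaces -> Ethiopic full stop
    "\u201c": '"',             # left curly double quote -> straight quote
    "\u201d": '"',             # right curly double quote -> straight quote
    "\u2018": "'",             # left curly single quote -> apostrophe
    "\u2019": "'",             # right curly single quote -> apostrophe
}

_WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize punctuation, then collapse runs of whitespace to one space."""
    for src, dst in PUNCT_MAP.items():
        line = line.replace(src, dst)
    return _WHITESPACE.sub(" ", line).strip()


def clean_corpus(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned lines, skipping any that are empty after cleaning."""
    for line in lines:
        cleaned = clean_line(line)
        if cleaned:
            yield cleaned
```

Steps such as detecting legacy encodings, rejoining broken sentences, or filtering foreign words need language-specific heuristics and are deliberately left out of this sketch.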
We avoid applying any tokenization, extensive cleanup, or other preprocessing so as not to remove potentially useful information. Those decisions are left to the researchers or developers, depending on their use case.
This dataset is shared solely to advance research on natural language processing for Tigrinya. While the dataset authors do not claim any copyright on the content, some of the original sources may. To use the content for commercial purposes or to redistribute the data in other forms, permission must be obtained from the original owners, mainly shabait.com.
Created:
2021-10-20



