TLMD: Tigrinya Language Modeling Dataset

Source: NIAID Data Ecosystem (indexed 2026-03-13)

Download link:

https://zenodo.org/record/5139093

Description:

A monolingual dataset built for Tigrinya language modeling. To the best of our knowledge, this is the largest dataset for Tigrinya of its kind. The data was collected from various sources across the web, including news, blogs, and books. The largest portion of the data, ~75%, comes from over 2150 issues of the Haddas Ertra newspaper and other magazines published by www.shabait.com.

Data Statistics:
- Total size: ~0.5GB
- Around 40 million tokens
- Over 2 million lines
- 367 unique characters
- Train split: 98%, 1.97 million lines
- Validation split: 2%, 43k lines

We have done a light-weight cleanup of the data:
- Removal of Tigrinya text with legacy and non-standard encoding systems
- Normalization of punctuation and special characters
- Removal of redundant white spaces and empty lines
- Rejoining or fixing broken sentences when possible
- Removal of foreign words

We avoid applying any form of tokenization, extensive cleanup, or preprocessing in order not to take away potentially useful information; those decisions are left to the use-case researchers or developers.

This dataset is shared solely to advance research on natural language processing for Tigrinya. While the dataset authors do not claim any copyright on the content, some of the original sources may. To use the content for commercial purposes or to redistribute the data in other forms, permission must be obtained from the original owners, mainly shabait.com.
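As a rough illustration of the whitespace cleanup and the statistics listed above, the following is a minimal sketch, not the dataset authors' actual pipeline. The function names (`clean_lines`, `corpus_stats`, `split_train_valid`) are hypothetical, tokens are assumed to be whitespace-separated, the unique-character count here includes the space character, and the tail-based split is only one simple way to obtain a 98%/2% division.

```python
import re


def clean_lines(lines):
    """Collapse runs of whitespace and drop empty lines (assumed cleanup step)."""
    cleaned = []
    for line in lines:
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


def corpus_stats(lines):
    """Count lines, whitespace-separated tokens, and distinct characters."""
    chars = set()
    tokens = 0
    for line in lines:
        tokens += len(line.split())
        chars.update(line)
    return {"lines": len(lines), "tokens": tokens, "unique_chars": len(chars)}


def split_train_valid(lines, valid_fraction=0.02):
    """Hold out the last `valid_fraction` of lines for validation (hypothetical split)."""
    n_valid = max(1, round(len(lines) * valid_fraction))
    return lines[:-n_valid], lines[-n_valid:]


if __name__ == "__main__":
    raw = ["ሰላም  ዓለም ።", "", "   ", "ሰላም ኤርትራ"]
    cleaned = clean_lines(raw)
    print(corpus_stats(cleaned))
    train, valid = split_train_valid(cleaned)
    print(len(train), len(valid))
```

On the real files, the same functions could be applied to lines read with `open(path, encoding="utf-8")`; the file names inside the Zenodo archive are not stated here, so they are left to the reader.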

Created:

2021-10-20
