
KTC: Kurdish Textbook Corpus

Source: arXiv | Updated: 2019-09-25 | Indexed: 2024-06-21
Download link:
https://github.com/KurdishBLARK/KTC
Resource overview:
The KTC (Kurdish Textbook Corpus) is a fine-grained corpus created by the University of Kurdistan that focuses on the Sorani dialect of Kurdish. It contains 31 K-12 textbooks covering 12 educational subjects, for a total of 693,800 tokens. During construction, the source data was converted from several formats to Unicode and then standardized. KTC is intended to support natural language processing tasks for Kurdish, particularly language modeling and grammatical error correction, and serves as an important resource for further research on and applications of the language.
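
The listing does not say which standardization rules were applied. As a rough illustration only, the sketch below shows one way Sorani text in Arabic script might be standardized and then counted by whitespace tokens. The character mapping (Arabic Yeh to Farsi Yeh, Arabic Kaf to Keheh, Tatweel removed) and the function names `standardize` and `count_tokens` are assumptions made for this example, not the corpus authors' documented procedure.

```python
import unicodedata

# Hypothetical mapping for illustration; the rules actually used for KTC
# are not stated in this listing.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (elongation mark) is dropped
})


def standardize(text: str) -> str:
    """Apply NFC normalization, the assumed character mapping,
    and collapse runs of whitespace into single spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(CHAR_MAP)
    return " ".join(text.split())


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens after standardization.
    This is a simplification; the corpus may define tokens differently."""
    return len(standardize(text).split())
```

A whitespace tokenizer is the simplest choice and will not necessarily reproduce the 693,800 figure, since the tokenization used for the published count is not described here.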
Provided by:
University of Kurdistan
Created:
2019-09-25