abdulhade/Kurdishcorpus
Source: Hugging Face. Updated 2025-08-27; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/abdulhade/Kurdishcorpus
Description:
KurCorpus 2B is a multidialectal Kurdish text corpus of more than 2 billion tokens, intended for large-scale language model training and downstream NLP tasks. It covers three dialect groups: Sorani (ckb), Kurmanji/Badini (kmr), and Hawrami/Gorani (hac). The text has been normalized and cleaned, making it suitable for pretraining and fine-tuning Kurdish language models.
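A corpus of this size is usually read in streaming mode rather than downloaded in full. The following is a minimal sketch of how that might look with the Hugging Face `datasets` library. The repository ID is taken from the listing above; the split name `"train"` and the helper names `stream_corpus` and `peek` are assumptions made for illustration, and the actual splits and column names should be checked on the dataset page before use.

```python
from itertools import islice

# Repository ID as given in this listing.
REPO_ID = "abdulhade/Kurdishcorpus"

# Dialect codes as named in the description above.
DIALECTS = {
    "ckb": "Sorani",
    "kmr": "Kurmanji/Badini",
    "hac": "Hawrami/Gorani",
}


def stream_corpus(repo_id=REPO_ID, split="train"):
    """Open the dataset lazily so the full corpus is not downloaded up front.

    Assumption: the dataset exposes a split called "train"; adjust if not.
    """
    # Imported here so the rest of the module works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def peek(records, n=3):
    """Return the first n records from any iterable of records."""
    return list(islice(records, n))


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for record in peek(stream_corpus(), n=3):
        # Column names are not stated in the listing, so print the whole record.
        print(record)
```

Printing whole records first is a deliberate choice here: it reveals the real column names before any tokenization or filtering code is written against them.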
Provider:
abdulhade



