hrwac
Source: Opencsg | Updated: 2024-07-19 | Indexed: 2025-05-03
Download link:
https://www.opencsg.com/datasets/AIWizards/hrwac
Description:
The HrWac corpus targets the Croatian language. It was built by crawling the .hr top-level domain and contains data from 2011 and 2014. It is a large corpus, with token counts in the billions. The data has been deduplicated at the paragraph level, normalized through diacritic restoration, part-of-speech tagged, lemmatized, and shuffled at the paragraph level. Each paragraph carries metadata: its URL, its domain, and a language identification label (Croatian vs. Serbian). The corpus is mainly used for tasks such as text generation and Masked Language Modeling, and is licensed under CC-BY-SA 3.0.
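To make the paragraph-level deduplication and shuffling steps concrete, here is a minimal Python sketch. It is not the pipeline used to build HrWac: the record layout (`text`, `url`, `domain`, `lang` keys), the function name `dedup_and_shuffle`, the example URLs, and the choice of exact-match hashing after whitespace normalization are all assumptions made for illustration.

```python
import hashlib
import random


def dedup_and_shuffle(paragraphs, seed=0):
    """Drop paragraphs whose whitespace-normalized text was already seen,
    then shuffle the survivors with a fixed seed.

    `paragraphs` is a list of dicts with hypothetical keys:
    text, url, domain, lang.
    """
    seen = set()
    unique = []
    for para in paragraphs:
        # Collapse runs of whitespace so trivially different copies match.
        normalized = " ".join(para["text"].split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(para)  # the first occurrence is the one kept
    # Shuffle at the paragraph level; a seeded RNG keeps it reproducible.
    random.Random(seed).shuffle(unique)
    return unique


# Hypothetical records; the URLs and language labels are placeholders.
sample = [
    {"text": "Dobar dan.", "url": "http://example.hr/a", "domain": "example.hr", "lang": "hr"},
    {"text": "Dobar   dan.", "url": "http://example.hr/b", "domain": "example.hr", "lang": "hr"},
    {"text": "Laku noć.", "url": "http://example.hr/c", "domain": "example.hr", "lang": "hr"},
]
result = dedup_and_shuffle(sample, seed=1)
```

Because paragraphs are shuffled independently, the original document order cannot be recovered from the output; the per-paragraph URL and domain metadata are what tie each paragraph back to its source.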
Created:
2024-07-19



