hmBlogs
Source: arXiv · Updated: 2021-11-04 · Indexed: 2024-06-21
Download link:
http://fs.nlp.sbu.ac.ir/members/motahari/metr/papers/hmBlogs/
Description:
hmBlogs is a large-scale Persian corpus created by the Natural Language Processing Research Lab and the Department of Computer Science and Engineering at Shahid Beheshti University. It is built from nearly 20 million blog posts spanning about 15 years of the Persian blogosphere and contains more than 6.8 billion tokens. In addition to the raw text, hmBlogs provides preprocessed text intended for training word embedding models. In comparisons with other prominent Persian corpora, hmBlogs performed better on several evaluations. Intended uses include language model training, semantic analysis, and the evaluation of word embedding models, with the broader aim of addressing the challenges Persian faces as a low-resource language in natural language processing.
Provided by:
Natural Language Processing Research Lab, Department of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
Created:
2021-11-04



