Large-Scale Corpus for the Sindhi Language

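Name: Large-Scale Corpus for the Sindhi Language
Creator: School of Computer Science and Engineering, University of Electronic Science and Technology of China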
Published: 2020-12-30 11:50:16
License: No description available

Source: arXiv. Updated: 2020-12-30. Indexed: 2024-08-06.

Download link:

http://arxiv.org/abs/1911.12579v3

Resource description:

This study builds a large corpus of more than 61 million words for Sindhi, a low-resource language. The corpus was collected from multiple web resources with a web crawler and then carefully preprocessed to filter out noisy text. It addresses the lack of large-scale unannotated Sindhi corpora in natural language processing (NLP) and provides a basis for training neural word embeddings. The study also generates Sindhi word embeddings with GloVe and with the word2vec Skip-Gram and Continuous Bag of Words models, and assesses their quality with intrinsic evaluation methods such as a cosine similarity matrix and WordSim-353. The corpus and its associated word embedding models provide important resources and tools for statistical Sindhi language processing (SSLP) applications.
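The intrinsic evaluation mentioned in the description can be illustrated with a minimal sketch. It computes cosine similarity between word vectors and then compares the model's scores with human similarity ratings using Spearman rank correlation, which is how WordSim-style benchmarks are commonly scored. The choice of Spearman correlation is an assumption, since the listing does not state the metric. The vectors, the placeholder words `w1` to `w4`, and the ratings are made-up toy values, not data from the corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Hypothetical 2-dimensional embeddings for four placeholder words.
embeddings = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}

# Hypothetical human ratings in the style of a word-similarity benchmark.
gold_pairs = [("w1", "w2", 9.0), ("w1", "w4", 5.0), ("w1", "w3", 1.0)]

model_scores = [
    cosine_similarity(embeddings[a], embeddings[b]) for a, b, _ in gold_pairs
]
human_scores = [score for _, _, score in gold_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f}")
```

A higher correlation means the embedding's similarity ordering agrees more closely with human judgments. In practice the toy dictionary would be replaced by vectors loaded from a trained model, and pairs containing out-of-vocabulary words would need to be skipped or handled explicitly.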

Provided by:

School of Computer Science and Engineering, University of Electronic Science and Technology of China

Created:

2019-11-28