Nepali Text Corpus
Source: Mendeley Data. Updated 2024-01-31. Indexed 2024-06-29.
Download link:
https://ieee-dataport.org/open-access/nepali-text-corpus
Description:
Ongoing work in Natural Language Processing (NLP) for Nepali shows that the use of Artificial Intelligence and NLP on this Devanagari-script language still has a long way to go. The Nepali language is complex and requires multi-dimensional approaches for pre-processing unstructured text and training machines to comprehend the language competently. There was a need for a comprehensive Nepali text corpus containing texts from domains such as News, Finance, Sports, Entertainment, Health, Literature, and Technology. To address this need, a Nepali text corpus of over 90 million running words (6.5+ million sentences) was compiled. Then, 300-dimensional word vectors for more than 0.5 million Nepali words/phrases were computed. The word vectors can be downloaded from here: http://dx.doi.org/10.21227/dz6s-my90.

### The Work

The texts for the corpus were collected by scraping news portals freely available in the public domain. Arbitrary websites and blogs could not be scraped because the same word is often written in several different ways, so it was necessary to bring some consistency to the collected texts. It was observed that authoritative Nepali news portals such as Kantipur, Nagariknews, Setopati, etc. mitigated this problem by a significant margin. These portals used the most common wording in their articles and had relatively fewer grammatical errors and less ambiguity in language structure.

The news portals scraped for the text corpus were: Ekantipur, Nagariknews, Setopati, Onlinekhabar, Karobardaily, Ratopati, News24nepal, Reportersnepal, Baahrakhari, Hamrokhelkud and Aakarpost. The articles posted on these portals are copyrighted by their respective publishers. The compiled text corpus should only be used for non-commercial research projects.

### Designing the corpus

A web crawler was written in Python.
The crawler downloaded the source files and stored them on the local machine in a separate folder for each website. The source files in a folder were merged into blocks for further processing. A block file contained 1000 article pages. Once a block was created, the following items were removed from the file: HTML elements, English letters and digits (a-zA-Z0-9), symbols, and unnecessary spaces.

### Stemming

Finding root words in the Nepali language is a complex task. However, a satisfactory result can be achieved by slicing off trailing Devanagari characters. For this purpose, a dictionary of trailing characters should be maintained, and each tokenized word in the corpus obtained from the pre-processing phase above should be checked for the presence of these trailing characters.

The NLTK library does not support the Nepali language out of the box. However, a small workaround allows the NLTK tokenize function to be used on the Nepali text corpus. In Nepali, sentences are separated by any of these characters: 'purnabiram' (।), 'question mark', 'exclamation sign'. These characters can be replaced with a 'dot' (.) so that NLTK tokenizes the text corpus at both sentence and word level.

### Word Vectors

The text corpus was used to find the most frequently used words (stop words) in the Nepali language. The top 1500 most frequent words were extracted. Tokenized words from the corpus that were present in this stop-word list were removed. After stemming and stop-word removal, the pre-processed corpus was ready for the computation of word vectors.

After pre-processing, the text corpus contained 6.5+ million sentences and 90+ million words. The 300-dimensional word vectors (Word2Vec) were computed using Gensim with the following properties: Architecture: Continuous BOW (CBOW); Training algorithm: Negative sampling = 15; Context (window) size: 10; Token minimum count: 2.
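
The cleaning, sentence-splitting, suffix-stripping and stop-word steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the suffix list `SUFFIXES`, the function names, the minimum stem length, and the order of stop-word removal and stemming are hypothetical and not taken from the dataset authors. A regular-expression split stands in for NLTK here so the example runs without extra downloads.

```python
import re
from collections import Counter

# Hypothetical trailing-character list for illustration only; the authors'
# actual dictionary of trailing characters is not part of the description.
SUFFIXES = ["हरू", "लाई", "बाट", "को", "मा", "ले"]


def clean_block(raw):
    """Remove HTML elements, a-zA-Z0-9, symbols and unnecessary spaces."""
    text = re.sub(r"<[^>]+>", " ", raw)        # HTML elements
    text = re.sub(r"[a-zA-Z0-9]", " ", text)   # English letters and digits
    # Keep Devanagari code points (U+0900-U+097F, which include the
    # purnabiram U+0964), '?', '!' and whitespace; drop other symbols.
    text = re.sub(r"[^\u0900-\u097F?!\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()   # collapse extra spaces


def split_sentences(text):
    """Replace purnabiram, '?' and '!' with '.', then split on '.'."""
    unified = re.sub(r"[\u0964?!]", ".", text)
    return [s.strip() for s in unified.split(".") if s.strip()]


def strip_suffix(word):
    """Slice off one known trailing suffix, trying longer suffixes first."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Assumption: leave at least two code points of stem behind.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def top_frequent(tokens, n):
    """Return the n most frequent tokens as a stop-word set."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def preprocess(raw, n_stop):
    """Clean, tokenize, drop the n_stop most frequent words, then stem."""
    sentences = [s.split() for s in split_sentences(clean_block(raw))]
    stop_words = top_frequent([w for s in sentences for w in s], n_stop)
    return [
        [strip_suffix(w) for w in sentence if w not in stop_words]
        for sentence in sentences
    ]
```

For the corpus described here, `n_stop` would be 1500. The output is a list of token lists, one per sentence, which is the input shape Gensim's `Word2Vec` accepts.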
The resulting Word2Vec model for the Nepali language is a 1.8 GB text file (filetype: txt) encoded in UTF-8. The model can be loaded with the binary option set to false.
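
To illustrate what a non-binary (text) Word2Vec file looks like, here is a minimal standard-library reader. It assumes the file follows the usual word2vec text layout (a header line with vocabulary size and dimensionality, then one token and its vector per line), which the description does not state explicitly. The function name `read_word2vec_text` and the tiny sample file are hypothetical.

```python
import os
import tempfile


def read_word2vec_text(path, limit=None):
    """Read a word2vec-style UTF-8 text file into a dict of token -> vector.

    Assumed layout: first line "<vocab_size> <dimensions>", then one line
    per token with space-separated vector components.
    """
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        _vocab_size, dim = map(int, fh.readline().split())
        for line in fh:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the assumed layout
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break  # a 1.8 GB file may be too large to read in full
    return dim, vectors


# Tiny stand-in file: 2 tokens, 3 dimensions (the real vectors have 300).
sample = "2 3\nनेपाल 0.1 0.2 0.3\nभाषा 0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", encoding="utf-8", delete=False
) as tmp:
    tmp.write(sample)

dim, vectors = read_word2vec_text(tmp.name)
os.remove(tmp.name)
print(dim, len(vectors))
```

With Gensim installed, the equivalent call is `KeyedVectors.load_word2vec_format(path, binary=False)`. In Gensim 4.x, the training settings listed above would roughly correspond to `Word2Vec(sentences, vector_size=300, sg=0, negative=15, window=10, min_count=2)`. The Gensim version used by the authors is not stated, and older releases name the first parameter `size`.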
Created:
2024-01-31



