Lists of Karakalpak Stopwords
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://zenodo.org/record/7497972
Description:
The dataset presents 3 lists of stopwords in the Karakalpak language. The lists were constructed using three automatic methods applied to the same corpus.
The corpus was built from 23 school textbooks and is named the "Karakalpak School Corpus". It can be reconstructed from the list of URLs of all files in the corpus, which is included in the dataset (list_of_urls_for_karakalpak_school_corpus.txt).
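As a rough illustration of reconstructing the corpus from the URL list, the sketch below reads the list and downloads each file. It assumes the file holds one URL per line in UTF-8, which the description does not confirm. The function names, the output directory name, and the indexed file-naming scheme are hypothetical choices for this example.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_url_list(path):
    """Return the non-empty, stripped lines of the URL list file.

    Assumption: one URL per line, UTF-8 encoded.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def download_corpus(urls, out_dir, fetch=urlretrieve):
    """Download every URL into out_dir and return the local paths.

    `fetch(url, filename)` defaults to urllib's urlretrieve; passing another
    callable allows a dry run without network access.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        # Use the last path component as the file name and prefix an index,
        # so two URLs with the same basename do not overwrite each other.
        name = Path(urlparse(url).path).name or "file"
        target = out / f"{index:03d}_{name}"
        fetch(url, str(target))
        saved.append(target)
    return saved


# Example use (performs real downloads, so it is left commented out):
# urls = read_url_list("list_of_urls_for_karakalpak_school_corpus.txt")
# download_corpus(urls, "karakalpak_school_corpus")
```

The downloaded files would still need text extraction before any stopword method could be applied; that step depends on the file formats behind the URLs and is not covered here.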
Description of the methods and the lists:
A set of grammar rules and the TF-IDF algorithm were used to automatically collect a list of single-word stopwords. 4014 stopwords were collected. The name of the file: Karakalpak_stopwords_unigrams.txt.
A bigram method was used to extract a list of 3740 bigrams (pairs) of stopwords. The name of the file: Karakalpak_stopwords_bigram.txt.
A list of two-word collocations treated as stopwords was also extracted. The list has 20745 pairs of stopwords. The name of the file: Karakalpak_stopwords_bigrams_with_collocations.txt.
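The lists above could be applied to tokenized text roughly as follows. This is a minimal sketch, not part of the dataset: it assumes each file stores one entry per line in UTF-8 and that bigram entries are two words separated by whitespace. The helper names are hypothetical, and plain `str.lower()` is used for case folding, which may not suit every Karakalpak orthography.

```python
from pathlib import Path


def load_list(path):
    """Read one entry per line, lowercased, skipping blank lines.

    Runs of whitespace inside an entry are collapsed to a single space so
    that bigram entries compare reliably.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {" ".join(line.lower().split()) for line in lines if line.strip()}


def remove_stopwords(tokens, unigrams, bigrams=frozenset()):
    """Drop tokens that are stopwords or belong to a stopword bigram."""
    lowered = [t.lower() for t in tokens]
    drop = [word in unigrams for word in lowered]
    # Mark both members of every adjacent pair found in the bigram list.
    for i in range(len(lowered) - 1):
        if f"{lowered[i]} {lowered[i + 1]}" in bigrams:
            drop[i] = drop[i + 1] = True
    return [tok for tok, d in zip(tokens, drop) if not d]


# Example use, with the file names given in the description:
# unigrams = load_list("Karakalpak_stopwords_unigrams.txt")
# bigrams = load_list("Karakalpak_stopwords_bigram.txt")
# content_tokens = remove_stopwords(tokens, unigrams, bigrams)
```

Loading the entries into sets keeps each lookup constant-time, which matters mostly for the 20745-entry collocation list.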
Created:
2023-03-02



