
Lists of Karakalpak Stopwords

Source: NIAID Data Ecosystem, indexed 2026-03-14
Download link:
https://zenodo.org/record/7497972
Description:
The dataset contains three lists of stopwords for the Karakalpak language. The lists were built by applying three automatic methods to the same corpus. The corpus, named "Karakalpak School Corpus", was compiled from 23 school textbooks. It can be reconstructed from the list of URLs of all files in the corpus, which is included in the dataset (list_of_urls_for_karakalpak_school_corpus.txt).

Methods and lists:
1. Single-word stopwords: a set of grammar rules and the TF-IDF algorithm were used to collect 4014 stopwords automatically. File: Karakalpak_stopwords_unigrams.txt.
2. Bigram stopwords: a bigram method was used to extract 3740 bigrams (pairs) of stopwords. File: Karakalpak_stopwords_bigram.txt.
3. Collocations as stopwords: a set of two-word collocations treated as stopwords was also extracted, giving 20745 pairs. File: Karakalpak_stopwords_bigrams_with_collocations.txt.
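As an illustration of how the lists might be used, here is a minimal Python sketch that loads a stopword file and filters a token sequence. The file layout is an assumption, since the description does not specify it: one entry per line, UTF-8 encoded, with the two words of a bigram separated by a single space. The helper names `load_stopwords` and `remove_stopwords` are hypothetical, and the demo tokens are toy placeholders rather than entries from the dataset.

```python
from __future__ import annotations

from pathlib import Path
from typing import AbstractSet, Sequence


def load_stopwords(path: str) -> set[str]:
    """Load a stopword list, assuming one entry per line in UTF-8.

    Entries are stripped and lowercased; blank lines are skipped.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(
    tokens: Sequence[str],
    unigrams: AbstractSet[str],
    bigrams: AbstractSet[str] = frozenset(),
) -> list[str]:
    """Drop stopword bigrams first, then single-word stopwords.

    A bigram entry is assumed to be two words joined by one space.
    """
    kept: list[str] = []
    i = 0
    while i < len(tokens):
        # Check the pair starting at position i against the bigram list.
        if i + 1 < len(tokens):
            pair = f"{tokens[i]} {tokens[i + 1]}".lower()
            if pair in bigrams:
                i += 2
                continue
        # Otherwise fall back to the single-word list.
        if tokens[i].lower() not in unigrams:
            kept.append(tokens[i])
        i += 1
    return kept


# Toy demonstration with placeholder entries (not taken from the dataset).
demo_tokens = ["alpha", "of", "the", "beta", "and", "gamma"]
demo_result = remove_stopwords(demo_tokens, {"and"}, {"of the"})
print(demo_result)
```

With the real files, the sets would come from calls such as `load_stopwords("Karakalpak_stopwords_unigrams.txt")` and `load_stopwords("Karakalpak_stopwords_bigram.txt")`, provided the files follow the assumed layout. Checking bigrams before unigrams is a design choice made here so that a listed pair is removed as a unit even when its individual words are not single-word stopwords.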

Created:
2023-03-02