LScDC (Leicester Scientific Dictionary-Core)
Source: figshare.le.ac.uk | Updated: 2020-04-15 | Indexed: 2025-01-21
Download link:
https://figshare.le.ac.uk/articles/dataset/LScDC_Leicester_Scientific_Dictionary-Core_/9896579/3
Description:
The LScDC (Leicester Scientific Dictionary-Core Dictionary)
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

[Version 3]
The third version of the LScDC (Leicester Scientific Dictionary-Core) is formed using the updated LScD (Leicester Scientific Dictionary) - Version 3*. All steps applied to build the new version of the core dictionary are the same as in Version 2** and can be found in the description of Version 2 below; the explanation is not repeated here. The files provided with this description are also the same as those described for LScDC Version 2. The numbers of words in the third versions of the LScD and LScDC are summarized below.

              # of words
LScD (v3)     972,060
LScDC (v3)    103,998

* Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v3
** Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v2

[Version 2] Getting Started
This file describes a sorted and cleaned list of words from the LScD (Leicester Scientific Dictionary), explains the steps for sub-setting the LScD, and gives basic statistics of words in the LSC (Leicester Scientific Corpus), which can be found in [1, 2]. The LScDC (Leicester Scientific Dictionary-Core) is a list of words ordered by the number of documents containing each word, and is available in the published CSV file. There are 104,223 unique words (lemmas) in the LScDC. This dictionary was created for use in future work on quantifying the sense of research texts.

The objective of sub-setting the LScD is to discard words that appear too rarely in the corpus. In text mining, using an enormous amount of text data challenges both the performance and the accuracy of data mining applications. The performance and accuracy of models depend heavily on the type of words (such as stop words and content words) and the number of words in the corpus. Words that occur rarely in a collection are not useful for discriminating texts in large corpora, since rare words are likely to be non-informative signals (or noise) and redundant in the collection of texts. Selecting relevant words also holds out the possibility of more effective and faster operation of text mining algorithms.

To build the LScDC, we decided on the following process for the LScD: removing words that appear in no more than 10 documents (
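The document-frequency cut-off described above (drop words found in no more than 10 documents, then order the rest by the number of documents containing them) can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function names, the toy corpus, and the alphabetical tie-break are assumptions made for the example, and the sketch skips the cleaning and lemmatization steps behind the real LScD.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # set(): count each word once per document
    return df


def core_dictionary(df, max_rare_docs=10):
    """Drop words appearing in no more than `max_rare_docs` documents.

    Returns (word, document_count) pairs, most widespread words first.
    Ties are broken alphabetically (an assumption of this sketch).
    """
    kept = [(word, n) for word, n in df.items() if n > max_rare_docs]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus: each document is a list of already-lemmatized words.
toy_docs = [["cell", "protein"]] * 11 + [["cell", "zebrafish"]] * 10 + [["quark"]]

df = document_frequencies(toy_docs)
core = core_dictionary(df)

# Share of LScD (v3) words retained in LScDC (v3), from the counts quoted above.
retained_share = 103_998 / 972_060
```

In the toy corpus, "zebrafish" occurs in exactly 10 documents and is therefore discarded, which matches the "no more than 10 documents" wording. Using the Version 3 counts quoted above, the core dictionary retains roughly 10.7% of the LScD words.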
Provider:
University of Leicester



