Czech word analogy corpus
arXiv | Updated 2016-08-02 | Indexed 2024-06-21
Download link:
https://github.com/Svobikl/cz_corpus
Description:
This study presents the Czech word analogy corpus, a dataset created by the Faculty of Applied Sciences, University of West Bohemia, to explore the semantic and syntactic properties of Czech words. The dataset contains 22,257 questions across several semantic and syntactic categories, including antonyms, family relations, and adjective gradation. Building the dataset involved preprocessing Czech Wikipedia data and training on it with the Word2Vec and GloVe algorithms. The dataset is intended mainly for evaluating and improving word embedding models in natural language processing (NLP), particularly for morphologically rich languages such as Czech.
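As an illustration of how an analogy question of the form "a is to b as c is to d" can be scored against word embeddings, here is a minimal sketch. It uses the vector-offset method (often called 3CosAdd), which is a common choice for word analogy corpora but is not confirmed by this page as the method used with this dataset. The vectors, the word list, and the function names are hypothetical toy values chosen for the example; they are not taken from the corpus or from any trained model, and the corpus file format is not assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def solve_analogy(emb, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of questions answered correctly; out-of-vocabulary questions are skipped."""
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Hypothetical 3-dimensional toy vectors, hand-picked so the offsets line up.
TOY_EMB = {
    "muž": [1.0, 0.0, 0.0],
    "žena": [1.0, 1.0, 0.0],
    "král": [1.0, 0.0, 1.0],
    "královna": [1.0, 1.0, 1.0],
    "pes": [0.0, 0.0, 1.0],
}

# Hypothetical questions in (a, b, c, expected d) form.
TOY_QUESTIONS = [
    ("muž", "žena", "král", "královna"),
    ("král", "královna", "muž", "žena"),
]

if __name__ == "__main__":
    print(analogy_accuracy(TOY_EMB, TOY_QUESTIONS))
```

With real embeddings, the same accuracy could be computed separately for each category (for example antonyms or family relations) to compare how a model handles semantic versus syntactic questions.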
Provider:
Faculty of Applied Sciences, University of West Bohemia
Created:
2016-08-02



