Word2Vec embeddings for Serbo-Croatian

DataCite Commons | Updated 2022-06-01 | Indexed 2024-07-13

Download link:

https://live.european-language-grid.eu/catalogue/ld/17358

Resource description:

This Word2Vec model for Serbo-Croatian was trained on the processed Serbo-Croatian (SH) Wikipedia of May 2020. The processing included parsing the Wikipedia dump (downloaded from https://dumps.wikimedia.org/shwiki/latest/), Part-of-Speech (PoS) tagging, and lemmatization. The Wikipedia dump was parsed into a raw corpus of 637,048 articles using the WikiExtractor library (https://github.com/attardi/wikiextractor).

First, a FastText model (https://radimrehurek.com/gensim/models/fasttext.html) was trained on the raw tokenized articles to be used for the PoS tagging. More precisely, we trained a Scikit-learn Multilayer Perceptron (MLP) as a PoS tagger on the ParCoTrain corpus (http://redac.univ-tlse2.fr/corpus/parcotrain_en.html). FastText embeddings of tokens from the ParCoTrain corpus were used as features for the MLP, with PoS tags as the classes to be predicted. The PoS tagger achieved an F1 score of 0.9.

After PoS tagging the Wikipedia articles, we lemmatized them using the morpho-syntactic lexicon for Serbian (https://github.com/aleksandra-miletic/serbian-nlp-resources). Each word form was looked up in the lexicon together with its associated PoS tag and resolved into a lemma. Out of 172 million tokens of all PoS tags, 54.22 % were lemmatized.

For the training of Word2Vec, we used the Gensim library (https://radimrehurek.com/gensim/models/word2vec.html). The embedding dimension was set to 100 and the window size to 5. The trained model we publish contains 1,463,334 words in its vocabulary.
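
The lexicon look-up and the Word2Vec settings described above can be illustrated with a minimal Python sketch. It is not the authors' code. The lexicon layout (a dictionary keyed by word form and PoS tag), the tag names, the toy sentence, and the choice to keep the surface form when no lexicon entry is found are all assumptions made for illustration. The function names `lemmatize` and `train_word2vec` are hypothetical. The training call assumes Gensim 4.x, where the dimension parameter is named `vector_size`; only the dimension (100) and window size (5) come from the description, and every other hyperparameter is left at the library default.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory lexicon: (word form, PoS tag) -> lemma.
# The real lexicon's file format and tag set are not described in the text.
Lexicon = Dict[Tuple[str, str], str]


def lemmatize(
    tagged_tokens: Iterable[Tuple[str, str]], lexicon: Lexicon
) -> Tuple[List[str], float]:
    """Resolve (form, tag) pairs to lemmas and report lexicon coverage."""
    lemmas: List[str] = []
    hits = 0
    for form, tag in tagged_tokens:
        lemma = lexicon.get((form, tag))
        if lemma is None:
            # Assumption: tokens missing from the lexicon keep their surface form.
            lemmas.append(form)
        else:
            lemmas.append(lemma)
            hits += 1
    coverage = hits / len(lemmas) if lemmas else 0.0
    return lemmas, coverage


def train_word2vec(sentences: List[List[str]]):
    """Train Word2Vec with the dimension and window size given in the description."""
    # Imported lazily so the look-up part runs without Gensim installed.
    from gensim.models import Word2Vec  # assumes Gensim 4.x

    return Word2Vec(sentences=sentences, vector_size=100, window=5)


# Toy data, invented for illustration only.
toy_lexicon: Lexicon = {
    ("knjige", "NOUN"): "knjiga",
    ("čita", "VERB"): "čitati",
}
toy_sentence = [("Ana", "PROPN"), ("čita", "VERB"), ("knjige", "NOUN")]

lemmas, coverage = lemmatize(toy_sentence, toy_lexicon)
print(lemmas, round(coverage, 4))
```

In this sketch, the coverage value plays the same role as the reported share of lemmatized tokens: it is the fraction of tokens for which the (form, tag) pair was found in the lexicon. The lists of lemmas, one per sentence or article, would then be passed to `train_word2vec`.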

Provider:

ELG

Created:

2022-06-01
