
Word2Vec embeddings for Serbo-Croatian

DataCite Commons | Updated 2022-06-01 | Indexed 2024-07-13
Download link:
https://live.european-language-grid.eu/catalogue/ld/17358
Resource description:
<p>This Word2Vec model for Serbo-Croatian was trained on the processed Serbo-Croatian (SH) Wikipedia dump of May 2020. Processing consisted of parsing the Wikipedia dump (downloaded from https://dumps.wikimedia.org/shwiki/latest/), Part-of-Speech (PoS) tagging, and lemmatization.<br><br>The dump was parsed into a raw corpus of 637,048 articles using the WikiExtractor library (https://github.com/attardi/wikiextractor).<br><br>First, a FastText model (https://radimrehurek.com/gensim/models/fasttext.html) was trained on the raw tokenized articles to supply features for PoS tagging. More precisely, we trained a Scikit-learn Multilayer Perceptron (MLP) as a PoS tagger on the ParCoTrain corpus (http://redac.univ-tlse2.fr/corpus/parcotrain_en.html). FastText embeddings of the ParCoTrain tokens served as input features for the MLP, with the PoS tags as the classes to predict. The PoS tagger achieved an F1 score of 0.9.</p><p>After PoS tagging the Wikipedia articles, we lemmatized them using the morpho-syntactic lexicon for Serbian (https://github.com/aleksandra-miletic/serbian-nlp-resources). Each word form was looked up in the lexicon together with its PoS tag and resolved to a lemma. Of the 172 million tokens across all PoS tags, 54.22% were lemmatized.<br>To train Word2Vec, we used the Gensim library (https://radimrehurek.com/gensim/models/word2vec.html). The embedding dimension was set to 100 and the window size to 5. The published model has a vocabulary of 1,463,334 words.</p>
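The lexicon look-up step described above can be illustrated with a minimal sketch. It is not the authors' code: the toy `LEXICON`, its tag names, the lowercasing of word forms, and the choice to keep unresolved tokens in their surface form are all assumptions made for illustration. The real morpho-syntactic lexicon has its own format and tagset.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon mapping (word form, PoS tag) -> lemma.
# The entries and tag names are illustrative assumptions only.
LEXICON: Dict[Tuple[str, str], str] = {
    ("gradovi", "NOUN"): "grad",
    ("gradu", "NOUN"): "grad",
    ("velikog", "ADJ"): "velik",
    ("piše", "VERB"): "pisati",
}


def lemmatize(
    tagged_tokens: List[Tuple[str, str]],
    lexicon: Dict[Tuple[str, str], str],
) -> Tuple[List[str], float]:
    """Resolve each (form, tag) pair to a lemma via lexicon look-up.

    Returns the lemma sequence and the fraction of tokens that were found
    in the lexicon. Tokens without an entry keep their surface form
    (an assumption; the source does not say how misses were handled).
    """
    lemmas: List[str] = []
    hits = 0
    for form, tag in tagged_tokens:
        lemma = lexicon.get((form.lower(), tag))
        if lemma is None:
            lemmas.append(form)
        else:
            lemmas.append(lemma)
            hits += 1
    coverage = hits / len(tagged_tokens) if tagged_tokens else 0.0
    return lemmas, coverage


if __name__ == "__main__":
    sample = [("Gradovi", "NOUN"), ("su", "AUX"), ("velikog", "ADJ"), ("značaja", "NOUN")]
    lemmas, coverage = lemmatize(sample, LEXICON)
    print(lemmas, f"{coverage:.2%}")
```

Keying the lexicon on the pair of word form and PoS tag, rather than on the form alone, lets the tag disambiguate forms that map to different lemmas. The coverage value plays the same role as the 54.22% figure reported in the description.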
Provider:
ELG
Created:
2022-06-01