Vector representations of English words and compounds
Source: DataCite Commons. Updated: 2024-05-19. Indexed: 2024-07-13.
Download link:
https://fdat.uni-tuebingen.de/records/9jj4v-6dj64
Description:
Word representations used in Dima (2019). The vectors were trained on the concatenation of the encow14ax corpus (https://corporafromtheweb.org/) and English Wikipedia (the Müller and Schütze (2015) version), about 9 billion words of text in total. The corpus was also pre-processed for compounds: each compound from the en-comcom dataset was joined with an underscore and treated as a single word, e.g. 'police car' was rewritten as 'police_car'.
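The compound pre-processing described above can be sketched as follows. This is a minimal illustration, not the script used to build the corpus: the function name `link_compounds`, the case-sensitive matching, and the two-item compound list are assumptions made for the example.

```python
import re


def link_compounds(text, compounds):
    """Rewrite each listed compound as one underscore-joined token."""
    if not compounds:
        return text
    # Try longer compounds first so they win over overlapping shorter ones.
    ordered = sorted(compounds, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in ordered) + r")\b"
    )
    # Replace the spaces inside each matched compound with underscores.
    return pattern.sub(lambda m: m.group(1).replace(" ", "_"), text)


# Hypothetical compound list; the real list comes from the en-comcom dataset.
example = link_compounds(
    "a police car passed the bus stop",
    {"police car", "bus stop"},
)
print(example)
```

After this step a compound such as 'police_car' is an ordinary vocabulary item, so it receives its own vector during training.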
The embeddings were trained with a minimum word frequency of 100, giving a vocabulary of 424,014 words. The vocabulary words and their corpus frequencies are listed in the file 'glove_encow14ax_enwiki_9B.400k_min100.vocab'. Word representations are provided in four dimensionalities: 50, 100, 200 and 300.
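A minimal sketch for reading the vocabulary file, assuming it follows the usual GloVe vocabulary layout of one "word count" pair per line, separated by a single space. The helper name `read_vocab` is hypothetical and the sample counts are made up for illustration.

```python
def read_vocab(lines):
    """Parse 'word count' lines into a dict mapping word -> corpus frequency."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last space so the count is always the final field.
        word, count = line.rsplit(" ", 1)
        vocab[word] = int(count)
    return vocab


# Made-up sample lines; real frequencies come from the .vocab file.
sample = ["the 1000000", "police_car 250", "car 90000"]
vocab = read_vocab(sample)

# Compounds are recognisable by the underscore that links their parts.
compounds = [w for w in vocab if "_" in w]
print(compounds)
```

With the real file, the same function can be given an open file handle instead of a list, for example one opened on 'glove_encow14ax_enwiki_9B.400k_min100.vocab' (UTF-8 encoding is an assumption here). Every count should then be at least 100, the minimum frequency stated above.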
The embeddings were trained with GloVe for 15 iterations, using a symmetric 10-word context window (10 words on each side of the target word, 20 context words in total). The training parameters were:
MAX_ITER=15
WINDOW_SIZE=10
BINARY=0
NUM_THREADS=8
X_MAX=100
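Since BINARY=0 makes GloVe write its vectors as plain text, the files can probably be read line by line. The sketch below assumes the usual GloVe text layout (a word followed by its space-separated values on each line). The function name `load_glove_text` and the three-dimensional sample are illustrative only; the real files have 50, 100, 200 or 300 values per word.

```python
import io


def load_glove_text(handle, expected_dim=None):
    """Read 'word v1 v2 ... vd' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line_no, line in enumerate(handle, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        values = [float(x) for x in parts[1:]]
        # Optionally check that every vector has the expected dimensionality.
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(values)}"
            )
        vectors[word] = values
    return vectors


# Tiny in-memory stand-in for a vector file (values are made up).
sample_file = io.StringIO("police_car 0.1 -0.2 0.3\ncar 0.4 0.5 -0.6\n")
vectors = load_glove_text(sample_file, expected_dim=3)
print(len(vectors))
```

Passing `expected_dim` gives an early error if a file of the wrong dimensionality is loaded by mistake.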
Provider:
University of Tübingen
Created:
2024-02-29



