Enhanced word embeddings using multi-semantic representation through lexical chains
Source: Mendeley Data. Updated: 2024-01-31. Indexed: 2024-06-27.
Download link:
http://deepblue.lib.umich.edu/data/concern/data_sets/w9505046h
Resource description:
Title: Enhanced word embeddings using multi-semantic representation through lexical chains

document-vectors: The benchmark datasets (documents) were converted into vectors using the word embedding models referenced in this work. Word embeddings used to convert documents into document vectors: word2vec (Google News), LDA, GloVe, fastText, USE, ELMo. Details and descriptions are in the original paper linked to this dataset.

synset-models: The proposed synset embeddings are located under the synset-models folder. They were produced by training a word2vec implementation on the synset corpus (300 dimensions, CBOW training model, window size 15, minimum count 10, hierarchical softmax). Parameters not referenced use their default values (https://radimrehurek.com/gensim/models/word2vec.html).

Techniques used:
FLLC + MSSA-0R, FLLC + MSSA-1R, FLLC + MSSA-2R
FXLC2 + MSSA-0R, FXLC2 + MSSA-1R, FXLC2 + MSSA-2R
FXLC4 + MSSA-0R, FXLC4 + MSSA-1R, FXLC4 + MSSA-2R
FXLC8 + MSSA-0R, FXLC8 + MSSA-1R, FXLC8 + MSSA-2R

The MSSA techniques used are based on the paper "Multi-Sense embeddings through a word sense disambiguation process" by Ruas, Terry
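As a rough illustration, the stated synset-models settings and the twelve technique combinations could be written down as follows. This is a minimal sketch, not the authors' training script: the keyword names assume gensim 4.x (gensim 3.x used `size` rather than `vector_size`), the names `SYNSET_W2V_PARAMS`, `technique_combinations`, and `train_synset_model` are hypothetical, and the input format (an iterable of token lists) is an assumption. The description does not say whether negative sampling was disabled, so `negative` is left at gensim's default here.

```python
from itertools import product

# Hyperparameters stated in the dataset description, expressed as
# gensim 4.x keyword arguments (an assumption about the gensim version).
SYNSET_W2V_PARAMS = {
    "vector_size": 300,  # 300 dimensions
    "sg": 0,             # 0 selects the CBOW training model
    "window": 15,        # window size 15
    "min_count": 10,     # minimum count 10
    "hs": 1,             # enable hierarchical softmax
    # All other parameters (including `negative`) keep gensim's defaults,
    # following "parameters not referenced use their default values".
}

# Lexical chain techniques and MSSA variants listed in the description.
CHAIN_TECHNIQUES = ["FLLC", "FXLC2", "FXLC4", "FXLC8"]
MSSA_VARIANTS = ["MSSA-0R", "MSSA-1R", "MSSA-2R"]


def technique_combinations():
    """Return the twelve 'chain + MSSA' labels in the order listed above."""
    return [
        f"{chain} + {mssa}"
        for chain, mssa in product(CHAIN_TECHNIQUES, MSSA_VARIANTS)
    ]


def train_synset_model(sentences):
    """Train word2vec on an iterable of token lists (hypothetical input).

    gensim is imported lazily so the rest of this sketch runs without it.
    """
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **SYNSET_W2V_PARAMS)


if __name__ == "__main__":
    for label in technique_combinations():
        print(label)
```

Keeping the parameters in one dictionary makes it easy to compare them against the description, and to set `negative=0` if hierarchical softmax alone is wanted.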
Created:
2024-01-31



