Training word vectors on text from The Physics Teacher using Word2Vec

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/13907718
Description:
This notebook and dataset allow one to experiment with word vectors trained on text from articles in the journal The Physics Teacher published between 1963 (the start of publication) and 2020, around 15000 articles in total.

The primary data file, "TPT_word2vec_words_bigrams_V1.pkl", contains the cleaned text of these articles as a nested list. Each paper is a sub-list of the top-level list, and each sentence in a paper is a further sub-list containing the words of that sentence in order. During data cleaning we removed "stop words" (such as if, and, but), punctuation, symbols, and numbers, lowercased all words, and merged words that frequently occur together into a single token (for example, "high" and "school" become "high_school"). Here is an example of 3 sentences taken from a random paper:

[['magnet', 'spin', 'tape', 'magnetize', 'strongly', 'time', 'pole', 'approach'],
 ['magnet', 'place', 'center', 'counterweight', 'period', 'magnetize', 'pulse', 'twice', 'long'],
 ['trial', 'tape', 'examine', 'sprinkle_iron', 'filing', 'length'], ... ]

In the notebook, we create a set of word vectors from these sentences using the Word2Vec technique, first published by Mikolov et al. (2013):

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (No. arXiv:1301.3781). arXiv. https://doi.org/10.48550/arXiv.1301.3781

The notebook includes code both for loading word vectors from a trained model (also included, "TPT_word2vec.model") and for recreating the same model from the TPT text dataset. Because word vectors are randomly initialized, we include a random seed to make the training replicable. Changing the seed will alter some of the results (although the changes seem fairly minor).
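The nested layout described above (papers, then sentences, then words) can be explored with a short sketch. It uses only the three example sentences quoted in the description, wrapped as a single toy "paper"; the helper names flatten_sentences and is_bigram are hypothetical and do not come from the notebook.

```python
from collections import Counter

# The three example sentences from the description, wrapped as one "paper".
# The real file holds many papers; this toy corpus is only illustrative.
corpus = [
    [
        ['magnet', 'spin', 'tape', 'magnetize', 'strongly', 'time', 'pole', 'approach'],
        ['magnet', 'place', 'center', 'counterweight', 'period', 'magnetize', 'pulse', 'twice', 'long'],
        ['trial', 'tape', 'examine', 'sprinkle_iron', 'filing', 'length'],
    ],
]


def flatten_sentences(papers):
    """Collapse papers -> sentences -> words into a flat list of sentences."""
    return [sentence for paper in papers for sentence in paper]


def is_bigram(token):
    """Merged word pairs are joined with an underscore, e.g. 'high_school'."""
    return "_" in token


sentences = flatten_sentences(corpus)

# Count how often each token occurs across all sentences.
word_counts = Counter(word for sentence in sentences for word in sentence)

# Collect the merged two-word tokens.
bigrams = sorted(word for word in word_counts if is_bigram(word))

print(len(sentences))
print(word_counts["magnet"])
print(bigrams)
```

With the real data, one would presumably load the .pkl file with Python's pickle module (an assumption based on the file extension) and pass the flattened list of sentences to a Word2Vec implementation such as gensim's Word2Vec class, which accepts a sentences argument and a seed argument. The exact training parameters used in the notebook are not stated here.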
With a trained model, we demonstrate some applications of word vectors: adding and subtracting meanings (for example "experiment" - "uncertainty" = "demonstration") and visualizing low-dimensional representations of word vectors.

To install the required packages, use the requirements.txt file. If using pip, run "pip install -r requirements.txt". Alternatively, if using Anaconda (recommended), run "conda install --file requirements.txt". You will also need software to run Jupyter notebooks, which can be installed with Anaconda or pip.
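The idea of adding and subtracting meanings mentioned above can be illustrated with a minimal sketch: subtract one vector from another, then find the remaining word whose vector has the highest cosine similarity to the result. The three-dimensional vectors below are hand-picked, hypothetical values chosen so that the toy example reproduces the analogy from the description; they are not taken from the trained TPT model, and the function names are illustrative only.

```python
import math

# Hypothetical 3-dimensional vectors chosen by hand for illustration.
# They are NOT values from the trained TPT model.
toy_vectors = {
    "experiment":    [0.9, 0.8, 0.1],
    "uncertainty":   [0.1, 0.9, 0.0],
    "demonstration": [0.9, 0.0, 0.2],
    "magnet":        [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, exclude=()):
    """Return the word whose vector is most similar to the query vector."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))


# "experiment" - "uncertainty", computed component by component.
query = [a - b for a, b in zip(toy_vectors["experiment"], toy_vectors["uncertainty"])]

# The input words are excluded so the query does not simply return itself.
result = nearest(query, toy_vectors, exclude=("experiment", "uncertainty"))
print(result)
```

In gensim, comparable arithmetic on a trained model is typically done with the most_similar method of the model's keyed vectors, passing words to add as positive and words to subtract as negative. Whether the notebook uses exactly that call is not stated in this description.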
Created:
2024-10-09