Source Code Embeddings
Source: NIAID Data Ecosystem. Indexed 2026-03-12.
Download link:
https://zenodo.org/record/2558729
Description:
A set of six pretrained fastText models for semantic representations of source code.
Each model was trained on high-quality GitHub repositories whose primary language is one of Java, Python, C++, C#, C, or PHP. To collect the training data, 13,144 repositories were cloned and 2,402,790,348 lines of code were read and preprocessed, producing a total of 944,467,560 tokens of clean training data.
For further details refer to the following paper:
Efstathiou, V., Spinellis, D., 2019. "Semantic Source Code Models Using Identifier Embeddings". In 16th International Conference on Mining Software Repositories: Data Showcase Track. MSR'19.
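As a rough illustration of how embeddings like these can be queried, the sketch below parses fastText's plain-text vector format (a "count dim" header line followed by one "token v1 ... vN" row per token) and ranks tokens by cosine similarity. It assumes the vectors are available in that text format, which this listing does not confirm. The function names (`load_vec`, `cosine`, `nearest`) and the toy vectors are hypothetical and invented for the example; they are not taken from the dataset.

```python
import io
import math


def load_vec(stream):
    """Parse fastText text-format vectors from an open text stream."""
    # Header line: "<number of tokens> <vector dimension>"
    _count, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(vectors, token, k=3):
    """Return the k tokens most similar to `token`, best match first."""
    query = vectors[token]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != token
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-in for a real model file; the values are invented for illustration.
SAMPLE = """4 3
list 0.9 0.1 0.0
array 0.8 0.2 0.1
socket 0.0 0.1 0.9
port 0.1 0.0 0.8
"""

vectors = load_vec(io.StringIO(SAMPLE))
print(nearest(vectors, "list", k=1))
```

With a real file, the same `load_vec` call would take a handle from `open(path, encoding="utf-8")`, where `path` is wherever the downloaded model was saved. If the models are distributed only in fastText's binary format, they would instead need to be loaded with the fastText library itself.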
Created:
2021-02-02



