Source Code Embeddings

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://zenodo.org/record/2558729

Description:

A set of six pretrained fastText models for semantic representations of source code. Each model was trained on high-quality GitHub repositories whose primary language is one of Java, Python, C++, C#, C, or PHP. To collect the training data, 13,144 repositories were cloned, and 2,402,790,348 lines of code were read and preprocessed, yielding a total of 944,467,560 tokens of clean training data. For further details, refer to the following paper: Efstathiou, V., Spinellis, D., 2019. "Semantic Source Code Models Using Identifier Embeddings". In 16th International Conference on Mining Software Repositories: Data Showcase Track. MSR'19.
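As a rough illustration of how such embeddings might be queried, the sketch below parses vectors in the plain-text layout that fastText can export (a header line with the token count and dimension, then one token and its values per line) and ranks neighbours by cosine similarity. It assumes the models are available in that text form; if the record ships only binary models, the fastText library itself would be needed instead. The tokens and numbers in `SAMPLE` are invented toy values, not taken from the dataset, and the function names are hypothetical.

```python
import io
import math
from typing import Dict, Iterable, List, Tuple


def load_vec(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse word vectors from the text layout: 'count dim' header, then 'token v1 ... vdim'."""
    it = iter(lines)
    dim = int(next(it).split()[1])  # header: "<number of tokens> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vectors: Dict[str, List[float]], token: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k tokens most similar to `token`, best match first."""
    query = vectors[token]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != token]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data for illustration only (assumption: NOT values from the real models).
SAMPLE = """4 3
open 0.9 0.1 0.0
close 0.8 0.2 0.1
read 0.7 0.3 0.0
color 0.0 0.1 0.9
"""

vectors = load_vec(io.StringIO(SAMPLE))
print(nearest(vectors, "open", k=2))
```

With a real model file, `load_vec` could be given an open file handle in place of the in-memory sample. A pure-Python scan like this is only practical for small vocabularies; a vectorised library would be the more likely choice for full-size models.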

Created:

2021-02-02
