Preprocessed Java Code Corpus
Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3628521
Description:
A preprocessed code corpus for the Java programming language.
The corpus was used for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code".
It contains preprocessed, tokenized files for training, validation, testing, and learning the BPE encoding.
BPE-segmented versions of these files are also included for three encoding sizes, i.e., 2000, 5000, and 10000 BPE merge operations, along with the learned BPE encodings.
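
To illustrate what a number of "BPE merge operations" means, the following is a minimal sketch of the usual greedy merge-learning loop on a toy word-frequency table. It is not the tooling used to build this corpus; the function name learn_bpe, the "</w>" end-of-word marker, and the toy input are assumptions made for illustration only.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} table.

    Illustrative sketch only; `learn_bpe` is a hypothetical helper and is
    not part of the corpus or of any specific BPE library.
    """
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges
```

In this sketch, an encoding size of 2000, 5000, or 10000 would correspond to the value passed as num_merges; a larger value yields longer subword units and a larger subword vocabulary.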
Similar versions are also included in which compound identifiers are split on camelCase and snake_case, as in (Allamanis et al., 2015), along with the corresponding subtoken maps.
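
The identifier splitting described above can be sketched as follows. This is one plausible heuristic under stated assumptions, not the exact rule used to produce the corpus: the regular expression, the helper names split_identifier and build_subtoken_map, and the dictionary layout of the map are all hypothetical, and the actual subtoken map files may use a different format.

```python
import re

# Assumed boundary rule: split between a lowercase letter or digit and an
# uppercase letter ("getValue"), and before the last capital of an acronym
# that is followed by a lowercase letter ("HTTPResponse" -> "HTTP", "Response").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_identifier(identifier):
    """Split a compound identifier into subtokens (hypothetical helper)."""
    # Mark camelCase boundaries with "_" so both conventions share one split.
    marked = _CAMEL_BOUNDARY.sub("_", identifier)
    return [piece for piece in marked.split("_") if piece]


def build_subtoken_map(identifiers):
    """Map each distinct identifier to its subtokens (assumed map layout)."""
    return {name: split_identifier(name) for name in set(identifiers)}
```

For example, split_identifier("parseHTTPResponse") returns ['parse', 'HTTP', 'Response'] and split_identifier("max_buffer_size") returns ['max', 'buffer', 'size'].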
Created:
2020-01-29



