Preprocessed Java Code Corpus

Source: NIAID Data Ecosystem (indexed 2026-03-11)

Download link:

https://zenodo.org/record/3628521


Description:

A preprocessed code corpus for the Java programming language. The corpus was used for the experiments in the paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code". It contains preprocessed, tokenized files for training, validation, testing, and learning the BPE encoding. BPE-segmented versions of these files are also included for three encoding sizes, i.e., 2000, 5000, and 10000 BPE merge operations, along with the learned BPE encodings. Analogous versions are included in which compound identifiers are split on camelCase and snake_case, as in (Allamanis et al., 2015), together with the corresponding subtoken maps.
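To make the identifier-splitting variant concrete, here is a minimal Python sketch of splitting compound identifiers on camelCase and snake_case. It is an illustration only, not the script used to build the corpus: the function names `split_identifier` and `build_subtoken_map`, the regular expression, and the shape of the subtoken map (identifier to list of subtokens) are all assumptions, and the corpus files may use different conventions.

```python
import re

# Hypothetical splitting rule (an assumption, not the corpus authors' exact rule):
#   - an acronym followed by a capitalized word ("HTTP" in "HTTPResponse")
#   - an optionally capitalized lowercase word ("get", "Response")
#   - a remaining run of capitals ("MAX")
#   - a run of digits ("2")
_SUBTOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(identifier):
    """Split one identifier on snake_case underscores, then on camelCase."""
    subtokens = []
    for part in identifier.split("_"):
        if part:  # skip empty pieces from leading or doubled underscores
            # Characters outside [A-Za-z0-9], such as '$', are dropped here.
            subtokens.extend(_SUBTOKEN.findall(part))
    return subtokens


def build_subtoken_map(tokens):
    """Map each compound identifier in a token stream to its subtokens.

    Assumed format: only identifiers that split into more than one
    subtoken are recorded; punctuation and literals are ignored.
    """
    mapping = {}
    for token in tokens:
        if token.isidentifier():
            pieces = split_identifier(token)
            if len(pieces) > 1:
                mapping[token] = pieces
    return mapping
```

For example, `split_identifier("getHTTPResponse")` yields `["get", "HTTP", "Response"]` and `split_identifier("MAX_VALUE")` yields `["MAX", "VALUE"]`. Keeping a map from the original identifier to its pieces is one way to recover the unsplit token later, which is presumably the role of the subtoken maps shipped with the corpus.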

Created:

2020-01-29
