
semeru/code-code-CodeCompletion-TokenLevel-Java

Source: Hugging Face. Updated 2023-03-24; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/semeru/code-code-CodeCompletion-TokenLevel-Java
Resource description:
---
license: mit
Programminglanguage: "Java"
version: "N/A"
Date: "From paper: https://homepages.inf.ed.ac.uk/csutton/publications/msr2013.pdf (2013 - paper release date)"
Contaminated: "Very Likely"
Size: "Standard Tokenizer (TreeSitter)"
---

### Dataset is imported from CodeXGLUE and pre-processed using their script.

# Where to find in Semeru:

The dataset can be found at /nfs/semeru/semeru_datasets/code_xglue/code-to-code/CodeCompletion-token/dataset/javaCorpus in Semeru

# CodeXGLUE -- Code Completion (token level)

**Update 2021.07.30:** We updated the code completion dataset with literals normalized to avoid sensitive information.

Here is the introduction and pipeline for the token-level code completion task.

## Task Definition

Predict the next code token given the context of previous tokens. Models are evaluated by token-level accuracy.

Code completion is one of the most widely used features in software development through IDEs. An effective code completion tool could improve software developers' productivity. We provide code completion evaluation tasks in two granularities: token level and line level. Here we introduce token-level code completion. The token-level task is analogous to language modeling. Models should be able to predict the next token of arbitrary type.

## Dataset

The dataset is in Java.

### Dependency

- javalang == 0.13.0

### Github Java Corpus

We use the Java corpus dataset mined by Allamanis and Sutton in their MSR 2013 paper [Mining Source Code Repositories at Massive Scale using Language Modeling](https://homepages.inf.ed.ac.uk/csutton/publications/msr2013.pdf). We follow the same split and preprocessing as Karampatsis's ICSE 2020 paper [Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code](http://homepages.inf.ed.ac.uk/s1467463/documents/icse20-main-1325.pdf).

### Data Format

The code corpus is saved in txt format files. Each line is one tokenized code snippet:

```
<s> from __future__ import unicode_literals <EOL> from django . db import models , migrations <EOL> class Migration ( migrations . Migration ) : <EOL> dependencies = [ <EOL> ] <EOL> operations = [ <EOL> migrations . CreateModel ( <EOL> name = '<STR_LIT>' , <EOL> fields = [ <EOL> ( '<STR_LIT:id>' , models . AutoField ( verbose_name = '<STR_LIT>' , serialize = False , auto_created = True , primary_key = True ) ) , <EOL> ( '<STR_LIT:name>' , models . CharField ( help_text = b'<STR_LIT>' , max_length = <NUM_LIT> ) ) , <EOL> ( '<STR_LIT:image>' , models . ImageField ( help_text = b'<STR_LIT>' , null = True , upload_to = b'<STR_LIT>' , blank = True ) ) , <EOL> ] , <EOL> options = { <EOL> '<STR_LIT>' : ( '<STR_LIT:name>' , ) , <EOL> '<STR_LIT>' : '<STR_LIT>' , <EOL> } , <EOL> bases = ( models . Model , ) , <EOL> ) , <EOL> ] </s>
```

### Data Statistics

Data statistics of the Github Java Corpus dataset are shown in the table below:

| Data Split | #Files | #Tokens |
| ----------- | :--------: | :---------: |
| Train | 12,934 | 15.7M |
| Dev | 7,176 | 3.8M |
| Test | 8,268 | 5.3M |

Provider:
semeru
Summary of original information

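Dataset overview

Basic information

  • License: MIT
  • Programming language: Java
  • Source: imported from CodeXGLUE and pre-processed using their script
  • Location: /nfs/semeru/semeru_datasets/code_xglue/code-to-code/CodeCompletion-token/dataset/javaCorpus

Task definition

  • Task: code completion (token level)
  • Goal: predict the next code token given the context of previous tokens
  • Evaluation metric: token-level accuracy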
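The evaluation metric can be illustrated with a minimal sketch. It assumes predictions and references are already aligned position by position and counts every position equally; an official evaluator may treat special tokens such as `<s>` or `<EOL>` differently. The function name `token_accuracy` and the sample tokens are hypothetical.

```python
from typing import Sequence


def token_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where the predicted token equals the reference token."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Hypothetical example: one of five tokens is mispredicted.
reference = "int x = <NUM_LIT> ;".split()
predicted = "int y = <NUM_LIT> ;".split()
print(token_accuracy(predicted, reference))  # 0.8
```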

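Dataset details

  • Dataset language: Java
  • Dependency: javalang == 0.13.0
  • Data source: the Java corpus mined by Allamanis and Sutton in their MSR 2013 paper
  • Preprocessing: follows the split and preprocessing of Karampatsis's ICSE 2020 paper

Data format

  • Storage format: text files
  • Data structure: each line contains one tokenized code snippet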
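As a rough sketch of reading this format, the snippet below parses one line into tokens and regroups them into source lines. It assumes tokens are separated by single spaces, that no token contains whitespace, and that the `<s>`, `</s>`, and `<EOL>` markers appear as in the sample on the dataset card. The helper names `parse_line` and `split_source_lines` are hypothetical.

```python
from typing import List


def parse_line(line: str) -> List[str]:
    """Split one corpus line on whitespace and drop the <s> / </s> sentinels."""
    tokens = line.split()
    if tokens and tokens[0] == "<s>":
        tokens = tokens[1:]
    if tokens and tokens[-1] == "</s>":
        tokens = tokens[:-1]
    return tokens


def split_source_lines(tokens: List[str]) -> List[List[str]]:
    """Group tokens into original source lines using the <EOL> marker."""
    lines: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == "<EOL>":
            lines.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        lines.append(current)
    return lines


# Shortened excerpt of the sample shown on the dataset card.
sample = (
    "<s> from django . db import models , migrations <EOL> "
    "class Migration ( migrations . Migration ) : <EOL> "
    "dependencies = [ <EOL> ] </s>"
)
tokens = parse_line(sample)
source_lines = split_source_lines(tokens)
for line_tokens in source_lines:
    print(" ".join(line_tokens))
```

For a whole file, the same `parse_line` call would be applied to each line read from the split's text file.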

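Data statistics

| Data split | #Files | #Tokens |
| ----------- | :--------: | :---------: |
| Train | 12,934 | 15.7M |
| Dev | 7,176 | 3.8M |
| Test | 8,268 | 5.3M |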
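The split sizes can be put in proportion with a little arithmetic. The sketch below uses only the figures from the table; the token counts are rounded in the source (for example "15.7M"), so the derived shares and per-file averages are approximations.

```python
# File and token counts copied from the statistics table (token counts are rounded).
splits = {
    "train": {"files": 12_934, "tokens": 15_700_000},
    "dev": {"files": 7_176, "tokens": 3_800_000},
    "test": {"files": 8_268, "tokens": 5_300_000},
}

total_files = sum(s["files"] for s in splits.values())
total_tokens = sum(s["tokens"] for s in splits.values())
print(f"total: {total_files:,} files, {total_tokens:,} tokens")

for name, s in splits.items():
    share = s["tokens"] / total_tokens       # fraction of all tokens in this split
    per_file = s["tokens"] / s["files"]      # average tokens per file
    print(f"{name}: {share:.1%} of tokens, about {per_file:,.0f} tokens per file")
```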

Update information

  • Latest update: 2021.07.30. The code completion dataset was updated with literals normalized to avoid exposing sensitive information.
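Literal normalization of this kind can be sketched as follows. This is an illustration, not the CodeXGLUE preprocessing script: it assumes the input is already tokenized with each string literal as a single double-quoted token, uses the `<STR_LIT>`, `<STR_LIT:...>`, and `<NUM_LIT>` placeholder forms visible in the sample data, and introduces a hypothetical `keep_strings` whitelist for literals that are kept inside the placeholder.

```python
import re
from typing import FrozenSet, Iterable, List

# Simplified pattern for decimal, floating-point, and hex numeric literals (an assumption).
NUM_RE = re.compile(r"^(0[xX][0-9a-fA-F_]+|\d[\d_]*(\.\d+)?([eE][+-]?\d+)?)[lLfFdD]?$")


def normalize_literals(
    tokens: Iterable[str], keep_strings: FrozenSet[str] = frozenset()
) -> List[str]:
    """Replace string and numeric literal tokens with placeholder tokens."""
    out: List[str] = []
    for tok in tokens:
        if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"':
            body = tok[1:-1]
            # Whitelisted strings keep their value; everything else is masked.
            out.append(f'"<STR_LIT:{body}>"' if body in keep_strings else '"<STR_LIT>"')
        elif NUM_RE.match(tok):
            out.append("<NUM_LIT>")
        else:
            out.append(tok)
    return out


raw = ['String', 'key', '=', '"s3cr3t-token"', ';',
       'int', 'n', '=', '42', ';',
       'String', 'mode', '=', '"utf-8"', ';']
normalized = normalize_literals(raw, keep_strings=frozenset({"utf-8"}))
print(" ".join(normalized))
```

Masking the unlisted string removes the potentially sensitive value while leaving the surrounding token structure intact.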
Collected summary
Dataset introduction
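Construction
In software engineering, automatic code completion plays a key role in improving development efficiency. This dataset comes from the CodeXGLUE project and is built on the Java corpus presented by Allamanis and Sutton in their MSR 2013 paper. The raw data was obtained by large-scale mining of open-source GitHub repositories, and the split and preprocessing follow the ICSE 2020 paper by Karampatsis et al. Specifically, code snippets are converted into token sequences (the dataset card lists a standard TreeSitter tokenizer), and literals are normalized to remove sensitive information, yielding training, validation, and test sets for token-level code completion.
Features
The dataset targets token-level code completion for Java, and its core feature is a highly structured token-sequence representation. Each code snippet is converted into a token stream delimited by special symbols: `<s>` marks the start and `</s>` the end, `<EOL>` marks line ends, and placeholders such as `<STR_LIT>` and `<NUM_LIT>` mark literal types. The training set contains about 15.7 million tokens, and the validation and test sets contain about 3.8 million and 5.3 million tokens respectively. This design approximates real-time completion in an IDE, where a model predicts the next code token from its context.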
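Usage
The dataset is mainly used to train and evaluate token-level code completion models. Researchers can load the preprocessed text files directly; each line is one tokenized code sequence. A model takes the preceding tokens as input and predicts the next code token. Evaluation uses token-level accuracy. The dataset is already split into training, validation, and test subsets for model training, hyperparameter tuning, and performance testing. The data format is compatible with the CodeXGLUE framework, so it can be integrated into existing machine learning pipelines.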
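The train-then-evaluate workflow described above can be sketched with a deliberately simple baseline. This is not a CodeXGLUE model: it is a bigram predictor that guesses the most frequent successor of the previous token, with hypothetical helper names (`train_bigram`, `evaluate`) and toy data, shown only to make the next-token setup concrete.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def train_bigram(lines: List[List[str]]) -> Dict[str, str]:
    """Map each token to its most frequent successor in the training lines."""
    successors: Dict[str, Counter] = defaultdict(Counter)
    for tokens in lines:
        for prev, nxt in zip(tokens, tokens[1:]):
            successors[prev][nxt] += 1
    return {prev: counts.most_common(1)[0][0] for prev, counts in successors.items()}


def evaluate(model: Dict[str, str], lines: List[List[str]], fallback: str = "<unk>") -> float:
    """Token-level accuracy of next-token predictions over all positions."""
    correct = total = 0
    for tokens in lines:
        for prev, target in zip(tokens, tokens[1:]):
            correct += model.get(prev, fallback) == target
            total += 1
    return correct / total if total else 0.0


# Toy data in the corpus format (hypothetical snippets).
train = [
    "<s> int x = <NUM_LIT> ; </s>".split(),
    "<s> int y = <NUM_LIT> ; </s>".split(),
]
test = ["<s> int z = <NUM_LIT> ; </s>".split()]

model = train_bigram(train)
accuracy = evaluate(model, test)
print(f"token-level accuracy: {accuracy:.3f}")
```

The unseen identifier `z` causes two misses (predicting it, and predicting what follows it), which is the kind of error a stronger context model is meant to reduce.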
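Background and challenges
Background overview
At the intersection of software engineering and artificial intelligence, automatic code completion has long drawn attention as a tool for improving development efficiency. The semeru/code-code-CodeCompletion-TokenLevel-Java dataset originates from the paper "Mining Source Code Repositories at Massive Scale using Language Modeling", published by Allamanis and Sutton at MSR 2013, which applied language modeling to source code repositories at massive scale. The dataset is built from Java repositories on GitHub and targets fine-grained prediction in code completion, that is, predicting the next code token from its context. The corpus was used in later work such as Karampatsis's open-vocabulary models (ICSE 2020), whose split and preprocessing the CodeXGLUE version follows, and it has supported the shift in code intelligence from traditional methods toward deep learning.
Current challenges
The challenges fall into two groups. At the task level, code completion must cope with the complex structure of programming languages, such as grammar rules, variable scope, and API call patterns; a model has to predict tokens accurately in a highly dynamic context while avoiding leakage of sensitive information such as raw literal values. At the construction level, preprocessing depends on specific tools such as javalang 0.13.0 for tokenization, and the heterogeneity of large code repositories (for example, differences in comment style and code quality) makes cleaning and standardization harder. In addition, the split follows settings from earlier research and may not fully reflect how modern codebases have evolved, which can limit model generalization.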
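Common scenarios
Typical use cases
In software engineering and programming language processing, automatic code completion is a key tool for improving development efficiency. As part of the CodeXGLUE benchmark, the semeru/code-code-CodeCompletion-TokenLevel-Java dataset is used to train and evaluate context-based code token prediction models. By providing a large number of Java code snippets, it lets models learn the syntactic structure and semantic patterns of the language for next-token prediction. Typical applications include code suggestions in integrated development environments, training the core module of code generation systems, and fine-tuning and validating programming language models.
Derived and related work
Several studies are associated with this dataset. For example, the ICSE 2020 paper "Big Code != Big Vocabulary" by Karampatsis et al. used this data to study the effectiveness of open-vocabulary models on source code. The CodeXGLUE benchmark includes it as an evaluation set for the code completion task, which has supported the adaptation of pretrained models such as CodeBERT and CodeGPT to programming languages. Together, these works have advanced code representation learning and automatic code generation.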
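Recent research on the dataset
Latest research directions
At the intersection of software engineering and artificial intelligence, automatic code completion is receiving wide attention as a core technique for improving development efficiency. Research based on the semeru/code-code-CodeCompletion-TokenLevel-Java dataset focuses on open-vocabulary modeling and context-aware code generation. Building on the method Karampatsis proposed at ICSE 2020, researchers explore how to train models on large code corpora so that they handle the constantly changing vocabulary of programming languages and avoid the limits of traditional closed-vocabulary models. A notable development is the CodeXGLUE benchmark update that normalizes literals to protect sensitive information, reflecting a balance between data privacy and model generalization. These advances support practical intelligent development tools and inform related tasks such as code understanding and defect detection.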
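The limitation of a closed vocabulary can be made concrete by measuring the out-of-vocabulary (OOV) rate: the share of test tokens that never appear in the training vocabulary. The sketch below is an illustration with toy data and hypothetical helper names (`build_vocab`, `oov_rate`); it is not taken from the cited paper.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocab(token_lists: Iterable[List[str]], max_size: int) -> Set[str]:
    """Closed vocabulary: the max_size most frequent training tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok for tok, _ in counts.most_common(max_size)}


def oov_rate(token_lists: List[List[str]], vocab: Set[str]) -> float:
    """Share of tokens that fall outside the vocabulary."""
    total = sum(len(tokens) for tokens in token_lists)
    unseen = sum(tok not in vocab for tokens in token_lists for tok in tokens)
    return unseen / total if total else 0.0


# Toy data: the test line introduces a new identifier, userCount.
train = ["int count = <NUM_LIT> ;".split(), "int total = count ;".split()]
test = ["int userCount = total ;".split()]

vocab = build_vocab(train, max_size=100)
rate = oov_rate(test, vocab)
print(f"OOV rate: {rate:.0%}")  # 20%
```

Because developers coin new identifiers freely, a closed-vocabulary model can never predict tokens like `userCount`; open-vocabulary approaches address this by composing tokens from smaller units.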
The content above was collected and summarized by 遇见数据集.