Details of dataset information.

Source: Figshare. Updated 2024-05-10. Indexed 2026-04-28.

Download link:

https://figshare.com/articles/dataset/Details_of_dataset_information_/25797326

Description:

In software development, it is common to reuse existing source code by copying and pasting. This produces large numbers of code clones (similar or identical code fragments) that harm software quality and maintainability. Although several code clone detection techniques exist, many struggle to identify semantic clones because they cannot extract both syntactic and semantic information. Few techniques use low-level source code representations such as bytecode or assembly for clone detection. This work introduces a novel code representation for identifying syntactic and semantic clones in Java source code. It integrates high-level features extracted from the Abstract Syntax Tree (AST) with low-level features derived from intermediate representations generated by static analysis tools such as the Soot framework. Using this combined representation, fifteen machine-learning models are trained to detect code clones. Evaluation on a large dataset shows that the models identify semantic clones accurately. Among these classifiers, ensemble classifiers such as the LightGBM classifier achieve the highest accuracy. Linearly combining features makes the models more effective than the multiplication and distance combination techniques do. The experimental findings indicate that the proposed method can outperform existing clone detection techniques in detecting semantic clones.
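The description names three ways of combining the features of two code fragments (linear, multiplication, distance) without defining them. The sketch below shows one plausible reading in plain Python. Every name here (`fuse`, `combine_linear`, `combine_multiply`, `combine_distance`) and every number is hypothetical. In particular, treating "linear combination" as an element-wise sum and "distance" as an element-wise absolute difference is an assumption, not something the dataset record confirms.

```python
from typing import List

Vector = List[float]


def fuse(ast_features: Vector, ir_features: Vector) -> Vector:
    """Join high-level (AST) and low-level (IR) features of ONE fragment.

    Assumption: the combined representation is a simple concatenation.
    """
    return list(ast_features) + list(ir_features)


def _check(a: Vector, b: Vector) -> None:
    # Pairwise combination only makes sense for equal-length vectors.
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")


def combine_linear(a: Vector, b: Vector) -> Vector:
    """Assumed reading of 'linear combination': element-wise sum."""
    _check(a, b)
    return [x + y for x, y in zip(a, b)]


def combine_multiply(a: Vector, b: Vector) -> Vector:
    """Element-wise product of the two fragments' features."""
    _check(a, b)
    return [x * y for x, y in zip(a, b)]


def combine_distance(a: Vector, b: Vector) -> Vector:
    """Assumed reading of 'distance': element-wise absolute difference."""
    _check(a, b)
    return [abs(x - y) for x, y in zip(a, b)]


# Hypothetical feature values for two Java fragments.
frag_a = fuse([3.0, 1.0], [2.0])   # [3.0, 1.0, 2.0]
frag_b = fuse([1.0, 1.0], [4.0])   # [1.0, 1.0, 4.0]

pair_linear = combine_linear(frag_a, frag_b)      # [4.0, 2.0, 6.0]
pair_multiply = combine_multiply(frag_a, frag_b)  # [3.0, 1.0, 8.0]
pair_distance = combine_distance(frag_a, frag_b)  # [2.0, 0.0, 2.0]
```

Each resulting pair vector would then serve as one input row for a binary classifier (clone or not clone), such as the LightGBM classifier mentioned in the description.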

Created:

2024-05-10
