Data extraction template.

Source: NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://figshare.com/articles/dataset/Data_extraction_template_/25135956

Description:

Code clones are similar or identical code fragments that are copied and pasted within software systems, and they have negative effects on both software quality and maintenance. The objective of this work is to systematically review and analyze the recurrent neural network (RNN) techniques used to detect code clones, in order to shed light on current techniques and offer useful knowledge to the research community. Applying the review protocol, we identified 20 primary studies in this field from a total of 2099 studies. A close investigation of these studies reveals that nine RNN techniques have been used for code clone detection, with a notable preference for LSTM techniques. These techniques have proven effective at detecting both syntactic and semantic clones, often using abstract syntax trees to represent source code. Moreover, most studies applied evaluation metrics such as F-score, precision, and recall, and frequently used datasets extracted from open-source systems written in Java and C. Notably, the Graph-LSTM technique exhibited superior performance. PyTorch and TensorFlow emerged as popular tools for implementing RNN models. To advance code clone detection research, further exploration of techniques such as parallel LSTM, sentence-level LSTM, and Tree-Structured GRU is needed. In addition, more research is needed on the ability of RNN techniques to identify semantic clones across different programming languages and in binary code. Standardized benchmarks for languages such as Python, Scratch, and C#, along with cross-language comparisons, are also essential. The use of RNN techniques for clone identification is therefore a promising area that demands further research.
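
The description names precision, recall, and F-score as the metrics most studies report. As a minimal sketch of how these are typically computed for clone detection, the Python snippet below scores a set of predicted clone pairs against a set of known clone pairs. The function name `clone_detection_metrics`, the pair representation, and the sample fragment identifiers are illustrative assumptions and are not taken from the dataset or the reviewed studies.

```python
def clone_detection_metrics(predicted_pairs, true_pairs):
    """Return (precision, recall, f1) for unordered clone pairs.

    Hypothetical helper: each pair is a 2-tuple of fragment identifiers.
    """
    # Normalize pairs so that (a, b) and (b, a) count as the same clone pair.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    actual = {frozenset(pair) for pair in true_pairs}

    true_positives = len(predicted & actual)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up fragment identifiers, for illustration only.
    predicted = [("a", "b"), ("c", "d"), ("e", "f")]
    known = [("b", "a"), ("c", "d"), ("g", "h"), ("i", "j")]
    p, r, f = clone_detection_metrics(predicted, known)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Treating each pair as an unordered set is one possible design choice; a benchmark that distinguishes clone direction or clone type would need a different representation.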

Created:

2024-02-02
