Dataset and Code for: Code problem similarity detection using code clones and pretrained models

Source: DataCite Commons | Updated: 2025-06-10 | Indexed: 2025-04-16
Download link:
https://researchdata.ntu.edu.sg/citation?persistentId=doi:10.21979/N9/VPCR7H
Description:
This dataset complements the following study: Code problem similarity detection using code clones and pretrained models (SCSE22-0384). The study explores a new approach to detecting similar algorithmic-style code problems from websites such as LeetCode and Codeforces by comparing the similarity of their solution source code, an application of type IV code clone detection. It is based on 107,000 submissions in 3 languages (Python, C++ and Java) to 3,000 Codeforces problems from 2020 to 2023. Experiments were carried out on this dataset using 3 pre-trained models (C4-CodeBERT, GraphCodeBERT and UniXcoder). UniXcoder performed best, with an F1 score of 0.905, and was therefore used as the backbone of the code problem similarity checker (CPSC), which identifies the problems in the dataset most similar to an input source code. Based on the tests conducted in this project, this approach achieves state-of-the-art results in detecting similarity between code problems. Further research is possible in other domains where type IV code clone detection is useful.
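The retrieval step of a checker like the CPSC can be sketched as a top-k search over embeddings of solution code. This is only a minimal illustration under several assumptions, not the study's implementation: the vectors are assumed to come from an encoder such as UniXcoder (hard-coded toy values are used here), cosine similarity stands in for whatever clone-detection score the study uses, each problem is scored by its best-matching submission, and the names `cosine`, `top_similar_problems`, `P1`, `P2` and `P3` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_similar_problems(query_vec, submissions, k=3):
    """Rank problems by their best-matching submission.

    submissions: iterable of (problem_id, embedding) pairs.
    Assumption: taking the maximum over a problem's submissions is an
    illustrative aggregation choice, not one confirmed by the study.
    """
    best = {}
    for problem_id, vec in submissions:
        score = cosine(query_vec, vec)
        if problem_id not in best or score > best[problem_id]:
            best[problem_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy embeddings standing in for encoder output; the problem IDs are made up.
submissions = [
    ("P1", [1.0, 0.0, 0.0]),
    ("P1", [0.9, 0.1, 0.0]),
    ("P2", [0.0, 1.0, 0.0]),
    ("P3", [0.0, 0.0, 1.0]),
]
query = [0.8, 0.2, 0.0]

results = top_similar_problems(query, submissions, k=2)
for problem_id, score in results:
    print(problem_id, round(score, 3))
```

In a real setting the toy vectors would be replaced by embeddings computed from the submissions in the dataset, and a linear scan like this one would likely need an approximate nearest-neighbour index to scale to 107,000 submissions.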
Provider:
DR-NTU (Data)
Created:
2023-05-08