LPcode
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/Shinwoo-Park/detecting_llm_paraphrased_code_via_coding_style_features
Resource description:
LPcode is a dataset of paired samples: human-written code and paraphrased versions of that code generated by several large language models (LLMs). It supports two tasks: 1) determining whether a given LLM-generated code sample is a paraphrase of the human-written code; 2) identifying which LLM paraphrased the original code. To ensure data integrity, only code released under the Apache, BSD, and MIT licenses was retained, and sensitive information was anonymized. Positive samples (paraphrased code) and negative samples (non-paraphrased code) appear at a 1:1 ratio.
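The two tasks can be illustrated with a minimal Python sketch. This is not the dataset's actual schema or loading code: the `CodePair` record, its field names, the helper functions, and the placeholder model names `model_a` and `model_b` are all hypothetical, chosen only to show how one set of pairs could yield binary detection labels and multi-class attribution labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CodePair:
    """Hypothetical record; the real LPcode files may be laid out differently."""
    human_code: str
    candidate_code: str
    is_paraphrase: bool              # True for a positive sample
    llm_name: Optional[str] = None   # set only for positives (assumed field)


def detection_labels(pairs: List[CodePair]) -> List[int]:
    """Task 1: 1 if the candidate paraphrases the human code, else 0."""
    return [1 if p.is_paraphrase else 0 for p in pairs]


def attribution_labels(pairs: List[CodePair]) -> Tuple[List[int], Dict[str, int]]:
    """Task 2: class index of the paraphrasing LLM, positives only."""
    positives = [p for p in pairs if p.is_paraphrase]
    names = sorted({p.llm_name for p in positives})
    index = {name: i for i, name in enumerate(names)}
    return [index[p.llm_name] for p in positives], index


def positive_ratio(pairs: List[CodePair]) -> float:
    """Share of positive samples; 0.5 corresponds to a 1:1 ratio."""
    labels = detection_labels(pairs)
    return sum(labels) / len(labels)


# Toy data invented for illustration, not taken from LPcode.
toy_pairs = [
    CodePair("def add(a, b): return a + b",
             "def add(x, y):\n    return x + y", True, "model_a"),
    CodePair("def add(a, b): return a + b",
             "def sub(x, y):\n    return x - y", False),
    CodePair("print('hi')",
             "message = 'hi'\nprint(message)", True, "model_b"),
    CodePair("print('hi')",
             "for i in range(3): pass", False),
]
```

In this sketch attribution is computed over positive samples only, since a non-paraphrased sample has no paraphrasing model to name; whether LPcode makes the same choice is an assumption.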



