LPcode

Indexed from arXiv on 2025-09-30

Download link:

https://github.com/Shinwoo-Park/detecting_llm_paraphrased_code_via_coding_style_features

Resource description:

LPcode is a dataset of paired samples: human-written code and paraphrases of that code produced by several large language models (LLMs). It supports two tasks: (1) detecting whether a given piece of code is an LLM paraphrase of human-written code, and (2) identifying which LLM produced the paraphrase. To ensure data integrity, only code released under the Apache, BSD, or MIT licenses was retained, and sensitive information was anonymized. Positive samples (paraphrased code) and negative samples (non-paraphrased code) appear in a 1:1 ratio.
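As a rough illustration of how the two tasks relate to a single sample, the sketch below models one pair as a record carrying both labels and checks the positive/negative balance. The field names (`human_code`, `candidate_code`, `is_paraphrase`, `paraphraser`) and the placeholder model name `llm_a` are assumptions made for this example; the actual file layout in the repository may differ.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodePair:
    """Hypothetical record layout; the real LPcode files may be structured differently."""
    human_code: str              # original human-written code
    candidate_code: str          # code being judged against the original
    is_paraphrase: bool          # Task 1 label: True for a positive sample
    paraphraser: Optional[str]   # Task 2 label: name of the paraphrasing LLM, None for negatives


def label_ratio(pairs):
    """Return (positives, negatives) so the 1:1 balance can be checked."""
    counts = Counter(p.is_paraphrase for p in pairs)
    return counts[True], counts[False]


# Toy placeholder data, not drawn from the dataset.
pairs = [
    CodePair("def add(a, b): return a + b",
             "def add(x, y):\n    return x + y", True, "llm_a"),
    CodePair("def add(a, b): return a + b",
             "def mul(a, b): return a * b", False, None),
]

positives, negatives = label_ratio(pairs)
```

Keeping both labels on one record means the same sample can serve Task 1 (binary detection) and, for positive samples only, Task 2 (multi-class attribution).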
