code_x_glue_cc_clone_detection_big_clone_bench

Name: code_x_glue_cc_clone_detection_big_clone_bench
Creator: maas
Published: 2025-11-27 16:30:41
License: no description provided

ModelScope community. Updated 2025-11-27; indexed 2025-04-26.

Download link:

https://modelscope.cn/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench


Description:

# Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

### Dataset Summary

CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench

Given two code snippets as input, the task is binary classification (0/1), where 1 stands for semantic equivalence and 0 for anything else. Models are evaluated by F1 score.

The dataset is BigCloneBench, filtered following the paper *Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree*.

### Supported Tasks and Leaderboards

- `semantic-similarity-classification`: The dataset can be used to train a model that classifies whether two given Java methods are clones of each other.
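The Dataset Summary says predictions are binary and scored by F1. A minimal sketch of that metric follows, assuming F1 is computed for the positive (clone) class only; the card does not say which averaging the benchmark uses, so the official evaluator may differ. The function name `binary_f1` and the toy label lists are illustrative, not part of the dataset.

```python
from typing import Sequence


def binary_f1(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 for the positive (clone) class; an assumed reading of 'F1 score'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count the three outcomes that matter for precision and recall.
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    false_pos = sum(1 for g, p in zip(gold, predicted) if not g and p)
    false_neg = sum(1 for g, p in zip(gold, predicted) if g and not p)

    if true_pos == 0:
        # No correct positive prediction: precision or recall is zero/undefined.
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): True means "the pair is a clone".
gold = [True, False, True, True, False]
predicted = [True, True, False, True, False]
score = binary_f1(gold, predicted)
print(f"F1 = {score:.3f}")
```

Here two clone pairs are found, one non-clone is flagged by mistake, and one clone is missed, so precision and recall are both 2/3.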

### Languages

- Java **programming** language

## Dataset Structure

### Data Instances

An example of 'test' looks as follows.

```
{
    "func1": " @Test(expected = GadgetException.class)\n public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n HttpRequest request = createCacheableRequest();\n expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n replay(pipeline);\n try {\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n fail(\"No exception thrown on bad parse\");\n } catch (GadgetException e) {\n }\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n }\n",
    "func2": " public InputStream getInputStream() throws TGBrowserException {\n try {\n if (!this.isFolder()) {\n URL url = new URL(this.url);\n InputStream stream = url.openStream();\n return stream;\n }\n } catch (Throwable throwable) {\n throw new TGBrowserException(throwable);\n }\n return null;\n }\n",
    "id": 0,
    "id1": 2381663,
    "id2": 4458076,
    "label": false
}
```

### Data Fields

The data fields are explained below for each config. They are the same across all splits.

#### default

|field name| type | description                                                       |
|----------|------|-------------------------------------------------------------------|
|id        |int32 | Index of the sample                                               |
|id1       |int32 | The first function id                                             |
|id2       |int32 | The second function id                                            |
|func1     |string| The full text of the first function                               |
|func2     |string| The full text of the second function                              |
|label     |bool  | 1 if the functions are semantically equivalent, 0 otherwise       |

### Data Splits

| name  |train |validation| test |
|-------|-----:|---------:|-----:|
|default|901028|    415416|415416|

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Data was mined from the IJaDataset 2.0 dataset.

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

Potential clones were identified automatically using search heuristics and then manually labeled by three judges.

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

Most of the clones are Type-1 and Type-2; Type-3 and especially Type-4 clones are rare.

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@inproceedings{svajlenko2014towards,
  title={Towards a big data curated benchmark of inter-project code clones},
  author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
  pages={476--480},
  year={2014},
  organization={IEEE}
}

@inproceedings{wang2020detecting,
  title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
  author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
  booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={261--271},
  year={2020},
  organization={IEEE}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
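
Records shaped like the one under Data Instances can be read into typed objects before training or analysis. The sketch below assumes each record arrives as a JSON object carrying the six fields from the Data Fields table; `ClonePair` and `label_counts` are hypothetical names, and the two function bodies are shortened placeholders rather than real dataset content.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ClonePair:
    """One record, mirroring the six fields of the 'default' config."""
    id: int
    id1: int
    id2: int
    func1: str
    func2: str
    label: bool

    @classmethod
    def from_json(cls, text: str) -> "ClonePair":
        raw = json.loads(text)
        return cls(
            id=int(raw["id"]),
            id1=int(raw["id1"]),
            id2=int(raw["id2"]),
            func1=str(raw["func1"]),
            func2=str(raw["func2"]),
            label=bool(raw["label"]),
        )


def label_counts(pairs: Iterable[ClonePair]) -> Counter:
    """Count how many pairs carry each label value."""
    return Counter(pair.label for pair in pairs)


# Placeholder record: ids and label copied from the card's example,
# function bodies shortened for illustration.
record_json = json.dumps({
    "func1": "public void first() {}\n",
    "func2": "public void second() {}\n",
    "id": 0,
    "id1": 2381663,
    "id2": 4458076,
    "label": False,
})

pair = ClonePair.from_json(record_json)
counts = label_counts([pair])
print(pair.id1, pair.id2, pair.label, dict(counts))
```

Counting labels this way is a quick check of class balance in a split before computing F1.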


Provider: maas

Created: 2025-04-21
