code_x_glue_cc_clone_detection_big_clone_bench
Source: ModelScope community (updated 2025-11-27, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/google/code_x_glue_cc_clone_detection_big_clone_bench
# Dataset Card for "code_x_glue_cc_clone_detection_big_clone_bench"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
### Dataset Summary
CodeXGLUE Clone-detection-BigCloneBench dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-BigCloneBench
Given two code snippets as input, the task is binary classification (0/1), where 1 means the snippets are semantically equivalent and 0 means they are not. Models are evaluated by F1 score.
The data comes from BigCloneBench, filtered following the paper Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree.
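As a rough illustration of the metric named above, the following sketch computes F1 for the positive class (1 = clone) from gold labels and predictions. The function name `f1_score` and the toy label lists are assumptions made for this example; the averaging mode used by the official evaluator is not stated in this card, so treat this as one plausible reading rather than the reference scorer.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class, where 1 means 'clone' and 0 means 'not a clone'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # clones correctly found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-clones flagged as clones
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # clones that were missed
    if tp == 0:
        # Precision or recall is zero (or undefined), so F1 is reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only.
gold = [1, 0, 1, 1, 0]
pred = [1, 0, 0, 1, 1]
score = f1_score(gold, pred)
print(f"F1 = {score:.4f}")
```

Here two clones are found, one is missed, and one non-clone is flagged, so precision and recall are both 2/3.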
### Supported Tasks and Leaderboards
- `semantic-similarity-classification`: The dataset can be used to train a model that classifies whether two given Java methods are clones of each other.
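To make the input and output of this task concrete, here is a deliberately naive baseline sketch: it tokenizes two Java methods and predicts "clone" when the Jaccard overlap of their token sets reaches a threshold. The tokenizer regex, the function names, and the 0.5 threshold are all assumptions chosen for illustration; this is not the method used by CodeXGLUE or by the papers cited in this card, and token overlap will miss most semantic clones.

```python
import re

# Identifiers/keywords, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Return the set of distinct tokens found in a code string."""
    return set(TOKEN_RE.findall(code))


def jaccard(code_a, code_b):
    """Jaccard similarity between the token sets of two code strings."""
    tokens_a, tokens_b = tokenize(code_a), tokenize(code_b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty inputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def predict_clone(func1, func2, threshold=0.5):
    """Predict True (clone) when token overlap meets the assumed threshold."""
    return jaccard(func1, func2) >= threshold


# Hypothetical Java methods, not taken from the dataset.
a = "int add(int a, int b) { return a + b; }"
b = "int add(int x, int y) { return x + y; }"
c = "void log(String s) { System.out.println(s); }"
print(predict_clone(a, b), predict_clone(a, c))
```

A trained model would replace `predict_clone`, but the interface stays the same: two method bodies in, one boolean label out.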
### Languages
- Java **programming** language
## Dataset Structure
### Data Instances
An example of 'test' looks as follows.
```
{
"func1": " @Test(expected = GadgetException.class)\n public void malformedGadgetSpecIsCachedAndThrows() throws Exception {\n HttpRequest request = createCacheableRequest();\n expect(pipeline.execute(request)).andReturn(new HttpResponse(\"malformed junk\")).once();\n replay(pipeline);\n try {\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n fail(\"No exception thrown on bad parse\");\n } catch (GadgetException e) {\n }\n specFactory.getGadgetSpec(createContext(SPEC_URL, false));\n }\n",
"func2": " public InputStream getInputStream() throws TGBrowserException {\n try {\n if (!this.isFolder()) {\n URL url = new URL(this.url);\n InputStream stream = url.openStream();\n return stream;\n }\n } catch (Throwable throwable) {\n throw new TGBrowserException(throwable);\n }\n return null;\n }\n",
"id": 0,
"id1": 2381663,
"id2": 4458076,
"label": false
}
```
### Data Fields
The data fields are explained below for each config. The data fields are the same among all splits.
#### default
|field name| type | description |
|----------|------|---------------------------------------------------|
|id |int32 | Index of the sample |
|id1 |int32 | The first function id |
|id2 |int32 | The second function id |
|func1 |string| The full text of the first function |
|func2 |string| The full text of the second function |
|label     |bool  | 1 (true) if the functions are equivalent, else 0  |
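The schema can be sketched as a small typed record with basic validation. The class name `ClonePair`, the `from_record` helper, and the placeholder method bodies are assumptions introduced for this example; the label convention (true means clone) follows the Dataset Summary, where 1 stands for semantic equivalence.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
FIELDS = ("id", "id1", "id2", "func1", "func2", "label")


@dataclass(frozen=True)
class ClonePair:
    """One record, mirroring the six fields of the default config."""
    id: int      # index of the sample (int32)
    id1: int     # id of the first function (int32)
    id2: int     # id of the second function (int32)
    func1: str   # full text of the first function
    func2: str   # full text of the second function
    label: bool  # assumed convention: True means the functions are clones

    @classmethod
    def from_record(cls, record):
        """Build a ClonePair from a dict, checking types and int32 range."""
        for key in ("id", "id1", "id2"):
            value = record[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{key} must be an int")
            if not INT32_MIN <= value <= INT32_MAX:
                raise ValueError(f"{key} does not fit in int32")
        for key in ("func1", "func2"):
            if not isinstance(record[key], str):
                raise TypeError(f"{key} must be a string")
        if not isinstance(record["label"], bool):
            raise TypeError("label must be a bool")
        return cls(**{key: record[key] for key in FIELDS})


# Ids and label match the 'test' example in this card; the function
# bodies are short placeholders rather than the real method text.
pair = ClonePair.from_record({
    "id": 0,
    "id1": 2381663,
    "id2": 4458076,
    "func1": "public void a() {}",
    "func2": "public void b() {}",
    "label": False,
})
print(pair.id1, pair.id2, pair.label)
```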
### Data Splits
| name |train |validation| test |
|-------|-----:|---------:|-----:|
|default|901028| 415416|415416|
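The relative size of each split follows directly from the counts in the table. The short sketch below does that arithmetic; the variable names are arbitrary and only the three counts come from this card.

```python
# Split sizes as listed in the table for the default config.
splits = {"train": 901028, "validation": 415416, "test": 415416}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    # Left-align the split name, right-align the count, show share as a percentage.
    print(f"{name:<10} {count:>8} {shares[name]:6.2%}")
print(f"{'total':<10} {total:>8}")
```

The splits sum to 1,731,860 pairs, which works out to roughly 52% train, 24% validation, and 24% test.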
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
Data was mined from the IJaDataset 2.0 dataset.
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
Potential clones were identified automatically using search heuristics and then manually labeled by three judges.
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
Most of the clones are type 1 and 2 with type 3 and especially type 4 being rare.
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@inproceedings{svajlenko2014towards,
title={Towards a big data curated benchmark of inter-project code clones},
author={Svajlenko, Jeffrey and Islam, Judith F and Keivanloo, Iman and Roy, Chanchal K and Mia, Mohammad Mamun},
booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
pages={476--480},
year={2014},
organization={IEEE}
}
@inproceedings{wang2020detecting,
title={Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree},
author={Wang, Wenhan and Li, Ge and Ma, Bo and Xia, Xin and Jin, Zhi},
booktitle={2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
pages={261--271},
year={2020},
organization={IEEE}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
Provider: maas
Created: 2025-04-21



