code_x_glue_cc_clone_detection_poj104
Source: ModelScope community. Updated: 2025-10-09. Indexed: 2025-04-26.
Download link:
https://modelscope.cn/datasets/google/code_x_glue_cc_clone_detection_poj104
Resource description:
# Dataset Card for "code_x_glue_cc_clone_detection_poj_104"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104
### Dataset Summary
The CodeXGLUE Clone-detection-POJ-104 dataset is available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104
Given a piece of code and a collection of candidates as input, the task is to return the top-K candidates with the same semantics. Models are evaluated by mean average precision (MAP).
The task uses the POJ-104 dataset.
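
As a rough illustration of the MAP metric mentioned above, the following sketch computes average precision for one ranked candidate list and averages it over queries. The function names are hypothetical, and a candidate is assumed to be relevant when its `label` equals the query's `label`. The official CodeXGLUE evaluation script may truncate the ranking or normalize differently, so treat this as a sketch of the general idea rather than the benchmark's exact scoring.

```python
from typing import Hashable, Sequence


def average_precision(ranked_labels: Sequence[Hashable], query_label: Hashable) -> float:
    """Average precision of one ranked candidate list for one query.

    Assumption: a candidate counts as relevant when its label equals the
    query's label. Normalizes by the number of relevant items found in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    rankings: Sequence[Sequence[Hashable]], query_labels: Sequence[Hashable]
) -> float:
    """Mean of the per-query average precision values."""
    scores = [average_precision(r, q) for r, q in zip(rankings, query_labels)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two toy queries, both with label "a"; each list holds candidate labels in ranked order.
    rankings = [["a", "b", "a"], ["b", "a"]]
    print(mean_average_precision(rankings, ["a", "a"]))
```

A ranking that places all same-label candidates first scores 1.0; pushing them further down the list lowers the score.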
### Supported Tasks and Leaderboards
- `document-retrieval`: The dataset can be used to train a model that retrieves the top-k code snippets with the same semantics as a query.
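
One common way to frame this retrieval task is to embed each program as a vector and rank candidates by similarity to the query. The sketch below assumes such embeddings already exist (the toy vectors and the `top_k` helper are hypothetical, and cosine similarity is an illustrative choice, not something this dataset card prescribes).

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], candidate_vecs: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of the k candidates most similar to the query, best first."""
    order = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine(query_vec, candidate_vecs[i]),
        reverse=True,
    )
    return order[:k]


if __name__ == "__main__":
    # Hypothetical 2-dimensional embeddings; a real model would produce these from code.
    candidates = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.5]]
    print(top_k([1.0, 0.0], candidates, k=2))
```

The returned indices can be mapped back to candidate `label` values and scored with a ranking metric such as MAP.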
### Languages
- C++ **programming** language
## Dataset Structure
### Data Instances
An example from the 'train' split looks as follows.
```
{
"code": "\nint f(int shu,int min)\n{ \n int k=1;\n if(shu < min)\n { \n k= 0; \n return k;\n } \n else\n {\n for(int i = min;i<shu;i++)\n { \n if(shu%i == 0)\n { \n k=k+ f(shu/i,i); \n } \n \n \n } \n return k; \n}\n} \n\nmain()\n{\n int n,i,a;\n scanf(\"%d\",&n);\n \n for(i=0;i<n;i++)\n {\n scanf(\"%d\",&a);\n \n if(i!=n-1) \n printf(\"%d\\n\",f(a,2));\n else\n printf(\"%d\",f(a,2)); \n \n \n \n } \n \n \n }",
"id": 0,
"label": "home"
}
```
### Data Fields
The data fields are explained below for each config. They are the same across all splits.
#### default
|field name| type | description |
|----------|------|----------------------------------------------|
|id |int32 | Index of the sample |
|code |string| The full text of the function |
|label     |string| The ID of the problem that the source code solves|
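
Because `label` identifies the problem a program solves, records that share a label can be treated as clones of one another. The sketch below groups record ids by label and looks up the relevant candidates for a query. It operates on plain dictionaries with the three fields from the table above; the helper names and the toy records are hypothetical, and how the records are loaded is left open.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def ids_by_label(records: Iterable[Mapping]) -> Dict[str, List[int]]:
    """Group record ids by their problem label."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        groups[record["label"]].append(record["id"])
    return dict(groups)


def relevant_ids(records: List[Mapping], query_id: int) -> List[int]:
    """Ids of all other records that solve the same problem as the query record."""
    label_of = {record["id"]: record["label"] for record in records}
    groups = ids_by_label(records)
    return [i for i in groups[label_of[query_id]] if i != query_id]


if __name__ == "__main__":
    # Toy records with made-up code and labels, shaped like the fields above.
    toy = [
        {"id": 0, "code": "int main(){return 0;}", "label": "1"},
        {"id": 1, "code": "int main(){return 1;}", "label": "2"},
        {"id": 2, "code": "int main(){int x=0;return x;}", "label": "1"},
    ]
    print(relevant_ids(toy, query_id=0))
```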
### Data Splits
| name |train|validation|test |
|-------|----:|---------:|----:|
|default|32000| 8000|12000|
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
https://github.com/microsoft, https://github.com/madlag
### Licensing Information
Computational Use of Data Agreement (C-UDA) License.
### Citation Information
```
@inproceedings{mou2016convolutional,
title={Convolutional neural networks over tree structures for programming language processing},
author={Mou, Lili and Li, Ge and Zhang, Lu and Wang, Tao and Jin, Zhi},
booktitle={Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence},
pages={1287--1293},
year={2016}
}
```
### Contributions
Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
Provider: maas
Created: 2025-04-21



