
code_x_glue_cc_clone_detection_poj104

ModelScope community. Updated 2025-10-09. Indexed 2025-04-26.
Download link:
https://modelscope.cn/datasets/google/code_x_glue_cc_clone_detection_poj104
Resource description:
# Dataset Card for "code_x_glue_cc_clone_detection_poj_104"

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104

### Dataset Summary

CodeXGLUE Clone-detection-POJ-104 dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Clone-detection-POJ-104

Given a piece of code and a collection of candidates as input, the task is to return the top-K candidates with the same semantics. Models are evaluated by MAP (mean average precision) score. This task uses the POJ-104 dataset.

### Supported Tasks and Leaderboards

- `document-retrieval`: The dataset can be used to train a model for retrieving the top-k code snippets with the same semantics.

### Languages

- C++ **programming** language

## Dataset Structure

### Data Instances

An example of 'train' looks as follows.

```
{
    "code": "\nint f(int shu,int min)\n{ \n int k=1;\n if(shu < min)\n { \n k= 0; \n return k;\n } \n else\n {\n for(int i = min;i<shu;i++)\n { \n if(shu%i == 0)\n { \n k=k+ f(shu/i,i); \n } \n \n \n } \n return k; \n}\n} \n\nmain()\n{\n int n,i,a;\n scanf(\"%d\",&n);\n \n for(i=0;i<n;i++)\n {\n scanf(\"%d\",&a);\n \n if(i!=n-1) \n printf(\"%d\\n\",f(a,2));\n else\n printf(\"%d\",f(a,2)); \n \n \n \n } \n \n \n }",
    "id": 0,
    "label": "home"
}
```

### Data Fields

Each data field is explained below for each config. The data fields are the same among all splits.

#### default

| field name | type   | description                                        |
|------------|--------|----------------------------------------------------|
| id         | int32  | Index of the sample                                |
| code       | string | The full text of the function                      |
| label      | string | The ID of the problem that the source code solves  |

### Data Splits

| name    | train | validation |  test |
|---------|------:|-----------:|------:|
| default | 32000 |       8000 | 12000 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@inproceedings{mou2016convolutional,
  title={Convolutional neural networks over tree structures for programming language processing},
  author={Mou, Lili and Li, Ge and Zhang, Lu and Wang, Tao and Jin, Zhi},
  booktitle={Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence},
  pages={1287--1293},
  year={2016}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
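
As a supplementary illustration of the MAP metric named in the Dataset Summary, the sketch below computes a plain mean average precision over ranked candidates, treating a candidate as relevant when its `label` equals the query's `label`. It is a minimal sketch under stated assumptions, not the official CodeXGLUE evaluator, which may truncate or filter the ranking differently. The function names and the toy labels `"1"` and `"2"` are hypothetical and do not come from the dataset.

```python
from typing import List, Sequence, Tuple


def average_precision(query_label: str, ranked_labels: Sequence[str]) -> float:
    """Average precision for one query.

    ranked_labels holds the labels of all candidates, best match first.
    Assumption: the ranking covers every candidate, so the number of
    relevant items is the number of candidates sharing the query's label.
    """
    num_relevant = sum(1 for label in ranked_labels if label == query_label)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / num_relevant


def mean_average_precision(rankings: List[Tuple[str, Sequence[str]]]) -> float:
    """Mean of average_precision over (query_label, ranked_labels) pairs."""
    if not rankings:
        return 0.0
    return sum(average_precision(q, r) for q, r in rankings) / len(rankings)


if __name__ == "__main__":
    # Hypothetical toy rankings: two queries that both solve problem "1".
    toy = [
        ("1", ["1", "2", "1"]),
        ("1", ["2", "1", "1"]),
    ]
    print(mean_average_precision(toy))
```

In practice the ranked labels would come from sorting candidates by a model's similarity score to the query; how that score is produced is outside the scope of this sketch.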

Provider: maas
Created: 2025-04-21