five

code_x_glue_cc_cloze_testing_all

收藏
魔搭社区2025-12-05 更新2025-04-26 收录
下载链接:
https://modelscope.cn/datasets/google/code_x_glue_cc_cloze_testing_all
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for "code_x_glue_cc_cloze_testing_all" ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits-sample-size) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all ### Dataset Summary CodeXGLUE ClozeTesting-all dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all Cloze tests are widely adopted in Natural Languages Processing to evaluate the performance of the trained language models. The task is aimed to predict the answers for the blank with the context of the blank, which can be formulated as a multi-choice classification problem. Here we present the two cloze testing datasets in code domain with six different programming languages: ClozeTest-maxmin and ClozeTest-all. Each instance in the dataset contains a masked code function, its docstring and the target word. The only difference between ClozeTest-maxmin and ClozeTest-all is their selected words sets, where ClozeTest-maxmin only contains two words while ClozeTest-all contains 930 words. ### Supported Tasks and Leaderboards - `slot-filling`: The dataset can be used to train a model for predicting the missing token from a piece of code, similar to the Cloze test. ### Languages - Go **programming** language - Java **programming** language - Javascript **programming** language - PHP **programming** language - Python **programming** language - Ruby **programming** language ## Dataset Structure ### Data Instances #### go An example of 'train' looks as follows. ``` { "id": 0, "idx": "all-1", "nl_tokens": ["MarshalJSON", "supports", "json", ".", "Marshaler", "interface"], "pl_tokens": ["func", "(", "v", "ContextRealtimeData", ")", "MarshalJSON", "(", ")", "(", "[", "]", "byte", ",", "error", ")", "{", "w", ":=", "jwriter", ".", "<mask>", "{", "}", "\n", "easyjsonC5a4559bEncodeGithubComChromedpCdprotoWebaudio7", "(", "&", "w", ",", "v", ")", "\n", "return", "w", ".", "Buffer", ".", "BuildBytes", "(", ")", ",", "w", ".", "Error", "\n", "}"] } ``` #### java An example of 'train' looks as follows. ``` { "id": 0, "idx": "all-1", "nl_tokens": ["/", "*", "(", "non", "-", "Javadoc", ")"], "pl_tokens": ["@", "Override", "public", "int", "peekBit", "(", ")", "throws", "AACException", "{", "int", "ret", ";", "if", "(", "bitsCached", ">", "0", ")", "{", "ret", "=", "(", "cache", ">>", "(", "bitsCached", "-", "1", ")", ")", "&", "1", ";", "}", "else", "{", "final", "int", "word", "=", "readCache", "(", "true", ")", ";", "ret", "=", "(", "<mask>", ">>", "WORD_BITS", "-", "1", ")", "&", "1", ";", "}", "return", "ret", ";", "}"] } ``` #### javascript An example of 'train' looks as follows. ``` { "id": 0, "idx": "all-1", "nl_tokens": ["Cast", "query", "params", "according", "to", "type"], "pl_tokens": ["function", "castQueryParams", "(", "relId", ",", "data", ",", "{", "relationships", "}", ")", "{", "const", "relationship", "=", "relationships", "[", "relId", "]", "if", "(", "!", "relationship", ".", "query", ")", "{", "return", "{", "}", "}", "return", "Object", ".", "keys", "(", "relationship", ".", "query", ")", ".", "reduce", "(", "(", "params", ",", "<mask>", ")", "=>", "{", "const", "value", "=", "getField", "(", "data", ",", "relationship", ".", "query", "[", "key", "]", ")", "if", "(", "value", "===", "undefined", ")", "{", "throw", "new", "TypeError", "(", "'Missing value for query param'", ")", "}", "return", "{", "...", "params", ",", "[", "key", "]", ":", "value", "}", "}", ",", "{", "}", ")", "}"] } ``` #### php An example of 'train' looks as follows. ``` { "id": 0, "idx": "all-1", "nl_tokens": ["Get", "choices", "."], "pl_tokens": ["protected", "<mask>", "getChoices", "(", "FormFieldTranslation", "$", "translation", ")", "{", "$", "choices", "=", "preg_split", "(", "'/\\r\\n|\\r|\\n/'", ",", "$", "translation", "->", "getOption", "(", "'choices'", ")", ",", "-", "1", ",", "PREG_SPLIT_NO_EMPTY", ")", ";", "return", "array_combine", "(", "$", "choices", ",", "$", "choices", ")", ";", "}"] } ``` #### python An example of 'train' looks as follows. ``` { "id": 0, "idx": "all-1", "nl_tokens": ["Post", "a", "review"], "pl_tokens": ["def", "post_review", "(", "session", ",", "review", ")", ":", "# POST /api/projects/0.1/reviews/", "<mask>", "=", "make_post_request", "(", "session", ",", "'reviews'", ",", "json_data", "=", "review", ")", "json_data", "=", "response", ".", "json", "(", ")", "if", "response", ".", "status_code", "==", "200", ":", "return", "json_data", "[", "'status'", "]", "else", ":", "raise", "ReviewNotPostedException", "(", "message", "=", "json_data", "[", "'message'", "]", ",", "error_code", "=", "json_data", "[", "'error_code'", "]", ",", "request_id", "=", "json_data", "[", "'request_id'", "]", ")"] } ``` #### ruby An example of 'train' looks as follows. ``` { "id": 0, "idx": "all-1", "nl_tokens": ["By", "default", "taskers", "don", "t", "see", "the", "flor", "variables", "in", "the", "execution", ".", "If", "include_vars", "or", "exclude_vars", "is", "present", "in", "the", "configuration", "of", "the", "tasker", "some", "or", "all", "of", "the", "variables", "are", "passed", "."], "pl_tokens": ["def", "gather_vars", "(", "executor", ",", "tconf", ",", "message", ")", "# try to return before a potentially costly call to executor.vars(nid)", "return", "nil", "if", "(", "tconf", ".", "keys", "&", "%w[", "include_vars", "exclude_vars", "]", ")", ".", "empty?", "# default behaviour, don't pass variables to taskers", "iv", "=", "expand_filter", "(", "tconf", "[", "'include_vars'", "]", ")", "return", "nil", "if", "iv", "==", "false", "ev", "=", "expand_filter", "(", "tconf", "[", "'exclude_vars'", "]", ")", "return", "{", "}", "if", "ev", "==", "true", "vars", "=", "executor", ".", "vars", "(", "message", "[", "'nid'", "]", ")", "return", "vars", "if", "iv", "==", "true", "vars", "=", "vars", ".", "select", "{", "|", "k", ",", "v", "|", "var_match", "(", "k", ",", "iv", ")", "}", "if", "<mask>", "vars", "=", "vars", ".", "reject", "{", "|", "k", ",", "v", "|", "var_match", "(", "k", ",", "ev", ")", "}", "if", "ev", "vars", "end"] } ``` ### Data Fields In the following each data field in go is explained for each config. The data fields are the same among all splits. #### go, java, javascript, php, python, ruby |field name| type | description | |----------|----------------|------------------------------| |id |int32 | Index of the sample | |idx |string | Original index in the dataset| |nl_tokens |Sequence[string]| Natural language tokens | |pl_tokens |Sequence[string]| Programming language tokens | ### Data Splits | name |train| |----------|----:| |go |25282| |java |40492| |javascript|13837| |php |51930| |python |40137| |ruby | 4437| ## Dataset Creation ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization Data from CodeSearchNet Challenge dataset. [More Information Needed] #### Who are the source language producers? Software Engineering developers. ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? [More Information Needed] ### Personal and Sensitive Information [More Information Needed] ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed] ### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators https://github.com/microsoft, https://github.com/madlag ### Licensing Information Computational Use of Data Agreement (C-UDA) License. ### Citation Information ``` @article{CodeXGLUE, title={CodeXGLUE: An Open Challenge for Code Intelligence}, journal={arXiv}, year={2020}, } @article{feng2020codebert, title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages}, author={Feng, Zhangyin and Guo, Daya and Tang, Duyu and Duan, Nan and Feng, Xiaocheng and Gong, Ming and Shou, Linjun and Qin, Bing and Liu, Ting and Jiang, Daxin and others}, journal={arXiv preprint arXiv:2002.08155}, year={2020} } @article{husain2019codesearchnet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc}, journal={arXiv preprint arXiv:1909.09436}, year={2019} } ``` ### Contributions Thanks to @madlag (and partly also @ncoop57) for adding this dataset.

# "code_x_glue_cc_cloze_testing_all" 数据集卡片 ## 目录 - [数据集描述](#dataset-description) - [数据集概述](#dataset-summary) - [支持任务与评测榜单](#supported-tasks-and-leaderboards) - [支持语言](#languages) - [数据集结构](#dataset-structure) - [数据实例](#data-instances) - [数据字段](#data-fields) - [数据划分](#data-splits-sample-size) - [数据集构建](#dataset-creation) - [构建初衷](#curation-rationale) - [数据源](#source-data) - [标注信息](#annotations) - [个人与敏感信息](#personal-and-sensitive-information) - [数据集使用注意事项](#considerations-for-using-the-data) - [数据集的社会影响](#social-impact-of-dataset) - [偏差讨论](#discussion-of-biases) - [其他已知局限](#other-known-limitations) - [附加信息](#additional-information) - [数据集维护者](#dataset-curators) - [授权信息](#licensing-information) - [引用信息](#citation-information) - [贡献者](#contributions) ## 数据集描述 - **主页:** https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all ### 数据集概述 CodeXGLUE ClozeTesting-all 数据集可通过 https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/ClozeTesting-all 获取。完形填空测试(Cloze test)在自然语言处理(Natural Language Processing,NLP)中被广泛用于评估训练后的语言模型性能。该任务旨在利用空白处的上下文预测空白答案,可被建模为多项选择分类问题。本团队构建了面向代码领域的两类完形填空测试数据集:ClozeTest-maxmin 与 ClozeTest-all,涵盖六种不同的编程语言。数据集中的每个实例均包含一个掩码后的代码函数、对应的文档字符串(docstring)以及目标词。两类数据集的唯一区别在于其选用的词集:ClozeTest-maxmin 仅包含2个词,而 ClozeTest-all 包含930个词。 ### 支持任务与评测榜单 - `槽位填充(slot-filling)`:该数据集可用于训练模型,以从代码片段中预测缺失的Token(Token),与完形填空测试任务类似。 ### 支持语言 - Go编程语言 - Java编程语言 - Javascript编程语言 - PHP编程语言 - Python编程语言 - Ruby编程语言 ## 数据集结构 ### 数据实例 #### Go 训练集的一个示例如下所示。 { "id": 0, "idx": "all-1", "nl_tokens": ["MarshalJSON", "supports", "json", ".", "Marshaler", "interface"], "pl_tokens": ["func", "(", "v", "ContextRealtimeData", ")", "MarshalJSON", "(", ")", "(", "[", "]", "byte", ",", "error", ")", "{", "w", ":=", "jwriter", ".", "<mask>", "{", "}", " ", "easyjsonC5a4559bEncodeGithubComChromedpCdprotoWebaudio7", "(", "&", "w", ",", "v", ")", " ", "return", "w", ".", "Buffer", ".", "BuildBytes", "(", ")", ",", "w", ".", "Error", " ", "}"] } #### Java 训练集的一个示例如下所示。 { "id": 0, "idx": "all-1", "nl_tokens": ["/", "*", "(", "non", "-", "Javadoc", ")"], "pl_tokens": ["@", "Override", "public", "int", "peekBit", "(", ")", "throws", "AACException", "{", "int", "ret", ";", "if", "(", "bitsCached", ">", "0", ")", "{", "ret", "=", "(", "cache", ">>", "(", "bitsCached", "-", "1", ")", ")", "&", "1", ";", "}", "else", "{", "final", "int", "word", "=", "readCache", "(", "true", ")", ";", "ret", "=", "(", "<mask>", ">>", "WORD_BITS", "-", "1", ")", "&", "1", ";", "}", "return", "ret", ";", "}"] } #### Javascript 训练集的一个示例如下所示。 { "id": 0, "idx": "all-1", "nl_tokens": ["Cast", "query", "params", "according", "to", "type"], "pl_tokens": ["function", "castQueryParams", "(", "relId", ",", "data", ",", "{", "relationships", "}", ")", "{", "const", "relationship", "=", "relationships", "[", "relId", "]", "if", "(", "!", "relationship", ".", "query", ")", "{", "return", "{", "}", "}", "return", "Object", ".", "keys", "(", "relationship", ".", "query", ")", ".", "reduce", "(", "(", "params", ",", "<mask>", ")", "=>", "{", "const", "value", "=", "getField", "(", "data", ",", "relationship", ".", "query", "[", "key", "]", ")", "if", "(", "value", "===", "undefined", ")", "{", "throw", "new", "TypeError", "(", "'Missing value for query param'", ")", "}", "return", "{", "...", "params", ",", "[", "key", "]", ":", "value", "}", "}", ",", "{", "}", ")", "}"] } #### PHP 训练集的一个示例如下所示。 { "id": 0, "idx": "all-1", "nl_tokens": ["Get", "choices", "."], "pl_tokens": ["protected", "<mask>", "getChoices", "(", "FormFieldTranslation", "$", "translation", ")", "{", "$", "choices", "=", "preg_split", "(", "'/\r\n|\r|\n/'", ",", "$", "translation", "->", "getOption", "(", "'choices'", ")", ",", "-", "1", ",", "PREG_SPLIT_NO_EMPTY", ")", ";", "return", "array_combine", "(", "$", "choices", ",", "$", "choices", ")", ";", "}"] } #### Python 训练集的一个示例如下所示。 { "id": 0, "idx": "all-1", "nl_tokens": ["Post", "a", "review"], "pl_tokens": ["def", "post_review", "(", "session", ",", "review", ")", ":", "# POST /api/projects/0.1/reviews/", "<mask>", "=", "make_post_request", "(", "session", ",", "'reviews'", ",", "json_data", "=", "review", ")", "json_data", "=", "response", ".", "json", "(", ")", "if", "response", ".", "status_code", "==", "200", ":", "return", "json_data", "[", "'status'", "]", "else", ":", "raise", "ReviewNotPostedException", "(", "message", "=", "json_data", "[", "'message'", "]", ",", "error_code", "=", "json_data", "[", "'error_code'", "]", ",", "request_id", "=", "json_data", "[", "'request_id'", "]", ")"] } #### Ruby 训练集的一个示例如下所示。 { "id": 0, "idx": "all-1", "nl_tokens": ["By", "default", "taskers", "don", "t", "see", "the", "flor", "variables", "in", "the", "execution", ".", "If", "include_vars", "or", "exclude_vars", "is", "present", "in", "the", "configuration", "of", "the", "tasker", "some", "or", "all", "of", "the", "variables", "are", "passed", "."], "pl_tokens": ["def", "gather_vars", "(", "executor", ",", "tconf", ",", "message", ")", "# try to return before a potentially costly call to executor.vars(nid)", "return", "nil", "if", "(", "tconf", ".", "keys", "&", "%w[", "include_vars", "exclude_vars", "]", ")", ".", "empty?", "# default behaviour, don't pass variables to taskers", "iv", "=", "expand_filter", "(", "tconf", "[", "'include_vars'", "]", ")", "return", "nil", "if", "iv", "==", "false", "ev", "=", "expand_filter", "(", "tconf", "[", "'exclude_vars'", "]", ")", "return", "{", "}", "if", "ev", "==", "true", "vars", "=", "executor", ".", "vars", "(", "message", "[", "'nid'", "]", ")", "return", "vars", "if", "iv", "==", "true", "vars", "=", "vars", ".", "select", "{", "|", "k", ",", "v", "|", "var_match", "(", "k", ",", "iv", ")", "}", "if", "<mask>", "vars", "=", "vars", ".", "reject", "{", "|", "k", ",", "v", "|", "var_match", "(", "k", ",", "ev", ")", "}", "if", "ev", "vars", "end"] } ### 数据字段 下文将针对每种配置分别解释Go语言对应的数据字段,所有数据划分下的数据字段均保持一致。 #### Go、Java、Javascript、PHP、Python、Ruby |字段名| 类型 | 描述 | |----------|----------------|------------------------------| |id |int32 | 样本索引 | |idx |string | 数据集中的原始索引| |nl_tokens |序列[string]| 自然语言标记(natural language tokens)| |pl_tokens |序列[string]| 编程语言标记(programming language tokens)| ### 数据划分 | 名称 |训练集样本数| |----------|----:| |Go |25282| |Java |40492| |Javascript|13837| |PHP |51930| |Python |40137| |Ruby | 4437| ## 数据集构建 ### 构建初衷 [需要更多信息] ### 数据源 #### 初始数据收集与标准化 数据来源于CodeSearchNet Challenge数据集。 [需要更多信息] #### 源代码生产者是谁? 软件工程开发者。 ### 标注信息 #### 标注流程 [需要更多信息] #### 标注者是谁? [需要更多信息] ### 个人与敏感信息 [需要更多信息] ## 数据集使用注意事项 ### 数据集的社会影响 [需要更多信息] ### 偏差讨论 [需要更多信息] ### 其他已知局限 [需要更多信息] ## 附加信息 ### 数据集维护者 https://github.com/microsoft, https://github.com/madlag ### 授权信息 数据计算使用协议(Computational Use of Data Agreement,C-UDA)许可证。 ### 引用信息 @article{CodeXGLUE, title={CodeXGLUE: An Open Challenge for Code Intelligence}, journal={arXiv}, year={2020}, } @article{feng2020codebert, title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages}, author={Feng, Zhangyin and Guo, Daya and Tang, Duyu and Duan, Nan and Feng, Xiaocheng and Gong, Ming and Shou, Linjun and Qin, Bing and Liu, Ting and Jiang, Daxin and others}, journal={arXiv preprint arXiv:2002.08155}, year={2020} } @article{husain2019codesearchnet, title={CodeSearchNet Challenge: Evaluating the State of Semantic Code Search}, author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc}, journal={arXiv preprint arXiv:1909.09436}, year={2019} } ### 贡献者 感谢@madlag(部分贡献来自@ncoop57)添加本数据集。
提供机构:
maas
创建时间:
2025-04-21
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作