
code_x_glue_tc_nl_code_search_adv

ModelScope community, updated 2025-12-05, indexed 2025-04-26
Download link:
https://modelscope.cn/datasets/google/code_x_glue_tc_nl_code_search_adv
Resource description:
# Dataset Card for "code_x_glue_tc_nl_code_search_adv"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv
- **Paper:** https://arxiv.org/abs/2102.04664

### Dataset Summary

CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv

The data comes from CodeSearchNet and is filtered as follows:

- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose documentation has fewer than 3 or more than 256 tokens.
- Remove examples whose documentation contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose documentation is not in English.

### Supported Tasks and Leaderboards

- `document-retrieval`: The dataset can be used to train a model that retrieves the top-k code snippets for a given **English** natural language query.
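The four filtering rules listed in the Dataset Summary can be sketched roughly as follows. This is only an illustration, not the curators' actual pipeline: the helper names (`parses_as_python`, `looks_english`, `keep_example`), the regular expression used for "special tokens", and the ASCII-only check standing in for English detection are all assumptions made for this example.

```python
import ast
import re
from typing import Sequence

# Assumed approximation of "special tokens": HTML-like tags or URLs.
SPECIAL_TOKEN_RE = re.compile(r"<[a-zA-Z][^>]*>|https?:")


def parses_as_python(code: str) -> bool:
    """Return True if the code can be parsed into a Python AST."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def looks_english(text: str) -> bool:
    """Crude stand-in for language detection (assumption): ASCII-only text."""
    return text.isascii()


def keep_example(code: str, docstring_tokens: Sequence[str]) -> bool:
    """Apply the four filters; True means the example is kept."""
    doc = " ".join(docstring_tokens)
    return (
        parses_as_python(code)
        and 3 <= len(docstring_tokens) <= 256
        and SPECIAL_TOKEN_RE.search(doc) is None
        and looks_english(doc)
    )
```

For instance, `keep_example("def f(x):\n    return x\n", ["Return", "the", "input", "."])` passes every check, while a two-token docstring or one containing a URL is rejected.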
### Languages

- Python **programming** language
- English **natural** language

## Dataset Structure

### Data Instances

An example from the 'validation' split looks as follows.

```
{
    "argument_list": "",
    "code": "def Func(arg_0, arg_1='.', arg_2=True, arg_3=False, **arg_4):\n \"\"\"Downloads Dailymotion videos by URL.\n \"\"\"\n\n arg_5 = get_content(rebuilt_url(arg_0))\n arg_6 = json.loads(match1(arg_5, r'qualities\":({.+?}),\"'))\n arg_7 = match1(arg_5, r'\"video_title\"\\s*:\\s*\"([^\"]+)\"') or \\\n match1(arg_5, r'\"title\"\\s*:\\s*\"([^\"]+)\"')\n arg_7 = unicodize(arg_7)\n\n for arg_8 in ['1080','720','480','380','240','144','auto']:\n try:\n arg_9 = arg_6[arg_8][1][\"url\"]\n if arg_9:\n break\n except KeyError:\n pass\n\n arg_10, arg_11, arg_12 = url_info(arg_9)\n\n print_info(site_info, arg_7, arg_10, arg_12)\n if not arg_3:\n download_urls([arg_9], arg_7, arg_11, arg_12, arg_1=arg_1, arg_2=arg_2)",
    "code_tokens": ["def", "Func", "(", "arg_0", ",", "arg_1", "=", "'.'", ",", "arg_2", "=", "True", ",", "arg_3", "=", "False", ",", "**", "arg_4", ")", ":", "arg_5", "=", "get_content", "(", "rebuilt_url", "(", "arg_0", ")", ")", "arg_6", "=", "json", ".", "loads", "(", "match1", "(", "arg_5", ",", "r'qualities\":({.+?}),\"'", ")", ")", "arg_7", "=", "match1", "(", "arg_5", ",", "r'\"video_title\"\\s*:\\s*\"([^\"]+)\"'", ")", "or", "match1", "(", "arg_5", ",", "r'\"title\"\\s*:\\s*\"([^\"]+)\"'", ")", "arg_7", "=", "unicodize", "(", "arg_7", ")", "for", "arg_8", "in", "[", "'1080'", ",", "'720'", ",", "'480'", ",", "'380'", ",", "'240'", ",", "'144'", ",", "'auto'", "]", ":", "try", ":", "arg_9", "=", "arg_6", "[", "arg_8", "]", "[", "1", "]", "[", "\"url\"", "]", "if", "arg_9", ":", "break", "except", "KeyError", ":", "pass", "arg_10", ",", "arg_11", ",", "arg_12", "=", "url_info", "(", "arg_9", ")", "print_info", "(", "site_info", ",", "arg_7", ",", "arg_10", ",", "arg_12", ")", "if", "not", "arg_3", ":", "download_urls", "(", "[", "arg_9", "]", ",", "arg_7", ",", "arg_11", ",", "arg_12", ",", "arg_1", "=", "arg_1", ",", "arg_2", "=", "arg_2", ")"],
    "docstring": "Downloads Dailymotion videos by URL.",
    "docstring_summary": "Downloads Dailymotion videos by URL.",
    "docstring_tokens": ["Downloads", "Dailymotion", "videos", "by", "URL", "."],
    "func_name": "",
    "id": 0,
    "identifier": "dailymotion_download",
    "language": "python",
    "nwo": "soimort/you-get",
    "original_string": "",
    "parameters": "(url, output_dir='.', merge=True, info_only=False, **kwargs)",
    "path": "src/you_get/extractors/dailymotion.py",
    "repo": "",
    "return_statement": "",
    "score": 0.9997601509094238,
    "sha": "b746ac01c9f39de94cac2d56f665285b0523b974",
    "url": "https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35"
}
```

### Data Fields

The data fields are explained below for each config. They are the same across all splits.

#### default

| field name | type | description |
|------------|------|-------------|
| id | int32 | Index of the sample |
| repo | string | The repository, as owner/repo |
| path | string | The full path to the original file |
| func_name | string | The function or method name |
| original_string | string | The raw string before tokenization or parsing |
| language | string | The programming language |
| code | string | The part of `original_string` that is code |
| code_tokens | Sequence[string] | Tokenized version of `code` |
| docstring | string | The top-level comment or docstring, if it exists in the original string |
| docstring_tokens | Sequence[string] | Tokenized version of `docstring` |
| sha | string | SHA of the file |
| url | string | URL of the file |
| docstring_summary | string | Summary of the docstring |
| parameters | string | Parameters of the function |
| return_statement | string | Return statement |
| argument_list | string | List of arguments of the function |
| identifier | string | Identifier (e.g. `dailymotion_download` in the instance above) |
| nwo | string | Repository name with owner (e.g. `soimort/you-get`) |
| score | datasets.Value("float") | Score for this search |

### Data Splits

| name | train | validation | test |
|---------|-------:|-----------:|------:|
| default | 251820 | 9604 | 19210 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Data from the CodeSearchNet Challenge dataset.

[More Information Needed]

#### Who are the source language producers?

Software developers.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@article{DBLP:journals/corr/abs-2102-04664,
  author  = {Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and
             Alexey Svyatkovskiy and Ambrosio Blanco and Colin B. Clement and
             Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and
             Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and
             Ming Gong and Ming Zhou and Nan Duan and Neel Sundaresan and
             Shao Kun Deng and Shengyu Fu and Shujie Liu},
  title   = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code
             Understanding and Generation},
  journal = {CoRR},
  volume  = {abs/2102.04664},
  year    = {2021}
}

@article{husain2019codesearchnet,
  title   = {Codesearchnet challenge: Evaluating the state of semantic code search},
  author  = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and
             Allamanis, Miltiadis and Brockschmidt, Marc},
  journal = {arXiv preprint arXiv:1909.09436},
  year    = {2019}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
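To make the `document-retrieval` task and the record layout more concrete, here is a toy sketch that ranks candidate records against a query using only the `id`, `code_tokens`, and `docstring_tokens` fields. The function names (`to_pair`, `overlap_score`, `top_k`) and the token-overlap scoring are assumptions made for illustration; this is not the CodeXGLUE baseline, and a practical system would more likely use a learned encoder, especially since function and argument names in this dataset are normalized (`Func`, `arg_0`, ...).

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple

Record = Dict[str, Any]


def to_pair(record: Record) -> Tuple[str, str]:
    """Turn one record into a (query, code) text pair."""
    query = " ".join(record["docstring_tokens"])
    code = " ".join(record["code_tokens"])
    return query, code


def overlap_score(query_tokens: Sequence[str], code_tokens: Sequence[str]) -> int:
    """Count case-insensitive tokens shared by the query and the code."""
    q = Counter(t.lower() for t in query_tokens)
    c = Counter(t.lower() for t in code_tokens)
    return sum((q & c).values())


def top_k(query_tokens: Sequence[str], candidates: Sequence[Record], k: int = 1) -> List[int]:
    """Return the ids of the k candidates with the highest overlap score."""
    ranked = sorted(
        candidates,
        key=lambda r: overlap_score(query_tokens, r["code_tokens"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k]]
```

Swapping `overlap_score` for a similarity computed from model embeddings would keep the same `top_k` interface.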

Provider:
maas
Created:
2025-04-21