google/code_x_glue_tc_nl_code_search_adv

Name: google/code_x_glue_tc_nl_code_search_adv
Creator: google
Published: 2024-01-24 15:15:07
License: C-UDA (per the dataset card)

Hugging Face · Updated 2024-01-24 · Indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/google/code_x_glue_tc_nl_code_search_adv

Resource description:

---
annotations_creators:
- found
language_creators:
- found
language:
- code
- en
license:
- c-uda
multilinguality:
- other-programming-languages
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-retrieval
task_ids:
- document-retrieval
pretty_name: CodeXGlueTcNlCodeSearchAdv
dataset_info:
  features:
  - name: id
    dtype: int32
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: func_name
    dtype: string
  - name: original_string
    dtype: string
  - name: language
    dtype: string
  - name: code
    dtype: string
  - name: code_tokens
    sequence: string
  - name: docstring
    dtype: string
  - name: docstring_tokens
    sequence: string
  - name: sha
    dtype: string
  - name: url
    dtype: string
  - name: docstring_summary
    dtype: string
  - name: parameters
    dtype: string
  - name: return_statement
    dtype: string
  - name: argument_list
    dtype: string
  - name: identifier
    dtype: string
  - name: nwo
    dtype: string
  - name: score
    dtype: float32
  splits:
  - name: train
    num_bytes: 820714108
    num_examples: 251820
  - name: validation
    num_bytes: 23468758
    num_examples: 9604
  - name: test
    num_bytes: 47433608
    num_examples: 19210
  download_size: 316235421
  dataset_size: 891616474
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Dataset Card for "code_x_glue_tc_nl_code_search_adv"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits-sample-size)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv
- **Paper:** https://arxiv.org/abs/2102.04664

### Dataset Summary

CodeXGLUE NL-code-search-Adv dataset, available at https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv

The data comes from CodeSearchNet and is filtered as follows:

- Remove examples whose code cannot be parsed into an abstract syntax tree.
- Remove examples whose document has fewer than 3 or more than 256 tokens.
- Remove examples whose document contains special tokens (e.g. `<img ...>` or `https:...`).
- Remove examples whose document is not in English.

### Supported Tasks and Leaderboards

- `document-retrieval`: The dataset can be used to train a model for retrieving the top-k code snippets for a given **English** natural language query.

### Languages

- Python **programming** language
- English **natural** language

## Dataset Structure

### Data Instances

An example of 'validation' looks as follows.

```
{
  "argument_list": "",
  "code": "def Func(arg_0, arg_1='.', arg_2=True, arg_3=False, **arg_4):\n \"\"\"Downloads Dailymotion videos by URL.\n \"\"\"\n\n arg_5 = get_content(rebuilt_url(arg_0))\n arg_6 = json.loads(match1(arg_5, r'qualities\":({.+?}),\"'))\n arg_7 = match1(arg_5, r'\"video_title\"\\s*:\\s*\"([^\"]+)\"') or \\\n match1(arg_5, r'\"title\"\\s*:\\s*\"([^\"]+)\"')\n arg_7 = unicodize(arg_7)\n\n for arg_8 in ['1080','720','480','380','240','144','auto']:\n try:\n arg_9 = arg_6[arg_8][1][\"url\"]\n if arg_9:\n break\n except KeyError:\n pass\n\n arg_10, arg_11, arg_12 = url_info(arg_9)\n\n print_info(site_info, arg_7, arg_10, arg_12)\n if not arg_3:\n download_urls([arg_9], arg_7, arg_11, arg_12, arg_1=arg_1, arg_2=arg_2)",
  "code_tokens": ["def", "Func", "(", "arg_0", ",", "arg_1", "=", "'.'", ",", "arg_2", "=", "True", ",", "arg_3", "=", "False", ",", "**", "arg_4", ")", ":", "arg_5", "=", "get_content", "(", "rebuilt_url", "(", "arg_0", ")", ")", "arg_6", "=", "json", ".", "loads", "(", "match1", "(", "arg_5", ",", "r'qualities\":({.+?}),\"'", ")", ")", "arg_7", "=", "match1", "(", "arg_5", ",", "r'\"video_title\"\\s*:\\s*\"([^\"]+)\"'", ")", "or", "match1", "(", "arg_5", ",", "r'\"title\"\\s*:\\s*\"([^\"]+)\"'", ")", "arg_7", "=", "unicodize", "(", "arg_7", ")", "for", "arg_8", "in", "[", "'1080'", ",", "'720'", ",", "'480'", ",", "'380'", ",", "'240'", ",", "'144'", ",", "'auto'", "]", ":", "try", ":", "arg_9", "=", "arg_6", "[", "arg_8", "]", "[", "1", "]", "[", "\"url\"", "]", "if", "arg_9", ":", "break", "except", "KeyError", ":", "pass", "arg_10", ",", "arg_11", ",", "arg_12", "=", "url_info", "(", "arg_9", ")", "print_info", "(", "site_info", ",", "arg_7", ",", "arg_10", ",", "arg_12", ")", "if", "not", "arg_3", ":", "download_urls", "(", "[", "arg_9", "]", ",", "arg_7", ",", "arg_11", ",", "arg_12", ",", "arg_1", "=", "arg_1", ",", "arg_2", "=", "arg_2", ")"],
  "docstring": "Downloads Dailymotion videos by URL.",
  "docstring_summary": "Downloads Dailymotion videos by URL.",
  "docstring_tokens": ["Downloads", "Dailymotion", "videos", "by", "URL", "."],
  "func_name": "",
  "id": 0,
  "identifier": "dailymotion_download",
  "language": "python",
  "nwo": "soimort/you-get",
  "original_string": "",
  "parameters": "(url, output_dir='.', merge=True, info_only=False, **kwargs)",
  "path": "src/you_get/extractors/dailymotion.py",
  "repo": "",
  "return_statement": "",
  "score": 0.9997601509094238,
  "sha": "b746ac01c9f39de94cac2d56f665285b0523b974",
  "url": "https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35"
}
```

### Data Fields

Each data field is explained below for each config. The data fields are the same among all splits.

#### default

| field name        | type             | description                                                                         |
|-------------------|------------------|-------------------------------------------------------------------------------------|
| id                | int32            | Index of the sample                                                                 |
| repo              | string           | repo: the owner/repo                                                                |
| path              | string           | path: the full path to the original file                                            |
| func_name         | string           | func_name: the function or method name                                              |
| original_string   | string           | original_string: the raw string before tokenization or parsing                      |
| language          | string           | language: the programming language                                                  |
| code              | string           | code/function: the part of the original_string that is code                         |
| code_tokens       | Sequence[string] | code_tokens/function_tokens: tokenized version of code                              |
| docstring         | string           | docstring: the top-level comment or docstring, if it exists in the original string  |
| docstring_tokens  | Sequence[string] | docstring_tokens: tokenized version of docstring                                    |
| sha               | string           | sha of the file                                                                     |
| url               | string           | url of the file                                                                     |
| docstring_summary | string           | Summary of the docstring                                                            |
| parameters        | string           | parameters of the function                                                          |
| return_statement  | string           | return statement                                                                    |
| argument_list     | string           | list of arguments of the function                                                   |
| identifier        | string           | identifier                                                                          |
| nwo               | string           | nwo                                                                                 |
| score             | float32          | score for this search                                                               |

### Data Splits

| name    |  train | validation |  test |
|---------|-------:|-----------:|------:|
| default | 251820 |       9604 | 19210 |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Data from CodeSearchNet Challenge dataset.
[More Information Needed]

#### Who are the source language producers?

Software Engineering developers.

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

https://github.com/microsoft, https://github.com/madlag

### Licensing Information

Computational Use of Data Agreement (C-UDA) License.

### Citation Information

```
@article{DBLP:journals/corr/abs-2102-04664,
  author    = {Shuai Lu and Daya Guo and Shuo Ren and Junjie Huang and Alexey Svyatkovskiy and Ambrosio Blanco and Colin B. Clement and Dawn Drain and Daxin Jiang and Duyu Tang and Ge Li and Lidong Zhou and Linjun Shou and Long Zhou and Michele Tufano and Ming Gong and Ming Zhou and Nan Duan and Neel Sundaresan and Shao Kun Deng and Shengyu Fu and Shujie Liu},
  title     = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code Understanding and Generation},
  journal   = {CoRR},
  volume    = {abs/2102.04664},
  year      = {2021}
}

@article{husain2019codesearchnet,
  title={Codesearchnet challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}
```

### Contributions

Thanks to @madlag (and partly also @ncoop57) for adding this dataset.
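
As a usage illustration, here is a minimal sketch of loading one split and turning a row into a (query, code) pair. It assumes the Hugging Face `datasets` library is installed and that the repository id from the download link above is reachable. The helper names `to_query_code_pair` and `load_split` are hypothetical and not part of the dataset or the library, and using the joined `docstring_tokens` as the query is one possible choice, not something the card prescribes.

```python
from typing import Any, Dict, Tuple


def to_query_code_pair(row: Dict[str, Any]) -> Tuple[str, str]:
    """Return (natural-language query, code) for one dataset row.

    Hypothetical helper: the query is the whitespace-joined
    `docstring_tokens`, the retrieval target is the `code` field.
    """
    query = " ".join(row["docstring_tokens"])
    return query, row["code"]


def load_split(split: str = "validation"):
    """Download one split; needs `pip install datasets` and network access."""
    # Imported lazily so that the helper above stays usable offline.
    from datasets import load_dataset

    return load_dataset("google/code_x_glue_tc_nl_code_search_adv", split=split)
```

Calling `load_split("validation")` and passing its first row to `to_query_code_pair` should give a pair like the instance shown in the card, provided the download succeeds.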

Provider:

google

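Summary of original information

Dataset overview

Dataset name

Name: CodeXGlueTcNlCodeSearchAdv
Alias: CodeXGLUE NL-code-search-Adv

Dataset characteristics

Languages:
- Programming language: Python
- Natural language: English
License: C-UDA
Multilinguality: other programming languages
Size category: 100K<n<1M
Source datasets: original
Task category: text retrieval
Task ID: document retrieval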

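Dataset structure

Data instances: each instance contains fields such as id, repo, path, and func_name that describe a code snippet and its associated metadata.
Data fields: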
- id: int32
- repo: string
- path: string
- func_name: string
- original_string: string
- language: string
- code: string
- code_tokens: sequence[string]
- docstring: string
- docstring_tokens: sequence[string]
- sha: string
- url: string
- docstring_summary: string
- parameters: string
- return_statement: string
- argument_list: string
- identifier: string
- nwo: string
- score: float32
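
The field list above can be mirrored as a small type map for sanity-checking rows before training. This is only a sketch: `FIELD_TYPES` and `check_row` are hypothetical names, and Python's `int`, `float`, `str`, and `list` stand in for the int32, float32, string, and sequence[string] types.

```python
from typing import Any, Dict, List

# Python stand-ins for the declared feature types (an approximation:
# int32/float32 width is not checked here).
FIELD_TYPES: Dict[str, type] = {
    "id": int,
    "repo": str,
    "path": str,
    "func_name": str,
    "original_string": str,
    "language": str,
    "code": str,
    "code_tokens": list,
    "docstring": str,
    "docstring_tokens": list,
    "sha": str,
    "url": str,
    "docstring_summary": str,
    "parameters": str,
    "return_statement": str,
    "argument_list": str,
    "identifier": str,
    "nwo": str,
    "score": float,
}


def check_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row fits."""
    problems: List[str] = []
    for name, expected in FIELD_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"wrong type for {name}: {type(row[name]).__name__}")
        elif expected is list and not all(isinstance(t, str) for t in row[name]):
            problems.append(f"non-string token in {name}")
    return problems
```

A row that passes returns an empty list, so the check can be used as a filter or as an assertion in a data-loading test.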

Dataset splits

Training set: 251820 examples, 820714108 bytes
Validation set: 9604 examples, 23468758 bytes
Test set: 19210 examples, 47433608 bytes
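
The split figures can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers listed above; the `SPLITS` dictionary is an illustrative structure, not an API of the dataset.

```python
# Example and byte counts copied from the split listing.
SPLITS = {
    "train": {"num_examples": 251820, "num_bytes": 820714108},
    "validation": {"num_examples": 9604, "num_bytes": 23468758},
    "test": {"num_examples": 19210, "num_bytes": 47433608},
}

total_examples = sum(info["num_examples"] for info in SPLITS.values())
total_bytes = sum(info["num_bytes"] for info in SPLITS.values())

for name, info in SPLITS.items():
    share = info["num_examples"] / total_examples
    print(f"{name:<10} {info['num_examples']:>7} examples  {share:6.1%}")

print("total examples:", total_examples)
print("total bytes:   ", total_bytes)
```

The totals come to 280634 examples and 891616474 bytes, the latter matching the `dataset_size` value in the card metadata. The split is roughly 89.7% train, 3.4% validation, and 6.8% test.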

Dataset creation

Source data: from the CodeSearchNet Challenge dataset
Language producers: software engineering developers
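
The dataset card lists four filters applied to the CodeSearchNet examples. The sketch below is one possible reading of those rules, not the curators' actual script. The `keep_example` name, the regular expression for special tokens, and the ASCII-only test used as a crude stand-in for English detection are all assumptions. Python's own `ast` module is used for the parse check, which would also reject Python 2-only syntax.

```python
import ast
import re
from typing import List

# Assumed pattern for the "special tokens" rule (<img ...> or https:...).
SPECIAL_TOKEN_RE = re.compile(r"<img\b|https?:", re.IGNORECASE)


def parses_to_ast(code: str) -> bool:
    """True if the code can be parsed into a Python abstract syntax tree."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def keep_example(code: str, docstring: str, docstring_tokens: List[str]) -> bool:
    """Apply the four filters; return True if the example survives."""
    if not parses_to_ast(code):
        return False
    if len(docstring_tokens) < 3 or len(docstring_tokens) > 256:
        return False
    if SPECIAL_TOKEN_RE.search(docstring):
        return False
    # Crude proxy for "is English": reject any non-ASCII character.
    if not docstring.isascii():
        return False
    return True
```

A real pipeline would likely swap the ASCII check for a proper language identifier; it is kept here only so the sketch has no third-party dependencies.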

License information

License: Computational Use of Data Agreement (C-UDA)

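Collected summary

Dataset introduction

Background and challenges

Background overview

This dataset is the NL-code-search-Adv subset of the CodeXGLUE project, built for code retrieval: matching natural language queries to relevant Python code snippets. It is derived from CodeSearchNet data and filtered for quality, and it contains about 280,000 code-documentation pairs across training, validation, and test splits. The dataset focuses on pairing English natural language with Python code and is suited to training and evaluating document retrieval models.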
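
To make the retrieval task concrete, the sketch below ranks candidate code snippets for a query by token overlap and scores the rankings with mean reciprocal rank (MRR). Both the Jaccard-overlap scorer and the choice of MRR are illustrative assumptions, since this page does not specify a model or a metric, and the three-snippet corpus is made up for the example.

```python
import re
from typing import List, Sequence, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase and split into alphanumeric word pieces."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_codes(query: str, codes: Sequence[str]) -> List[int]:
    """Return candidate indices ordered from best to worst match."""
    q = tokenize(query)
    scores = [jaccard(q, tokenize(c)) for c in codes]
    return sorted(range(len(codes)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(queries: Sequence[str], codes: Sequence[str],
                         gold: Sequence[int]) -> float:
    """Average of 1 / rank of the correct snippet over all queries."""
    total = 0.0
    for query, gold_index in zip(queries, gold):
        ranking = rank_codes(query, codes)
        total += 1.0 / (ranking.index(gold_index) + 1)
    return total / len(queries)


# Toy corpus (invented for illustration, not taken from the dataset).
codes = [
    "def read_json(path):\n    return json.load(open(path))",
    "def download_file(url, output_dir):\n    save(fetch(url), output_dir)",
    "def sort_numbers(values):\n    return sorted(values)",
]
queries = [
    "download a file from a url",
    "sort a list of numbers",
    "read a json file from a path",
]
gold = [1, 2, 0]  # index of the correct snippet for each query

mrr = mean_reciprocal_rank(queries, codes, gold)
print(f"MRR on the toy corpus: {mrr:.2f}")
```

Lexical overlap works on this toy corpus because the identifiers are descriptive. The example instance in the dataset card shows normalized names such as `Func` and `arg_0`, so a purely lexical baseline would likely be much weaker on the real data.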

The content above was collected and summarized by 遇见数据集.
