
ynklab/XCodeSearchNet

Source: Hugging Face. Updated 2023-07-12. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/ynklab/XCodeSearchNet
Resource description:
---
license: mit
language:
- en
- fr
- ja
- zh
tags:
- codesearch
pretty_name: XCodeSearchNet
---

[Paper on arXiv](https://arxiv.org/abs/2306.15604)

## pre-training data

You need to manually combine each dataset if you want to use a multilingual dataset.

```python
from datasets import load_dataset

xcsn_pt_python_en = load_dataset("ynklab/XCodeSearchNet", data_dir='pretraining/python/en')
"""
DatasetDict({
    train: Dataset({
        features: ['function_tokens', 'docstring'],
        num_rows: 453623
    })
    validation: Dataset({
        features: ['function_tokens', 'docstring'],
        num_rows: 4596
    })
    test: Dataset({
        features: ['function_tokens', 'docstring'],
        num_rows: 45283
    })
})
"""

print(xcsn_pt_python_en['train'][0])
"""
{
    'function_tokens': ['def', 'get_feature_ide_paths', '(', 'container_dir', ',', 'product_name', ')', ':', 'repo_name', '=', 'get_repo_name', '(', 'container_dir', ')', 'class', 'Paths', '(', 'object', ')', ':', 'feature_order_json', '=', 'os', '.', 'path', '.', 'join', '(', 'container_dir', ',', "'_lib/featuremodel/productline/feature_order.json'", ')', 'model_xml_path', '=', 'os', '.', 'path', '.', 'join', '(', 'container_dir', ',', "'_lib/featuremodel/productline/model.xml'", ')', 'config_file_path', '=', 'os', '.', 'path', '.', 'join', '(', 'container_dir', ',', "'_lib/featuremodel/productline/products/'", ',', 'repo_name', ',', 'product_name', ',', "'product.equation.config'", ')', 'equation_file_path', '=', 'os', '.', 'path', '.', 'join', '(', 'container_dir', ',', "'products'", ',', 'product_name', ',', "'product.equation'", ')', 'product_spec_path', '=', 'os', '.', 'path', '.', 'join', '(', 'container_dir', ',', "'_lib/featuremodel/productline/products/'", ',', 'repo_name', ',', "'product_spec.json'", ')', 'return', 'Paths'],
    'docstring': 'Takes the container_dir and the product name and returns all relevant paths from the\n feature_order_json to the config_file_path.\n :param container_dir: the full path of the container dir\n :param product_name: the name of the product\n :return: object with divert path attributes'
}
"""
```

## fine-tuning data

```python
from datasets import load_dataset

xcsn_ft_python_en = load_dataset("ynklab/XCodeSearchNet", data_dir='finetuning/python/en')
"""
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1648684
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 92426
    })
})
"""

print(xcsn_ft_python_en['train'][0])
"""
{
    'text': '1<CODESPLIT><CODESPLIT><CODESPLIT>Logs the definition of the object that was just auto - decorated inside the ipython notebook .<CODESPLIT>def _logdef ( self , n , o , otype ) : import re try : #The latest input cell will be the one that this got executed #from. TODO: actually, if acorn got imported after the fact, then #the import would have caused all the undecorated functions to be #decorated as soon as acorn imported. I suppose we just won\'t have #any code for that case. if otype == "classes" : cellno = max ( [ int ( k [ 2 : ] ) for k in self . shell . user_ns . keys ( ) if re . match ( "_i\\d+" , k ) ] ) elif otype == "functions" : cellno = int ( o . __code__ . co_filename . strip ( "<>" ) . split ( \'-\' ) [ 2 ] ) except : #This must not have been an ipython notebook declaration, so we #don\'t store the code. cellno = None pass code = "" if cellno is not None : cellstr = "_i{0:d}" . format ( cellno ) if cellstr in self . shell . user_ns : cellcode = self . shell . user_ns [ cellstr ] import ast astm = ast . parse ( cellcode ) ab = astm . body parts = { ab [ i ] . name : ( ab [ i ] . lineno , None if i + 1 >= len ( ab ) else ab [ i + 1 ] . lineno ) for i , d in enumerate ( ab ) } if n in parts : celllines = cellcode . split ( \'\\n\' ) start , end = parts [ n ] if end is not None : code = celllines [ start - 1 : end - 1 ] else : code = celllines [ start - 1 : ] #Now, we actually create the entry. Since the execution for function #definitions is almost instantaneous, we just log the pre and post #events at the same time. from time import time from acorn . logging . database import record entry = { "m" : "def" , "a" : None , "s" : time ( ) , "r" : None , "c" : code , } from acorn import msg record ( "__main__.{}" . format ( n ) , entry , diff = True ) msg . info ( entry , 1 )'
}
"""
```
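
The pre-training section says that the per-language datasets have to be combined manually to get a multilingual dataset. One way this might be done is sketched below. It assumes that every language listed in the card (`en`, `fr`, `ja`, `zh`) is stored under a `data_dir` that follows the same `stage/programming-language/natural-language` pattern as the two examples above, and that all parts share the same features; neither assumption is confirmed by the card. The helper names `build_data_dir` and `load_multilingual` are hypothetical.

```python
REPO_ID = "ynklab/XCodeSearchNet"

# Natural-language codes taken from the dataset card's `language` list.
NATURAL_LANGS = ["en", "fr", "ja", "zh"]


def build_data_dir(stage, prog_lang, nat_lang):
    """Build a data_dir string such as 'pretraining/python/en'.

    Assumption: all languages follow the layout shown for English.
    """
    return f"{stage}/{prog_lang}/{nat_lang}"


def load_multilingual(stage="pretraining", prog_lang="python",
                      nat_langs=NATURAL_LANGS, split="train"):
    """Load one split per natural language and concatenate the results."""
    # Imported lazily so the path helper above works without `datasets` installed.
    from datasets import concatenate_datasets, load_dataset

    parts = [
        load_dataset(REPO_ID,
                     data_dir=build_data_dir(stage, prog_lang, lang),
                     split=split)
        for lang in nat_langs
    ]
    # concatenate_datasets requires every part to have identical features.
    return concatenate_datasets(parts)
```

Calling `load_multilingual(split="validation")` would download four datasets, so it is worth trying a single language first to confirm the directory layout.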
Provided by:
ynklab
Summary of the original information

Dataset overview

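Pre-training dataset

  • Name: XCodeSearchNet pre-training dataset (Python/English)
  • Structure: training, validation, and test splits.
    • Training set: 453,623 rows
    • Validation set: 4,596 rows
    • Test set: 45,283 rows
  • Features:
    • function_tokens: the function's source code as a list of tokens
    • docstring: the function's docstring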
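
Because `function_tokens` is a list of tokens rather than a source string, a quick way to inspect a pre-training record is to join the tokens with spaces, which is also how code appears in the fine-tuning sample. The sketch below uses a made-up record in place of a downloaded one, and `tokens_to_text` is a hypothetical helper. Joining does not restore the original indentation or line breaks.

```python
def tokens_to_text(tokens):
    """Join code tokens into one space-separated string for display."""
    return " ".join(tokens)


# Made-up record with the same two fields as the pre-training data.
record = {
    "function_tokens": ["def", "f", "(", "x", ")", ":", "return", "x"],
    "docstring": "Returns x.",
}

preview = tokens_to_text(record["function_tokens"])
print(preview)
```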

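Fine-tuning dataset

  • Name: XCodeSearchNet fine-tuning dataset (Python/English)
  • Structure: training and validation splits.
    • Training set: 1,648,684 rows
    • Validation set: 92,426 rows
  • Features:
    • text: a single string holding the code and its description, with fields separated by `<CODESPLIT>`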
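
In the sample record shown in the resource description, the `text` field has five `<CODESPLIT>`-separated parts: a leading `1`, two empty parts, the description, and the tokenized code. A minimal parsing sketch under that reading follows. Treating the first part as an integer label is an assumption, and the two middle parts are left unnamed because the sample does not show what they hold. `FineTuningRecord` and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass

SEP = "<CODESPLIT>"


@dataclass
class FineTuningRecord:
    label: int      # assumed to be an integer label; the sample shows 1
    docstring: str  # natural-language description
    code: str       # space-separated code tokens


def parse_record(text):
    """Split one fine-tuning `text` value into its parts."""
    # maxsplit=4 keeps any later separator inside the code part.
    fields = text.split(SEP, 4)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    label, _unused_a, _unused_b, docstring, code = fields
    return FineTuningRecord(int(label), docstring.strip(), code.strip())


# Shortened stand-in for a real record.
sample = "1<CODESPLIT><CODESPLIT><CODESPLIT>Adds two numbers .<CODESPLIT>def add ( a , b ) : return a + b"
parsed = parse_record(sample)
print(parsed.docstring)
```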

Language support

  • Supported languages: English, French, Japanese, Chinese

License

  • MIT License