
Nan-Do/code-search-net-javascript

Source: Hugging Face | Updated: 2023-05-15 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/Nan-Do/code-search-net-javascript
Resource description:
---
dataset_info:
  features:
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: func_name
    dtype: string
  - name: original_string
    dtype: string
  - name: language
    dtype: string
  - name: code
    dtype: string
  - name: code_tokens
    sequence: string
  - name: docstring
    dtype: string
  - name: docstring_tokens
    sequence: string
  - name: sha
    dtype: string
  - name: url
    dtype: string
  - name: partition
    dtype: string
  - name: summary
    dtype: string
  splits:
  - name: train
    num_bytes: 543032741
    num_examples: 138155
  download_size: 182237165
  dataset_size: 543032741
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
- summarization
language:
- en
tags:
- code
- javascript
- CodeSearchNet
- summary
pretty_name: JavaScript CodeSearchNet with Summaries
---

# Dataset Card for "code-search-net-javascript"

## Dataset Description

- **Homepage:** None
- **Repository:** https://huggingface.co/datasets/Nan-Do/code-search-net-JavaScript
- **Paper:** None
- **Leaderboard:** None
- **Point of Contact:** [@Nan-Do](https://github.com/Nan-Do)

### Dataset Summary

This dataset is the JavaScript portion of CodeSearchNet, annotated with a summary column. The code-search-net dataset consists of open source functions with comments, collected from GitHub. The summary is a short description of what the function does.

### Languages

The dataset's comments are in English and the functions are written in JavaScript.

### Data Splits

Train, test, and validation labels are included in the dataset as a column.

## Dataset Creation

May 2023

### Curation Rationale

This dataset can be used to generate instructional (or many other interesting) datasets that are useful for training LLMs.

### Source Data

The CodeSearchNet dataset can be found at https://www.kaggle.com/datasets/omduggineni/codesearchnet

### Annotations

This dataset includes a summary column containing a short description of each function.

#### Annotation process

The annotation was done using [Salesforce](https://huggingface.co/Salesforce) T5 summarization models. A sample notebook of the process can be found at https://github.com/Nan-Do/OpenAssistantInstructionResponsePython

The annotations have been cleaned to remove repetitions and meaningless summaries, although some may still be present in the dataset.

### Licensing Information

Apache 2.0
Provided by:
Nan-Do
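Summary of original information

Dataset card "code-search-net-javascript"

Dataset description

Dataset overview

This dataset is the JavaScript portion of CodeSearchNet with an added summary column. The CodeSearchNet dataset consists of open source functions and their comments, collected from GitHub. The summary is a short description of what the function does.

Languages

The comments in the dataset are in English, and the functions are written in JavaScript.

Data splits

The train, test, and validation labels are included in the dataset as a column.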
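
Because the split label is stored in a column rather than as separate splits, the records have to be grouped by that column before use. The following is a minimal sketch, assuming the label lives in the `partition` feature listed in the card metadata; the sample rows and the label values `train` and `test` are hypothetical, and the function name `group_by_partition` is illustrative only.

```python
from collections import defaultdict


def group_by_partition(rows):
    """Group records by the value of their `partition` column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["partition"]].append(row)
    return dict(groups)


# Hypothetical records; real rows also carry `code`, `docstring`, `summary`, etc.
sample_rows = [
    {"func_name": "add", "partition": "train"},
    {"func_name": "sub", "partition": "test"},
    {"func_name": "mul", "partition": "train"},
]

splits = group_by_partition(sample_rows)
for name, members in splits.items():
    print(name, len(members))
```

The same function should work on any iterable of dict-like records, for example the rows yielded by the Hugging Face `datasets` library after loading the single `train` split declared in the metadata.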

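Dataset creation

Creation date

May 2023

Curation rationale

This dataset can be used to generate instructional (or other interesting) datasets that are useful for training large language models (LLMs).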
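
One way to derive an instructional dataset from these records is to pair each summary with its function. The sketch below assumes the `summary` and `code` columns from the card metadata; the prompt wording, the output keys `instruction` and `response`, and the sample record are all hypothetical and not part of the original dataset.

```python
def to_instruction_pair(record):
    """Turn one record into an instruction/response pair.

    The prompt template is an illustrative assumption, not the
    dataset author's format.
    """
    summary = record["summary"].strip()
    return {
        "instruction": "Write a JavaScript function that does the following: " + summary,
        "response": record["code"].strip(),
    }


# Hypothetical record using the `summary` and `code` columns.
example = {
    "summary": "Add two numbers. ",
    "code": "function add(a, b) {\n  return a + b;\n}\n",
}

pair = to_instruction_pair(example)
print(pair["instruction"])
print(pair["response"])
```

Since the card notes that some repeated or meaningless summaries may remain, it may be worth filtering out empty or very short summaries before building pairs.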

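Source data

The CodeSearchNet dataset can be found on Kaggle.

Annotations

This dataset includes a summary column containing a short description of what each function does.

Annotation process

The annotation was done using Salesforce's T5 summarization models. A sample notebook of the process can be found on GitHub. The annotations have been cleaned to remove repetitions and meaningless summaries, although some may still be present in the dataset.

Licensing information

Apache 2.0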
