Fsoft-AIC/the-vault-function
Source: Hugging Face. Updated 2024-10-15; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/Fsoft-AIC/the-vault-function
---
language:
- code
- en
multilinguality:
- multiprogramming languages
task_categories:
- text-generation
license: mit
dataset_info:
  features:
  - name: identifier
    dtype: string
  - name: return_type
    dtype: string
  - name: repo
    dtype: string
  - name: path
    dtype: string
  - name: language
    dtype: string
  - name: code
    dtype: string
  - name: code_tokens
    dtype: string
  - name: original_docstring
    dtype: string
  - name: comment
    dtype: string
  - name: docstring_tokens
    dtype: string
  - name: docstring
    dtype: string
  - name: original_string
    dtype: string
pretty_name: The Vault Function
viewer: true
---
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Statistics](#dataset-statistics)
- [Usage](#usage)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156)
- **Contact:** support.ailab@fpt.com
- **Website:** https://www.fpt-aicenter.com/ai-residency/
<p align="center">
<img src="https://raw.githubusercontent.com/FSoft-AI4Code/TheVault/main/assets/the-vault-4-logo-png.png" width="300px" alt="logo">
</p>
<div align="center">
# The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
</div>
## Dataset Summary
The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.
The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. It provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.
## Supported Tasks
The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. Several code understanding and generation tasks can be constructed from The Vault, such as *code summarization*, *text-to-code generation*, and *code search*.
## Languages
The natural language text (docstring) is in English.
The Vault supports 10 programming languages: `Python`, `Java`, `JavaScript`, `PHP`, `C`, `C#`, `C++`, `Go`, `Ruby`, and `Rust`.
## Dataset Structure
### Data Instances
```
{
  "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
  "repo": "neumanna94/beepboop",
  "path": "js/scripts.js",
  "license": [
    "MIT"
  ],
  "language": "JavaScript",
  "identifier": "beepBoopSelector",
  "return_type": "<not_specific>",
  "original_string": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
  "original_docstring": "//Determines what beepBoop function to use",
  "docstring": "Determines what beepBoop function to use",
  "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
  "code": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
  "code_tokens": [
    "function", "beepBoopSelector", "(", "inputString", ",", "bbFunction", ")", "{",
    "if", "(", "bbFunction", "==", "1", ")", "{",
    "return", "beepBoop", "(", "inputString", ")", ";", "}",
    "else", "if", "(", "bbFunction", "==", "2", ")", "{",
    "return", "beepBoop2", "(", "inputString", ")", ";", "}",
    "else", "if", "(", "bbFunction", "==", "3", ")", "{",
    "return", "beepBoop3", "(", "inputString", ")", ";", "}",
    "else", "{", "}",
    "}"
  ],
  "short_docstring": "Determines what beepBoop function to use",
  "short_docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
  "comment": [],
  "parameters": [
    {
      "param": "inputString",
      "type": null
    },
    {
      "param": "bbFunction",
      "type": null
    }
  ],
  "docstring_params": {
    "returns": [],
    "raises": [],
    "params": [
      {
        "identifier": "inputString",
        "type": null,
        "docstring": null,
        "docstring_tokens": [],
        "default": null,
        "is_optional": null
      },
      {
        "identifier": "bbFunction",
        "type": null,
        "docstring": null,
        "docstring_tokens": [],
        "default": null,
        "is_optional": null
      }
    ],
    "outlier_params": [],
    "others": []
  }
}
```
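The nested `docstring_params` entry can be inspected with plain dictionary access. The following is a minimal sketch, assuming a record has already been loaded as a Python dict shaped like the instance above. The helper name `undocumented_params` is hypothetical and not part of The Vault tooling.

```python
# Abbreviated copy of the instance above (only the fields used here).
sample = {
    "identifier": "beepBoopSelector",
    "parameters": [
        {"param": "inputString", "type": None},
        {"param": "bbFunction", "type": None},
    ],
    "docstring_params": {
        "returns": [],
        "raises": [],
        "params": [
            {"identifier": "inputString", "type": None, "docstring": None},
            {"identifier": "bbFunction", "type": None, "docstring": None},
        ],
    },
}


def undocumented_params(record):
    """Return the names of parameters whose parsed docstring is empty.

    Hypothetical helper, for illustration only.
    """
    parsed = record["docstring_params"]["params"]
    return [p["identifier"] for p in parsed if not p.get("docstring")]


missing = undocumented_params(sample)
print(missing)
```

In this instance the docstring describes only the function as a whole, so both parameters are reported as undocumented.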
### Data Fields
Data fields at the function level:
- **hexsha** (string): the unique git hash of the file
- **repo** (string): the owner/repo
- **path** (string): the full path to the original file
- **license** (list): licenses in the repo
- **language** (string): the programming language
- **identifier** (string): the function or method name
- **return_type** (string): the type returned by the function
- **original_string** (string): original version of the function/class node
- **original_docstring** (string): the raw docstring before tokenization or parsing
- **code** (string): the part of the original that is code
- **code_tokens** (list): tokenized version of `code`
- **short_docstring** (string): short, brief summary (first line of the docstring)
- **short_docstring_tokens** (list): tokenized version of `short_docstring`
- **docstring** (string): the top-level comment or docstring (the docstring without parameter docs, return, exception fields, etc.)
- **docstring_tokens** (list): tokenized version of `docstring`
- **comment** (list): list of comments (lines) inside the function/class
- **parameters** (list): list of parameters and their types (a type can be None)
- **docstring_params** (dict): dictionary of the information parsed from the docstring
See [here](https://github.com/FSoft-AI4Code/TheVault/blob/main/data/README.md) for more details and examples.
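As an illustration of how these fields map onto a downstream task, the sketch below turns one record into a source/target pair for code summarization, using `code` as the input and `docstring` as the target. The function name `to_summarization_pair` and the `min_doc_tokens` threshold are assumptions made for this example, not part of The Vault.

```python
def to_summarization_pair(record, min_doc_tokens=3):
    """Map one record to a source/target pair for code summarization.

    Returns None when the docstring has fewer than `min_doc_tokens`
    tokens. The threshold is an illustrative assumption, not a rule
    applied by The Vault.
    """
    if len(record["docstring_tokens"]) < min_doc_tokens:
        return None
    return {
        "language": record["language"],
        "source": record["code"],
        "target": record["docstring"],
    }


# Abbreviated record using the field names listed above.
record = {
    "language": "JavaScript",
    "code": "function beepBoopSelector(inputString, bbFunction){ }",
    "docstring": "Determines what beepBoop function to use",
    "docstring_tokens": ["Determines", "what", "beepBoop", "function", "to", "use"],
}

pair = to_summarization_pair(record)
print(pair["target"])
```

Swapping `source` and `target` gives the corresponding text-to-code generation pair.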
### Data Splits
In this repo, The Vault is divided into 5 subsets: three training sets of different sizes, a validation set, and a test set. The training sets are the small (5%), medium (20%), and full (100%) versions of the full training set. The validation and test sets each contain approximately 20,000 samples per language. Per-language statistics for each split are given in the following section.
The dataset is deduplicated before splitting.
## Dataset Statistics
- Comparison with other benchmarks
| Dataset | # Languages | # Code-text pairs |
|:--------------------------|----------:|-----------------:|
| PyMT5 | 1 | ≈ 7,700,000 |
| CoDesc | 1 | 4,211,516 |
| CodeSearchNet | 6 | 2,326,976 |
| CodeSearchNet (CodeXGLUE) | 6 | 1,005,474 |
| Deepcom | 1 | 424,028 |
| CONCODE | 1 | 2,184,310 |
| Funcom | 1 | 2,149,121 |
| CodeT5 | 8 | 3,158,313 |
| **The Vault** | **10** | **34,098,775** |
- Statistics for each split
| | train/small | train/medium | train/full | validation | test | total |
|:-----------|------------:|-------------:|-----------:|-----------:|-------:|--------------:|
|Python | 370,657 | 1,952,110 | 7,772,647 | 30,992 | 21,652 | 7,825,291 |
|Java | 351,213 | 1,612,366 | 6,629,193 | 22,677 | 15,552 | 6,667,422 |
|JavaScript | 82,931 | 404,729 | 1,640,416 | 22,044 | 21,108 | 1,683,568 |
|PHP | 236,638 | 1,155,476 | 4,656,371 | 21,375 | 19,010 | 4,696,756 |
|C | 105,978 | 381,207 | 1,639,319 | 27,525 | 19,122 | 1,685,966 |
|C# | 141,090 | 783,166 | 3,305,891 | 24,787 | 19,638 | 3,350,316 |
|C++ | 87,420 | 410,907 | 1,671,268 | 20,011 | 18,169 | 1,709,448 |
|Go | 267,535 | 1,319,547 | 5,109,020 | 19,102 | 25,314 | 5,153,436 |
|Ruby | 23,921 | 112,574 | 424,339 | 17,338 | 19,908 | 461,585 |
|Rust | 35,367 | 224,015 | 825,130 | 16,716 | 23,141 | 864,987 |
|TOTAL | 1,702,750 | 8,356,097 |33,673,594 |222,567 |202,614 |**34,098,775** |
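
The totals in the split table can be checked with a few lines of arithmetic. The sketch below copies the `train/full`, `validation`, `test`, and `total` columns from the table and confirms that each row total is the sum of those three splits. `train/small` and `train/medium` are left out because they are versions of the full training set rather than additional data. The name `SPLITS` is chosen for this example only.

```python
# Per-language counts copied from the table above:
# (train/full, validation, test, total)
SPLITS = {
    "Python":     (7_772_647, 30_992, 21_652, 7_825_291),
    "Java":       (6_629_193, 22_677, 15_552, 6_667_422),
    "JavaScript": (1_640_416, 22_044, 21_108, 1_683_568),
    "PHP":        (4_656_371, 21_375, 19_010, 4_696_756),
    "C":          (1_639_319, 27_525, 19_122, 1_685_966),
    "C#":         (3_305_891, 24_787, 19_638, 3_350_316),
    "C++":        (1_671_268, 20_011, 18_169, 1_709_448),
    "Go":         (5_109_020, 19_102, 25_314, 5_153_436),
    "Ruby":       (424_339, 17_338, 19_908, 461_585),
    "Rust":       (825_130, 16_716, 23_141, 864_987),
}

# Each row total should equal train/full + validation + test.
for lang, (train_full, validation, test, total) in SPLITS.items():
    assert train_full + validation + test == total, lang

grand_total = sum(row[3] for row in SPLITS.values())
print(grand_total)  # 34098775
```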
## Usage
You can load The Vault with the `datasets` library (`pip install datasets`):
```python
from datasets import load_dataset

# Load the full function-level dataset (34M samples)
dataset = load_dataset("Fsoft-AIC/the-vault-function")

# Load the function-level train/validation/test set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])

# Load the "small" (or "medium", "full") version of the function-level training set
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])

# Load a specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=["python"])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
for sample in data["train"]:
    print(sample)
```
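When streaming, it is often enough to look at a handful of records rather than iterate over the whole split. The sketch below shows one way to do that with `itertools.islice`. It works on any iterable of record dicts, so it is demonstrated here on a small in-memory list rather than the real dataset. The helper name `take_documented` and the stand-in records are hypothetical.

```python
from itertools import islice


def take_documented(samples, n, language=None):
    """Return an iterator over at most `n` records with a non-empty docstring.

    If `language` is given, only records whose `language` field matches
    are kept. Hypothetical helper, for illustration only.
    """
    matches = (
        s for s in samples
        if s.get("docstring") and (language is None or s.get("language") == language)
    )
    return islice(matches, n)


# Stand-in records; with the real data this could be the streamed split.
fake_stream = [
    {"identifier": "a", "language": "Python", "docstring": "Adds two numbers"},
    {"identifier": "b", "language": "JavaScript", "docstring": "Selects a function"},
    {"identifier": "c", "language": "Python", "docstring": ""},
    {"identifier": "d", "language": "Python", "docstring": "Parses a file"},
]

first_python = [s["identifier"] for s in take_documented(fake_stream, 2, "Python")]
print(first_python)
```

Because the filter is a generator, records are read only as they are consumed, so no more of the stream is pulled than is needed to find `n` matches.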
A backup copy of the dataset can be downloaded from Azure Blob Storage. See [Download The Vault from Azure blob storage](https://github.com/FSoft-AI4Code/TheVault#download-via-link).
## Additional Information
### Licensing Information
MIT License
### Citation Information
```
@article{manh2023vault,
title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
journal={arXiv preprint arXiv:2305.06156},
year={2023}
}
```
### Contributions
This dataset is developed by [FSOFT AI4Code team](https://github.com/FSoft-AI4Code).
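Provider:
Fsoft-AIC
Summary of original information
Dataset overview
Dataset name
- Name: The Vault Function
- Alias: The Vault
Dataset description
- Overview: The Vault is a comprehensive multilingual dataset for advancing code understanding and generation. It contains high-quality code-text pairs extracted from The Stack dataset.
- Features: code snippets from 10 popular programming languages, with multiple code-snippet levels, metadata, and 11 docstring styles.
Supported tasks
- Task type: code understanding and generation
- Specific tasks: code summarization, text-to-code generation, code search, and others
Supported languages
- Programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, Rust
- Natural language: English
Dataset structure
- Data instances: each instance contains a code snippet together with its metadata and docstring.
- Data fields:
  - hexsha: the unique git hash of the file
  - repo: the repository owner/name
  - path: the full path to the original file
  - license: list of licenses in the repository
  - language: the programming language
  - identifier: the function or method name
  - return_type: the return type of the function
  - original_string: original version of the function/class
  - original_docstring: the raw string of the docstring
  - code: the code part
  - code_tokens: tokenized version of the code
  - short_docstring: short summary of the docstring
  - short_docstring_tokens: tokenized version of the short docstring
  - docstring: the top-level comment or docstring
  - docstring_tokens: tokenized version of the docstring
  - comment: list of comments inside the function/class
  - parameters: list of parameters and their types
  - docstring_params: dictionary of information parsed from the docstring
Dataset statistics
- Total samples: 34,098,775
- Training set versions: small (5%), medium (20%), full (100%)
- Validation and test set sizes: about 20,000 samples per set
Usage
- Loading the dataset: use the `datasets` library, which supports loading the full dataset, a specific split, or a specific language.
License information
- License: MIT License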
Contributors
- Development team: FSOFT AI4Code team



