The Vault

Source: arXiv. Updated 2023-10-30; indexed 2024-06-21.
Download link:
https://github.com/FSoft-AI4Code/TheVault
Overview:
The Vault is a multilingual dataset of 43 million high-quality code-text pairs, built for training large language models (LLMs) to understand and generate code. Samples are extracted from raw source code with a combination of rule-based and deep-learning methods, which keeps each code snippet well aligned with its natural-language description. The dataset covers 10 popular programming languages (Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP), making it richer and more diverse than CodeSearchNet. The tools developed to construct and quality-check the code-text pairs were released to the open community through a public GitHub repository. LLMs fine-tuned on The Vault for code generation, code search, and code summarization outperform models trained on other comparable datasets. The gain is most visible in code generation, where results are scored with the pass@k metric on the HumanEval and MBPP benchmarks.
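The description mentions the pass@k metric without defining it. As a minimal sketch, the snippet below implements the unbiased pass@k estimator commonly used with HumanEval-style benchmarks: for a problem with n generated samples of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not taken from The Vault's repository; the exact evaluation setup used by the dataset authors is not stated here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: number of attempts the metric allows (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: any draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages the per-problem estimates for two hypothetical problems with 10 samples each. Using exact binomial coefficients via `math.comb` avoids the overflow concerns of naive factorial arithmetic, since Python integers are arbitrary precision.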
Provider:
FPT Software AI Center
Created:
2023-05-09