
Fsoft-AIC/the-vault-class

Source: Hugging Face. Updated: 2023-10-11. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Fsoft-AIC/the-vault-class
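Overview:
The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs extracted from The Stack, the largest permissively licensed source code dataset. It provides code snippets in 10 popular programming languages (Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP) at multiple code-snippet levels, together with metadata and 11 docstring styles for enhanced usability and versatility.
Provider: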
Fsoft-AIC
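Summary of the original dataset card

Dataset overview

Dataset description

The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.

Dataset summary

The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. It provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.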

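Supported tasks

The Vault can be used for pretraining large language models or for downstream code-text interaction tasks. A range of tasks related to code understanding and generation can be built from it, for example code summarization, text-to-code generation, and code search.

Languages

The natural-language text (docstrings) is in English.

The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, and Rust.

Dataset structure

Data instance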

{
"hexsha": "78b961a6673ec1e12f8d95c33ef081f75561a87c",
"repo": "AIS-Bonn/sl-cutscenes",
"path": "sl_cutscenes/object_models.py",
"license": ["MIT"],
"language": "Python",
"identifier": "MeshLoader",
"original_docstring": " Class to load the meshes for the objects in a scene. ",
"docstring": "Class to load the meshes for the objects in a scene.",
"docstring_tokens": ["Class", "to", "load", "the", "meshes", "for", "the", "objects", "in", "a", "scene", "."],
"code": "class MeshLoader:
    """
    Class to load the meshes for the objects in a scene.
    """

    def __init__(self):
        """Module initializer"""
        self.base_dir = CONSTANTS.MESH_BASE_DIR
        self.text_dir = CONSTANTS.TEXT_BASE_DIR
        self.reset()

    def reset(self):
        self.loaded_meshes = []

    def get_meshes(self):
        """ """
        extract_singular = lambda x: x[0] if len(x) == 1 else x
        return [extract_singular(item) for item in self.loaded_meshes]

    def load_meshes(self, obj_info: List[object_info.ObjectInfo], **kwargs):
        """
        Loads the meshes whose information is given in parameter obj_info.
        Each call of this method APPENDS a list to the loaded_meshes attribute.
        :param obj_info: The object information of the meshes to be loaded.
        :param kwargs: additional mesh modifiers such as scale, specified with a leading mod_
        """
        paths = []
        for obj in obj_info:
            path = self.text_dir if obj.name.endswith("_floor") or obj.name.endswith("_wall") else self.base_dir
            paths.append((path / obj.mesh_fp).resolve())
        scales = [obj.scale for obj in obj_info]
        class_ids = [obj.class_id for obj in obj_info]
        mod_scales = kwargs.get("mod_scale", [1.0] * len(scales))
        scales = [s * ms for (s, ms) in zip(scales, mod_scales)]
        flags = [mesh_flags(obj) for obj in obj_info]
        meshes = sl.Mesh.load_threaded(filenames=paths, flags=flags)

        # Setup class IDs
        for _, (mesh, scale, class_id) in enumerate(zip(meshes, scales, class_ids)):
            pt = torch.eye(4)
            pt[:3, :3] *= scale
            mesh.pretransform = pt
            mesh.class_index = class_id

        info_mesh_tuples = list(zip(obj_info, meshes))
        self.loaded_meshes.append(info_mesh_tuples)",
"code_tokens": ["class", "MeshLoader", ":", "def", "__init__", "(", "self", ")", ":", """"Module initializer"""", "self", ".", "base_dir", "=", "CONSTANTS", ".", "MESH_BASE_DIR", "self", ".", "text_dir", "=", "CONSTANTS", ".", "TEXT_BASE_DIR", "self", ".", "reset", "(", ")", "def", "reset", "(", "self", ")", ":", "self", ".", "loaded_meshes", "=", "[", "]", "def", "get_meshes", "(", "self", ")", ":", """" """", "extract_singular", "=", "lambda", "x", ":", "x", "[", "0", "]", "if", "len", "(", "x", ")", "==", "1", "else", "x", "return", "[", "extract_singular", "(", "item", ")", "for", "item", "in", "self", ".", "loaded_meshes", "]", "def", "load_meshes", "(", "self", ",", "obj_info", ":", "List", "[", "object_info", ".", "ObjectInfo", "]", ",", "**", "kwargs", ")", ":", """"
    Loads the meshes whose information is given in parameter obj_info.
    Each call of this method APPENDS a list to the loaded_meshes attribute.
    :param obj_info: The object information of the meshes to be loaded.
    :param kwargs: additional mesh modifiers such as scale, specified with a leading mod_
    """", "paths", "=", "[", "]", "for", "obj", "in", "obj_info", ":", "path", "=", "self", ".", "text_dir", "if", "obj", ".", "name", ".", "endswith", "(", ""_floor"", ")", "or", "obj", ".", "name", ".", "endswith", "(", ""_wall"", ")", "else", "self", ".", "base_dir", "paths", ".", "append", "(", "(", "path", "/", "obj", ".", "mesh_fp", ")", ".", "resolve", "(", ")", ")", "scales", "=", "[", "obj", ".", "scale", "for", "obj", "in", "obj_info", "]", "class_ids", "=", "[", "obj", ".", "class_id", "for", "obj", "in", "obj_info", "]", "mod_scales", "=", "kwargs", ".", "get", "(", ""mod_scale"", ",", "[", "1.0", "]", "*", "len", "(", "scales", ")", ")", "scales", "=", "[", "s", "*", "ms", "for", "(", "s", ",", "ms", ")", "in", "zip", "(", "scales", ",", "mod_scales", ")", "]", "flags", "=", "[", "mesh_flags", "(", "obj", ")", "for", "obj", "in", "obj_info", "]", "meshes", "=", "sl", ".", "Mesh", ".", "load_threaded", "(", "filenames", "=", "paths", ",", "flags", "=", "flags", ")", "for", "_", ",", "(", "mesh", ",", "scale", ",", "class_id", ")", "in", "enumerate", "(", "zip", "(", "meshes", ",", "scales", ",", "class_ids", ")", ")", ":", "pt", "=", "torch", ".", "eye", "(", "4", ")", "pt", "[", ":", "3", ",", ":", "3", "]", "*=", "scale", "mesh", ".", "pretransform", "=", "pt", "mesh", ".", "class_index", "=", "class_id", "info_mesh_tuples", "=", "list", "(", "zip", "(", "obj_info", ",", "meshes", ")", ")", "self", ".", "loaded_meshes", ".", "append", "(", "info_mesh_tuples", ")"],
"short_docstring": "Class to load the meshes for the objects in a scene.",
"short_docstring_tokens": ["Class", "to", "load", "the", "meshes", "for", "the", "objects", "in", "a", "scene", "."],
"comment": [""""
Class to load the meshes for the objects in a scene.
"""", """"Module initializer"""", """" """", """"
    Loads the meshes whose information is given in parameter obj_info.
    Each call of this method APPENDS a list to the loaded_meshes attribute.
    :param obj_info: The object information of the meshes to be loaded.
    :param kwargs: additional mesh modifiers such as scale, specified with a leading mod_
    """", "# Setup class IDs"],
"parameters": [],
"docstring_params": {"returns": [], "raises": [], "params": [], "outlier_params": [], "others": []}

}

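Data fields

  • hexsha (string): unique git hash of the file
  • repo (string): owner/repository
  • path (string): full path of the original file
  • license (list): licenses in the repository
  • language (string): programming language
  • identifier (string): function or method name (the class name in the instance above)
  • original_string (string): original version of the function/class node
  • original_docstring (string): raw docstring before tokenization or parsing
  • code (string): the original code
  • code_tokens (list): tokenized version of code
  • short_docstring (string): short summary (first line of the docstring)
  • short_docstring_tokens (list): tokenized version of short_docstring
  • docstring (string): top-level comment or docstring (the docstring without parameter documentation, return, exception fields, and so on)
  • docstring_tokens (list): tokenized version of docstring
  • comment (list): list of comments inside the function/class
  • parameters (list): list of parameters and their types (a type can be None)
  • docstring_params (dict): dictionary of information parsed from the docstring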
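The fields above map fairly directly onto the supported tasks. The following is a minimal sketch, not the authors' pipeline, of how one record could be turned into input/target pairs for code summarization and text-to-code generation. The helper name build_task_pairs and the trimmed example record are hypothetical; only the field names come from the list above.

```python
from typing import Dict, Tuple


def build_task_pairs(record: Dict) -> Dict[str, Tuple[str, str]]:
    """Return (input, target) pairs for two tasks built from one record."""
    docstring = record["docstring"]
    code = record["code"]
    return {
        # Code in, natural-language description out.
        "code_summarization": (code, docstring),
        # Natural-language description in, code out.
        "text_to_code": (docstring, code),
    }


# Hand-written, heavily trimmed record that only mimics the field layout.
example = {
    "identifier": "MeshLoader",
    "language": "Python",
    "docstring": "Class to load the meshes for the objects in a scene.",
    "code": "class MeshLoader:\n    pass",
}

pairs = build_task_pairs(example)
print(pairs["text_to_code"][0])
```

In the data instance shown earlier, the code field still contains the docstring itself, so a real summarization setup would probably strip it from the input so the target does not leak.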

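Data splits

In this repository, the class-level data is not split; it is contained only in the train set.

Dataset statistics

Language   Number of samples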
Python 422,187
Java 4,872,485
JavaScript 291,479
PHP 1,173,916
C# 1,437,800
C++ 174,370
Ruby 353,859
Rust 93,311
C -
Go -
TOTAL 9,121,300
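
The per-language counts can be cross-checked against the TOTAL row with a few lines of Python. The figures below are copied from the table; C and Go are left out because the table lists no count for them. With these numbers the listed rows sum to 8,819,407, which is 301,893 less than the stated total, and the table itself does not say where the difference comes from.

```python
# Sample counts copied from the statistics table (C and Go have no count listed).
counts = {
    "Python": 422_187,
    "Java": 4_872_485,
    "JavaScript": 291_479,
    "PHP": 1_173_916,
    "C#": 1_437_800,
    "C++": 174_370,
    "Ruby": 353_859,
    "Rust": 93_311,
}

listed_total = sum(counts.values())
stated_total = 9_121_300  # TOTAL row of the table

print(listed_total)                 # sum of the listed rows
print(stated_total - listed_total)  # gap between the TOTAL row and that sum
```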

Usage

The Vault can be loaded with the datasets library:

from datasets import load_dataset

# Load the full class-level dataset
dataset = load_dataset("Fsoft-AIC/the-vault-class")

# Load specific languages only (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-class", languages=["Python"])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-class", streaming=True)
for sample in iter(data["train"]):
    print(sample)

Additional information

Licensing information

MIT License

Citation information

@article{manh2023vault, title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation}, author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ}, journal={arXiv preprint arXiv:2305.06156}, year={2023} }

Contributions

This dataset was developed by the FSOFT AI4Code team.

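Curated summary
Dataset introduction
Construction
In code intelligence, large, high-quality parallel corpora underpin progress in code understanding and generation. Fsoft-AIC/the-vault-class is built on that premise. It is derived from The Stack, described here as the largest permissively licensed open-source code corpus. An extraction and cleaning pipeline selects high-quality code-text pairs from 10 mainstream programming languages (Java, Python, C#, PHP, C++, Ruby, Rust, JavaScript, C, and Go). For class-level snippets, the dataset keeps the full metadata and organizes 11 docstring styles so that the data stays rich and consistent in both semantics and structure.
Usage
The dataset loads directly through the Hugging Face datasets library. Install the dependency with pip install datasets, then call load_dataset('Fsoft-AIC/the-vault-class') to get the full class-level data. To restrict loading to specific languages, pass the languages parameter, for example languages=['Python']. For large-scale processing, the dataset also supports streaming: with streaming=True, samples are processed iteratively, which avoids holding the whole dataset in memory. The maintainers additionally provide backup download links hosted on Azure Blob Storage.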
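Streaming pairs naturally with itertools.islice when only a bounded number of samples is needed. The sketch below counts the language field over the first few samples of any iterable of records. The helper names count_languages and stream_train_split are hypothetical, the demo records are hand-written, and whether the load_dataset call works unchanged depends on the installed datasets version and on network access.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable


def count_languages(samples: Iterable[Dict], limit: int) -> Counter:
    """Count the `language` field over at most `limit` samples."""
    return Counter(sample["language"] for sample in islice(samples, limit))


def stream_train_split():
    """Open the train split in streaming mode.

    Requires `pip install datasets` and network access, so it is not
    called in the offline demo below.
    """
    from datasets import load_dataset

    data = load_dataset("Fsoft-AIC/the-vault-class", streaming=True)
    return data["train"]


# Offline demo on hand-written records; only the first two are consumed.
demo = [{"language": "Python"}, {"language": "Java"}, {"language": "Python"}]
demo_counts = count_languages(demo, limit=2)
print(demo_counts)

# With network access, the same helper could be applied to the real stream:
# print(count_languages(stream_train_split(), limit=1000))
```

Because islice stops after limit items, the streamed split is never read to the end.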
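Background and challenges
Background
Code understanding and generation sit at the intersection of natural language processing and programming languages, and they increasingly drive work on intelligent software engineering. The Vault was created in 2023 by the FSOFT AI4Code team at FPT Software's artificial intelligence lab to bridge the semantic gap between large-scale multilingual code and natural-language descriptions. Built on The Stack, it selects more than nine million high-quality code-text pairs across 10 mainstream programming languages, including Java, Python, and JavaScript, and serves as a standardized training and evaluation benchmark for downstream tasks such as code summarization, text-to-code generation, and code search. Its release adds to the research resources for cross-language code intelligence, and its systematic docstring styles and fine-grained metadata support evaluating model robustness.
Current challenges
The Vault faces challenges on two levels: the problem domain and the construction process. On the domain side, although the dataset covers many languages, these languages differ markedly in syntax, commenting conventions, and docstring style, which makes cross-language generalization and semantic alignment hard for models. The choice between function-level and class-level granularity, and the handling of language-specific issues such as implicit typing and dynamic features, also remain open bottlenecks. On the construction side, automatically extracting and cleaning code-docstring pairs from a large number of open-source repositories has to cope with missing docstrings, comments that disagree with the code logic, and redundant noise, while also ensuring license compliance and privacy. This calls for fine-grained filtering strategies and manual verification so that the dataset stays reliable and reproducible.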
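As an illustration of the kind of filtering the construction challenges call for, here is a minimal sketch that drops records whose docstring is missing or too short to be informative. It is not the authors' cleaning pipeline: the threshold MIN_DOCSTRING_TOKENS and the helper names are hypothetical, and the sample records are hand-written apart from the MeshLoader docstring quoted from the data instance.

```python
from typing import Dict, Iterable, Iterator

MIN_DOCSTRING_TOKENS = 3  # hypothetical threshold, chosen only for illustration


def has_usable_docstring(record: Dict, min_tokens: int = MIN_DOCSTRING_TOKENS) -> bool:
    """True if the record has a non-blank docstring with enough tokens."""
    docstring = record.get("docstring") or ""
    tokens = record.get("docstring_tokens") or []
    return bool(docstring.strip()) and len(tokens) >= min_tokens


def filter_records(records: Iterable[Dict]) -> Iterator[Dict]:
    """Lazily yield only the records that pass the docstring check."""
    return (record for record in records if has_usable_docstring(record))


samples = [
    {
        "identifier": "MeshLoader",
        "docstring": "Class to load the meshes for the objects in a scene.",
        "docstring_tokens": ["Class", "to", "load", "the", "meshes", "for",
                             "the", "objects", "in", "a", "scene", "."],
    },
    {"identifier": "Empty", "docstring": " ", "docstring_tokens": []},
    {"identifier": "Terse", "docstring": "TODO", "docstring_tokens": ["TODO"]},
]

kept = [record["identifier"] for record in filter_records(samples)]
print(kept)
```

The generator form keeps the filter usable on a streamed dataset, since it never materializes the full record list.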
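Common scenarios
Typical use cases
The most typical uses of The Vault are code summarization and text-to-code generation. Researchers can use its large set of high-quality code-docstring pairs to train models that generate a natural-language description for a given code snippet, or that generate code from a natural-language description. The dataset covers 10 mainstream programming languages, including Java, Python, and JavaScript, and provides 11 docstring styles, so models can learn to generalize across languages and styles, which improves the practicality and robustness of code understanding and generation.
Academic problems addressed
The dataset addresses two long-standing problems in code-text alignment research: scarce annotated data and limited language coverage. With more than 9 million high-quality code-docstring pairs extracted and cleaned from The Stack, The Vault provides a data foundation for cross-language code semantics, code retrieval, and multi-task joint learning. Its release supports a shift in code intelligence from single-language to multilingual work and from coarse to fine granularity, and it helps research on the transferability of code representation learning and generative models.
Practical applications
In practice, The Vault is used to build intelligent code completion tools, automated documentation generators, and code search engines. For example, developers can train a model on it to generate function-level comments in real time inside an IDE, or to match relevant code snippets to a user's natural-language query. The dataset also supports code understanding tools for lower-resource programming languages such as Rust and Ruby, lowering the barrier to cross-language software maintenance and knowledge transfer.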
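To make the code search scenario concrete, the sketch below ranks records by Jaccard overlap between the query words and each record's docstring_tokens. This is a deliberately simple stand-in for a learned retrieval model, not something the dataset authors describe; the function names are hypothetical and the HttpClient record is made up for contrast.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query: str, records: List[Dict], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k (score, identifier) pairs, best score first."""
    query_tokens = set(query.lower().split())
    scored = [
        (jaccard(query_tokens, {t.lower() for t in r["docstring_tokens"]}), r["identifier"])
        for r in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


records = [
    {
        "identifier": "MeshLoader",
        "docstring_tokens": ["Class", "to", "load", "the", "meshes", "for",
                             "the", "objects", "in", "a", "scene", "."],
    },
    {
        # Made-up record, included only so the ranking has something to reject.
        "identifier": "HttpClient",
        "docstring_tokens": ["Client", "for", "sending", "HTTP", "requests", "."],
    },
]

best = search("load meshes for a scene", records)
print(best[0][1])
```

A practical system would replace the token overlap with embeddings from a model trained on the code-docstring pairs, but the ranking interface would look much the same.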
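Recent research
Latest research directions
In code intelligence, large-scale multilingual code-text alignment datasets have become a foundation for pretrained language models. The Vault selects high-quality code-docstring pairs from The Stack across 10 mainstream programming languages, including Java, Python, and C#, and provides both class-level and function-level data along with 11 docstring styles. This design responds to the need of current large language models for semantic alignment and cross-language generalization in code understanding and generation. As models such as GPT-4 and CodeLlama show their potential in automated programming, code completion, and intelligent debugging, The Vault's structured metadata and fine-grained annotations offer a standardized training benchmark for code summarization, text-to-code generation, and code search, reducing the complexity of multilingual adaptation. Beyond supporting research on code semantics, its MIT license also eases adoption in industrial applications, linking model research with gains in everyday development efficiency.
The content above was collected and summarized by 遇见数据集.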