Fsoft-AIC/the-vault-class
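Dataset Overview
Dataset Description
The Vault is a comprehensive, large-scale, multilingual parallel dataset of high-quality code-text pairs derived from The Stack, the largest permissively licensed source code dataset.
Dataset Summary
The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. It provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.
Supported Tasks
The Vault can be used for pretraining large language models or for downstream code-text interaction tasks. A number of tasks related to code understanding and generation can be built on it, such as code summarization, text-to-code generation, and code search.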
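As a rough illustration of how one record could feed the three tasks named above, the sketch below maps a record to (input, target) pairs. The helper name `to_task_pairs` and the task keys are hypothetical, and the record is abridged. Note that the `code` field of a class-level record may still contain the docstring itself, so a real summarization pipeline would probably strip it first.

```python
from typing import Dict, Tuple


def to_task_pairs(record: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Map one record to (input, target) pairs for three tasks.

    Hypothetical helper; only the "code" and "docstring" fields are used.
    """
    code = record["code"]
    doc = record["docstring"]
    return {
        "code_summarization": (code, doc),  # code in, description out
        "text_to_code": (doc, code),        # description in, code out
        "code_search": (doc, code),         # query, relevant snippet
    }


# Abridged stand-in record (the real "code" field holds the full class).
record = {
    "code": "class MeshLoader: ...",
    "docstring": "Class to load the meshes for the objects in a scene.",
}

pairs = to_task_pairs(record)
print(pairs["code_summarization"][1])
```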
Languages
The natural language text (docstrings) is in English.
The Vault supports 10 programming languages: Python, Java, JavaScript, PHP, C, C#, C++, Go, Ruby, Rust
Dataset Structure
Data Instances
```json
{
  "hexsha": "78b961a6673ec1e12f8d95c33ef081f75561a87c",
  "repo": "AIS-Bonn/sl-cutscenes",
  "path": "sl_cutscenes/object_models.py",
  "license": ["MIT"],
  "language": "Python",
  "identifier": "MeshLoader",
  "original_docstring": "\n    Class to load the meshes for the objects in a scene.\n    ",
  "docstring": "Class to load the meshes for the objects in a scene.",
  "docstring_tokens": ["Class", "to", "load", "the", "meshes", "for", "the", "objects", "in", "a", "scene", "."],
  "code": "class MeshLoader:\n    \"\"\"\n    Class to load the meshes for the objects in a scene.\n    \"\"\"\n    def __init__(self):\n        \"\"\"Module initializer\"\"\"\n        self.base_dir = CONSTANTS.MESH_BASE_DIR\n        self.text_dir = CONSTANTS.TEXT_BASE_DIR\n        self.reset()\n    def reset(self):\n        self.loaded_meshes = []\n    def get_meshes(self):\n        \"\"\" \"\"\"\n        extract_singular = lambda x: x[0] if len(x) == 1 else x\n        return [extract_singular(item) for item in self.loaded_meshes]\n    def load_meshes(self, obj_info: List[object_info.ObjectInfo], **kwargs):\n        \"\"\"\n        Loads the meshes whose information is given in parameter obj_info.\n        Each call of this method APPENDS a list to the loaded_meshes attribute.\n        :param obj_info: The object information of the meshes to be loaded.\n        :param kwargs: additional mesh modifiers such as scale, specified with a leading mod_\n        \"\"\"\n        paths = []\n        for obj in obj_info:\n            path = self.text_dir if obj.name.endswith(\"_floor\") or obj.name.endswith(\"_wall\") else self.base_dir\n            paths.append((path / obj.mesh_fp).resolve())\n        scales = [obj.scale for obj in obj_info]\n        class_ids = [obj.class_id for obj in obj_info]\n        mod_scales = kwargs.get(\"mod_scale\", [1.0] * len(scales))\n        scales = [s * ms for (s, ms) in zip(scales, mod_scales)]\n        flags = [mesh_flags(obj) for obj in obj_info]\n        meshes = sl.Mesh.load_threaded(filenames=paths, flags=flags)\n        # Setup class IDs\n        for _, (mesh, scale, class_id) in enumerate(zip(meshes, scales, class_ids)):\n            pt = torch.eye(4)\n            pt[:3, :3] *= scale\n            mesh.pretransform = pt\n            mesh.class_index = class_id\n        info_mesh_tuples = list(zip(obj_info, meshes))\n        self.loaded_meshes.append(info_mesh_tuples)",
  "code_tokens": ["class", "MeshLoader", ":", "def", "__init__", "(", "self", ")", ":", "\"\"\"Module initializer\"\"\"", "self", ".", "base_dir", "=", "CONSTANTS", ".", "MESH_BASE_DIR", "self", ".", "text_dir", "=", "CONSTANTS", ".", "TEXT_BASE_DIR", "self", ".", "reset", "(", ")", "def", "reset", "(", "self", ")", ":", "self", ".", "loaded_meshes", "=", "[", "]", "def", "get_meshes", "(", "self", ")", ":", "\"\"\" \"\"\"", "extract_singular", "=", "lambda", "x", ":", "x", "[", "0", "]", "if", "len", "(", "x", ")", "==", "1", "else", "x", "return", "[", "extract_singular", "(", "item", ")", "for", "item", "in", "self", ".", "loaded_meshes", "]", "def", "load_meshes", "(", "self", ",", "obj_info", ":", "List", "[", "object_info", ".", "ObjectInfo", "]", ",", "**", "kwargs", ")", ":", "\"\"\"\n        Loads the meshes whose information is given in parameter obj_info.\n        Each call of this method APPENDS a list to the loaded_meshes attribute.\n        :param obj_info: The object information of the meshes to be loaded.\n        :param kwargs: additional mesh modifiers such as scale, specified with a leading mod_\n        \"\"\"", "paths", "=", "[", "]", "for", "obj", "in", "obj_info", ":", "path", "=", "self", ".", "text_dir", "if", "obj", ".", "name", ".", "endswith", "(", "\"_floor\"", ")", "or", "obj", ".", "name", ".", "endswith", "(", "\"_wall\"", ")", "else", "self", ".", "base_dir", "paths", ".", "append", "(", "(", "path", "/", "obj", ".", "mesh_fp", ")", ".", "resolve", "(", ")", ")", "scales", "=", "[", "obj", ".", "scale", "for", "obj", "in", "obj_info", "]", "class_ids", "=", "[", "obj", ".", "class_id", "for", "obj", "in", "obj_info", "]", "mod_scales", "=", "kwargs", ".", "get", "(", "\"mod_scale\"", ",", "[", "1.0", "]", "*", "len", "(", "scales", ")", ")", "scales", "=", "[", "s", "*", "ms", "for", "(", "s", ",", "ms", ")", "in", "zip", "(", "scales", ",", "mod_scales", ")", "]", "flags", "=", "[", "mesh_flags", "(", "obj", ")", "for", "obj", "in", "obj_info", "]", "meshes", "=", "sl", ".", "Mesh", ".", "load_threaded", "(", "filenames", "=", "paths", ",", "flags", "=", "flags", ")", "for", "_", ",", "(", "mesh", ",", "scale", ",", "class_id", ")", "in", "enumerate", "(", "zip", "(", "meshes", ",", "scales", ",", "class_ids", ")", ")", ":", "pt", "=", "torch", ".", "eye", "(", "4", ")", "pt", "[", ":", "3", ",", ":", "3", "]", "*=", "scale", "mesh", ".", "pretransform", "=", "pt", "mesh", ".", "class_index", "=", "class_id", "info_mesh_tuples", "=", "list", "(", "zip", "(", "obj_info", ",", "meshes", ")", ")", "self", ".", "loaded_meshes", ".", "append", "(", "info_mesh_tuples", ")"],
  "short_docstring": "Class to load the meshes for the objects in a scene.",
  "short_docstring_tokens": ["Class", "to", "load", "the", "meshes", "for", "the", "objects", "in", "a", "scene", "."],
  "comment": ["\"\"\"\n    Class to load the meshes for the objects in a scene.\n    \"\"\"", "\"\"\"Module initializer\"\"\"", "\"\"\" \"\"\"", "\"\"\"\n        Loads the meshes whose information is given in parameter obj_info.\n        Each call of this method APPENDS a list to the loaded_meshes attribute.\n        :param obj_info: The object information of the meshes to be loaded.\n        :param kwargs: additional mesh modifiers such as scale, specified with a leading mod_\n        \"\"\"", "# Setup class IDs"],
  "parameters": [],
  "docstring_params": {"returns": [], "raises": [], "params": [], "outlier_params": [], "others": []}
}
```
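To make the relationship between the docstring fields in this instance concrete, here is a minimal sketch over an abridged copy of the record. The `detokenize` helper is hypothetical and uses a naive punctuation rule that happens to round-trip this example; it is not the tokenizer used to build the dataset.

```python
# Abridged copy of the instance above (only the docstring-related fields).
record = {
    "identifier": "MeshLoader",
    "docstring": "Class to load the meshes for the objects in a scene.",
    "docstring_tokens": ["Class", "to", "load", "the", "meshes", "for",
                         "the", "objects", "in", "a", "scene", "."],
    "short_docstring": "Class to load the meshes for the objects in a scene.",
}

PUNCTUATION = {".", ",", ":", ";", "!", "?"}


def detokenize(tokens):
    """Join tokens with spaces, gluing punctuation to the previous token.

    Hypothetical helper: a naive rule for illustration only.
    """
    text = ""
    for token in tokens:
        if text and token not in PUNCTUATION:
            text += " "
        text += token
    return text


# The token list spells out the cleaned docstring.
rebuilt = detokenize(record["docstring_tokens"])
print(rebuilt == record["docstring"])

# short_docstring is the first line of the docstring.
first_line = record["docstring"].splitlines()[0]
print(first_line == record["short_docstring"])
```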
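Data Fields
- hexsha (string): the unique git hash of the file
- repo (string): the owner/repository
- path (string): the full path to the original file
- license (list): the licenses in the repository
- language (string): the programming language
- identifier (string): the function or method name (in this class-level set, the class name, e.g. MeshLoader)
- original_string (string): the original version of the function/class node
- original_docstring (string): the raw string before tokenization or parsing
- code (string): the code part of the original
- code_tokens (list): the tokenized version of code
- short_docstring (string): a short summary (the first line of the docstring)
- short_docstring_tokens (list): the tokenized version of short_docstring
- docstring (string): the top-level comment or docstring (the docstring without parameter documentation, return, exception fields, etc.)
- docstring_tokens (list): the tokenized version of docstring
- comment (list): the list of comments inside the function/class
- parameters (list): the list of parameters and their types (a type can be None)
- docstring_params (dict): a dictionary of the information parsed from the docstring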
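The fields above lend themselves to simple record-level filtering. The following is a minimal sketch, assuming you want to keep only records whose licenses all fall inside an allow-list and whose docstring has a minimum number of tokens. The function name `filter_records`, the threshold, and the toy records are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def filter_records(
    records: Iterable[Dict[str, Any]],
    allowed_licenses: Set[str],
    min_doc_tokens: int = 3,
) -> Iterator[Dict[str, Any]]:
    """Yield records that pass a license check and a docstring-length check."""
    for rec in records:
        licenses = set(rec["license"])
        # Require at least one license, and every listed license to be allowed.
        if not licenses or not licenses <= allowed_licenses:
            continue
        # Drop records whose docstring is too short to be informative.
        if len(rec["docstring_tokens"]) < min_doc_tokens:
            continue
        yield rec


# Toy records carrying only the two fields the filter reads.
records = [
    {"license": ["MIT"], "docstring_tokens": ["Class", "to", "load", "meshes", "."]},
    {"license": ["GPL-3.0"], "docstring_tokens": ["Parses", "a", "config", "file", "."]},
    {"license": ["MIT"], "docstring_tokens": ["TODO"]},
]

kept = list(filter_records(records, allowed_licenses={"MIT", "Apache-2.0"}))
print(len(kept))
```

Because the function is a generator, it can also wrap a streamed dataset without loading everything into memory.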
Data Splits
In this repository, the class-level data is not split; all of it is in the train set.
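Since only a train set is provided, anyone who needs held-out data has to derive it. One possible approach, sketched below, is to hash the repo field so that all classes from one repository land in the same split. This is an assumption on our part, not a split defined by the dataset authors; the function name `assign_split` and the percentages are hypothetical.

```python
import hashlib


def assign_split(repo: str, valid_pct: int = 5, test_pct: int = 5) -> str:
    """Deterministically map a repository name to a split name.

    Hashing the repo (rather than the individual sample) keeps every class
    from the same repository in the same split, which limits leakage.
    """
    digest = hashlib.sha256(repo.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"


split = assign_split("AIS-Bonn/sl-cutscenes")
print(split)
```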
Dataset Statistics
| Language | Number of samples |
|---|---|
| Python | 422,187 |
| Java | 4,872,485 |
| JavaScript | 291,479 |
| PHP | 1,173,916 |
| C# | 1,437,800 |
| C++ | 174,370 |
| Ruby | 353,859 |
| Rust | 93,311 |
| C | - |
| Go | - |
| TOTAL | 9,121,300 |
Usage
The Vault can be loaded with the datasets library:
```python
from datasets import load_dataset

# Load the full class-level dataset
dataset = load_dataset("Fsoft-AIC/the-vault-class")

# Load a specific language (e.g. Python)
dataset = load_dataset("Fsoft-AIC/the-vault-class", languages=["Python"])

# Stream the dataset instead of downloading it in full
data = load_dataset("Fsoft-AIC/the-vault-class", streaming=True)
for sample in iter(data["train"]):
    print(sample)
```
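Printing every streamed sample is rarely useful for a first look, since each record carries the full code and token lists. A minimal sketch of a lighter alternative: take the first few samples with `itertools.islice` and keep only a handful of fields. The helper name `peek` is hypothetical, and the generator below is a stand-in for the real stream so that the block runs offline.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List, Sequence


def peek(
    samples: Iterable[Dict[str, Any]],
    n: int = 3,
    keys: Sequence[str] = ("repo", "path", "identifier"),
) -> List[Dict[str, Any]]:
    """Return the first n samples, reduced to the given keys."""
    return [{key: sample.get(key) for key in keys} for sample in islice(samples, n)]


def fake_stream():
    """Stand-in for a streamed split; yields small made-up records forever."""
    index = 0
    while True:
        yield {
            "repo": f"owner/repo-{index}",
            "path": f"src/module_{index}.py",
            "identifier": f"Class{index}",
            "code": "class ...",
        }
        index += 1


preview = peek(fake_stream(), n=2)
for row in preview:
    print(row)

# With the real streaming dataset, the call would look like:
#     peek(data["train"], n=5)
```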
Additional Information
Licensing Information
MIT License
Citation Information
@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}
Contributions
This dataset was developed by the FSOFT AI4Code team.




