auphong2707/codegr-vault
Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/auphong2707/codegr-vault
Description:
This is a multilingual code dataset containing code snippets in three programming languages: C++, Python, and Ruby. Each entry has the following fields: hexsha (hash of the code), repo (source repository), path (file path), identifier, parameters (a list of parameters, each with a name and a type), language (programming language), numeric_id (numeric ID), semantic_id (semantic ID), structure_id (structure ID), url_based_id (URL-based ID), text (code text), and row_type (row type). Each language configuration is split into a training set, used for model training, and a test set, used for evaluation. The C++ configuration has 170,776 training examples and 15,916 test examples; the Python configuration has 595,408 training examples and 14,742 test examples; the Ruby configuration has 63,703 training examples and 18,637 test examples. The dataset is suited to code analysis and to natural language processing tasks on code, such as code generation, code understanding, or machine translation.
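The split sizes above can be summarized with a short Python sketch. The counts are copied from the description. The helper names (`total_examples`, `held_out_fraction`, `load_split`) are hypothetical and introduced only for illustration. The loader assumes the third-party `datasets` library is installed; the configuration name is left as a parameter because the listing does not state the exact configuration names, so check them on the dataset page first.

```python
# Split sizes as stated in the description (language label -> example counts).
# The keys are display labels, not necessarily the dataset's configuration names.
SPLIT_SIZES = {
    "C++": {"train": 170_776, "test": 15_916},
    "Python": {"train": 595_408, "test": 14_742},
    "Ruby": {"train": 63_703, "test": 18_637},
}


def total_examples(split):
    """Sum the number of examples in one split across all three languages."""
    return sum(sizes[split] for sizes in SPLIT_SIZES.values())


def held_out_fraction(language):
    """Share of a language's examples that fall in the test split."""
    sizes = SPLIT_SIZES[language]
    return sizes["test"] / (sizes["train"] + sizes["test"])


def load_split(config_name, split="train"):
    """Load one split of one configuration from the Hugging Face Hub.

    Assumption: `config_name` must match a configuration name listed on the
    dataset page; this listing does not give the exact names.
    Requires network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily; third-party package

    return load_dataset("auphong2707/codegr-vault", config_name, split=split)


if __name__ == "__main__":
    for language in SPLIT_SIZES:
        print(f"{language}: test share = {held_out_fraction(language):.1%}")
    print("train total:", total_examples("train"))  # 829887
    print("test total:", total_examples("test"))    # 49295
```

The test share differs considerably between languages (Ruby holds out a much larger proportion than Python), which may matter when comparing evaluation results across configurations.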
Provider:
auphong2707



