OGB/ogbg-code2

Hugging Face · Updated 2023-02-07 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/OGB/ogbg-code2

Resource description:

---
license: mit
task_categories:
- graph-ml
---

# Dataset Card for ogbg-code2

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [External Use](#external-use)
  - [PyGeometric](#pygeometric)
- [Dataset Structure](#dataset-structure)
  - [Data Properties](#data-properties)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **[Homepage](https://ogb.stanford.edu/docs/graphprop/#ogbg-code2)**
- **[Repository](https://github.com/snap-stanford/ogb)**
- **Paper:** Open Graph Benchmark: Datasets for Machine Learning on Graphs (see citation)
- **Leaderboard:** [OGB leaderboard](https://ogb.stanford.edu/docs/leader_graphprop/#ogbg-code2) and [Papers with code leaderboard](https://paperswithcode.com/sota/graph-property-prediction-on-ogbg-code2)

### Dataset Summary

The `ogbg-code2` dataset contains Abstract Syntax Trees (ASTs) obtained from about 450 thousand Python method definitions in GitHub CodeSearchNet. "Methods are extracted from a total of 13,587 different repositories across the most popular projects on GitHub." The dataset was prepared by teams at Stanford as part of the Open Graph Benchmark. See their website or paper for details on dataset postprocessing.

### Supported Tasks and Leaderboards

"The task is to predict the sub-tokens forming the method name, given the Python method body represented by AST and its node features. This task is often referred to as “code summarization”, because the model is trained to find succinct and precise description for a complete logical unit."

The score is the F1 score of sub-token prediction.
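
As a rough illustration of the metric, the sketch below computes a sub-token F1 score for one predicted method name and averages it over several methods. It assumes set-based matching of sub-tokens and a per-method average; the function names `subtoken_f1` and `mean_subtoken_f1` are hypothetical, and the OGB evaluator remains the reference for the exact definition.

```python
from typing import Iterable, Sequence


def subtoken_f1(predicted: Iterable[str], reference: Iterable[str]) -> float:
    """F1 between two sub-token collections, compared as sets (assumption)."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        # No overlap (or an empty prediction/reference) scores zero.
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_subtoken_f1(
    predictions: Sequence[Iterable[str]],
    references: Sequence[Iterable[str]],
) -> float:
    """Average the per-method F1 over a batch of methods."""
    scores = [subtoken_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0


# Predicting "get_file_name" when the true name is "get_name":
# precision = 2/3, recall = 1, so F1 = 0.8.
example_score = subtoken_f1(["get", "file", "name"], ["get", "name"])
```

Comparing sets rather than sequences means that sub-token order and repetition do not affect the score in this sketch.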
## External Use

### PyGeometric

To load in PyGeometric, do the following:

```python
from datasets import load_dataset

from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

graphs_dataset = load_dataset("graphs-datasets/ogbg-code2")
# For the train set (replace by valid or test as needed).
# Note: the fields are stored as plain lists; convert them to tensors
# first if your PyGeometric version requires it.
graphs_list = [Data(graph) for graph in graphs_dataset["train"]]
graphs_pygeometric = DataLoader(graphs_list)
```

## Dataset Structure

### Data Properties

| property | value |
|---|---|
| scale | medium |
| #graphs | 452,741 |
| average #nodes | 125.2 |
| average #edges | 124.2 |
| average node degree | 2.0 |
| average cluster coefficient | 0.0 |
| MaxSCC ratio | 1.000 |
| graph diameter | 13.5 |

### Data Fields

Each row of a given file is a graph, with:

- `edge_index` (list: 2 x #edges): pairs of nodes constituting edges
- `edge_feat` (list: #edges x #edge-features): features of edges
- `node_feat` (list: #nodes x #node-features): the node features, embedded
- `node_feat_expanded` (list: #nodes x #node-features): the node features, as code
- `node_is_attributed` (list: 1 x #nodes): undocumented
- `node_dfs_order` (list: #nodes x 1): the order of the nodes in the abstract syntax tree under a depth-first traversal
- `node_depth` (list: #nodes x 1): the depth of each node in the abstract syntax tree
- `y` (list: 1 x #tokens): the tokens to predict as the method name
- `num_nodes` (int): number of nodes of the graph
- `ptr` (list: 2): index of the first and last node of the graph
- `batch` (list: 1 x #nodes): undocumented

### Data Splits

This data comes from the PyGeometric version of the dataset provided by OGB, and follows the provided data splits.
This information can be retrieved using:

```python
from ogb.graphproppred import PygGraphPropPredDataset

dataset = PygGraphPropPredDataset(name='ogbg-code2')

split_idx = dataset.get_idx_split()
train = dataset[split_idx['train']]  # use 'valid' or 'test' for the other splits
```

More information (`node_feat_expanded`) has been added through the typeidx2type and attridx2attr csv files of the repo.

## Additional Information

### Licensing Information

The dataset has been released under the MIT license.

### Citation Information

```
@inproceedings{hu-etal-2020-open,
  author    = {Weihua Hu and Matthias Fey and Marinka Zitnik and Yuxiao Dong and Hongyu Ren and Bowen Liu and Michele Catasta and Jure Leskovec},
  editor    = {Hugo Larochelle and Marc Aurelio Ranzato and Raia Hadsell and Maria{-}Florina Balcan and Hsuan{-}Tien Lin},
  title     = {Open Graph Benchmark: Datasets for Machine Learning on Graphs},
  booktitle = {Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual},
  year      = {2020},
  url       = {https://proceedings.neurips.cc/paper/2020/hash/fb60d411a5c5b72b2e7d3527cfc84fd0-Abstract.html},
}
```

### Contributions

Thanks to [@clefourrier](https://github.com/clefourrier) for adding this dataset.

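Provider:

OGB

Summary of source information

Dataset overview

Dataset name

ogbg-code2

Dataset source

Abstract syntax trees (ASTs) of roughly 450,000 Python method definitions, extracted from GitHub CodeSearchNet by a team at Stanford.

Dataset purpose

Machine learning tasks, in particular graph machine learning (graph-ml).

Dataset task

Given a Python method body represented as an AST together with its node features, predict the sub-tokens that form the method name. This task is often referred to as "code summarization".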
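
To make "sub-tokens of a method name" concrete, here is a minimal sketch that splits a name on underscores and camelCase boundaries. The splitting rule and the helper name `split_subtokens` are assumptions for illustration only; the tokenization actually used when the dataset was built is described in the OGB documentation and may differ.

```python
import re
from typing import List

# Matches, in order: an acronym run ("HTTP"), a capitalized or lowercase
# word ("Response", "parse"), or a run of digits ("2").
_SUBTOKEN_PATTERN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_subtokens(method_name: str) -> List[str]:
    """Split a method name into lowercase sub-tokens (illustrative rule)."""
    subtokens: List[str] = []
    for chunk in method_name.split("_"):
        # Empty chunks (from leading or doubled underscores) yield nothing.
        subtokens.extend(_SUBTOKEN_PATTERN.findall(chunk))
    return [token.lower() for token in subtokens]


snake_case = split_subtokens("get_file_name")
camel_case = split_subtokens("parseHTTPResponse")
dunder = split_subtokens("__init__")
```

Under this rule, `get_file_name` yields the three sub-tokens `get`, `file`, and `name`, which together form the prediction target for that method.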

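Evaluation metric

F1 score of sub-token prediction.

Dataset structure

Data properties

Scale: medium
Number of graphs: 452,741
Average number of nodes: 125.2
Average number of edges: 124.2
Average node degree: 2.0
Average clustering coefficient: 0.0
MaxSCC ratio (largest strongly connected component): 1.000
Graph diameter: 13.5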
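
These figures are consistent with each graph being a tree. The short check below uses only the averages listed above; it is an approximation, since a ratio of averages is not the same as an average of per-graph ratios.

```python
# Averages copied from the data properties above.
avg_nodes = 125.2
avg_edges = 124.2

# A tree with n nodes has n - 1 edges, so the gap should be about 1.
node_edge_gap = avg_nodes - avg_edges

# Treating each edge as touching two nodes, the mean degree is 2 * E / N.
approx_avg_degree = 2 * avg_edges / avg_nodes

print(f"nodes - edges = {node_edge_gap:.1f}")
print(f"approx. average degree = {approx_avg_degree:.2f}")
```

A tree contains no triangles, which also matches the reported average clustering coefficient of 0.0.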

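Data fields

edge_index: edge indices
edge_feat: edge features
node_feat: node features
node_feat_expanded: expanded node features
node_is_attributed: flag indicating whether a node is attributed
node_dfs_order: depth-first search order of the nodes
node_depth: node depth
y: tokens to predict
num_nodes: number of nodes
ptr: node range pointer
batch: batch information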
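
To show how several of these fields relate to one another, the sketch below builds a toy graph row from a small function using Python's standard `ast` module. It is not the OGB preprocessing pipeline: the helper `ast_to_graph`, the `node_type` key, and the parent-to-child edge direction are assumptions made for illustration.

```python
import ast
from typing import Dict, List


def ast_to_graph(source: str) -> Dict[str, object]:
    """Convert Python source into a toy row with tree-shaped fields."""
    tree = ast.parse(source)
    edge_sources: List[int] = []   # first row of edge_index (parents)
    edge_targets: List[int] = []   # second row of edge_index (children)
    node_type: List[str] = []      # hypothetical stand-in for node features
    node_depth: List[int] = []

    def visit(node: ast.AST, depth: int) -> int:
        index = len(node_type)     # pre-order position, i.e. DFS order
        node_type.append(type(node).__name__)
        node_depth.append(depth)
        for child in ast.iter_child_nodes(node):
            child_index = visit(child, depth + 1)
            edge_sources.append(index)
            edge_targets.append(child_index)
        return index

    visit(tree, 0)
    num_nodes = len(node_type)
    return {
        "edge_index": [edge_sources, edge_targets],       # 2 x #edges
        "node_type": node_type,
        "node_depth": [[d] for d in node_depth],          # #nodes x 1
        "node_dfs_order": [[i] for i in range(num_nodes)],  # #nodes x 1
        "num_nodes": num_nodes,
    }


row = ast_to_graph("def add(a, b):\n    return a + b\n")
num_edges = len(row["edge_index"][0])
```

Because every node except the root has exactly one parent, `num_edges` equals `row["num_nodes"] - 1`, the same tree relationship visible in the dataset's average node and edge counts.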

Data splits

Follows the data splits provided with the PyGeometric version of the dataset.

License information

MIT license

Contributors

@clefourrier

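Collected summary

Dataset introduction

Construction

In the field of code intelligence, the ogbg-code2 dataset was built in a systematic way. It is derived from GitHub CodeSearchNet, with abstract syntax trees (ASTs) extracted from more than 450,000 Python method definitions. These methods span 13,587 different repositories drawn from the most popular open-source projects on GitHub. The data was processed by a Stanford University team so that each AST is structurally complete and carries node features, and was then integrated into the Open Graph Benchmark as a standardized resource for graph machine learning tasks.

Characteristics

The dataset contains 452,741 graphs with an average of 125.2 nodes and 124.2 edges each, which makes it medium-scale and topologically sparse. Each graph corresponds to one Python method body. Node features are available both in embedded form and in their original code form, while edge features describe syntactic relations. The task is code summarization: predicting the sub-token sequence of the method name from the AST. The evaluation metric is the F1 score of sub-token prediction, which gives research on code generation and understanding a quantitative benchmark.

Usage

The dataset can be loaded through several frameworks. With PyGeometric, users load the data through the `datasets` library and wrap it in a graph data loader. The dataset follows the original splits, and the indices of the training, validation, and test sets can be obtained through the `ogb.graphproppred` module. The expanded node features come from the accompanying mapping files and support training and evaluating graph neural networks for code representation learning and automatic summarization.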

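Background and challenges

Background overview

At the intersection of artificial intelligence and software engineering, code representation learning has become an active research topic that aims to have machine learning models understand program semantics. The ogbg-code2 dataset was built by a Stanford University research team in 2020 as part of the Open Graph Benchmark and focuses on the code summarization task. It extracts more than 450,000 Python method definitions from GitHub CodeSearchNet and converts them into abstract syntax tree (AST) graphs. The core research problem is predicting the sub-token sequence of a method name, which supports work on automatic code generation, documentation, program understanding, and intelligent programming assistants.

Current challenges

The challenges around ogbg-code2 fall into two groups. First, at the problem level, code summarization has to bridge the semantic gap between natural language and a formal programming language: a model must capture both node features and structural information in the AST to produce a concise, well-formed method name, which places high demands on the representational power of graph neural networks. Second, at the construction level, extracting and cleaning method definitions from large open-source code bases, building a uniform AST graph representation, and designing suitable node and edge features all involve substantial engineering and data standardization work, so ensuring data quality and consistency is a key difficulty.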

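Common scenarios

Typical use case

At the intersection of programming language processing and graph machine learning, ogbg-code2 serves as a standard example of the code summarization task thanks to its structured, AST-based representation. The dataset turns Python method bodies into graphs in which nodes correspond to code elements and edges reflect syntactic relations, so researchers can train models to generate concise method names from complex code logic. This use case deepens the understanding of code semantics and encourages the application of graph neural networks in software engineering.

Academic problems addressed

The dataset addresses a central problem in automatic code summarization: how to extract semantic information from unstructured source code and produce a human-readable summary. By providing large-scale labeled data, it lets models learn the mapping between code syntax and function, which supports research on program understanding, code generation, and software maintenance. Its significance lies in establishing a standardized benchmark for evaluating graph neural networks on complex structured data, which in turn speeds up the development of intelligent programming assistants.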

Derived and related work

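A range of work builds on ogbg-code2, including code summarization models that use graph attention mechanisms, program understanding methods within multi-task learning frameworks, and studies of cross-language code transfer. This work further explores the potential of graph neural networks for code representation learning, sits alongside pretrained code models such as CodeBERT and GraphCodeBERT, and has produced steady iteration and performance gains on public leaderboards such as OGB.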

The content above was collected and summarized by 遇见数据集.
