kgbench: mdgenre and mdgender

Indexed from NIAID Data Ecosystem on 2026-03-12

Download link:

https://zenodo.org/record/4361794

Description:

Graph neural networks and other machine learning models offer a promising direction for interpretable machine learning on relational and multimodal data. Until now, however, progress in this area has been difficult to gauge. This is primarily due to the limited number of datasets with (a) enough labeled nodes in the test set to measure performance precisely, and (b) a rich enough variety of multimodal information to learn from. Here, we introduce a set of new benchmark tasks for node classification on knowledge graphs. We focus primarily on node classification, since this setting cannot be solved purely by node embedding models, instead requiring the model to pool information from several steps away in the graph. However, the datasets may also be used for link prediction. For each dataset, we provide test and validation sets of at least 1,000 instances, with some containing more than 10,000 instances. Each task can be performed in a purely relational manner, to evaluate the performance of a relational graph model in isolation, or with multimodal information, to evaluate the performance of multimodal relational graph models. All datasets are packaged in a CSV format that is easily consumable in any machine learning environment, together with the original source data in RDF and pre-processing code for full provenance. We provide code for loading the data into NumPy and PyTorch. We compute performance for several baseline models.
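As a rough illustration of what consuming integer-encoded triples from CSV and pooling information "from several steps away" might look like, here is a minimal sketch using only the Python standard library. The column layout (subject, relation, object and node, label), the header rows, and the sample values are all assumptions made for this example; the description above does not specify the actual file names or schema, and this is not the loading code shipped with the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data. The real archive's file names and column layout
# are not given in the description, so this layout is an assumption.
TRIPLES_CSV = """subject,relation,object
0,0,1
1,1,2
2,0,3
4,1,0
"""

LABELS_CSV = """node,label
0,1
3,0
"""


def read_int_rows(text):
    """Parse CSV text with a header row into a list of integer tuples."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [tuple(int(cell) for cell in row) for row in reader if row]


def build_adjacency(triples):
    """Map each node to its neighbors, ignoring edge direction and relation type."""
    adjacency = defaultdict(set)
    for subject, _relation, obj in triples:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return adjacency


def k_hop_neighborhood(adjacency, start, k):
    """Return every node reachable from `start` in at most k steps, including `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(k):
        # Expand one step outward and keep only nodes not visited before.
        frontier = {n for node in frontier for n in adjacency[node]} - seen
        seen |= frontier
    return seen


triples = read_int_rows(TRIPLES_CSV)
labels = dict(read_int_rows(LABELS_CSV))  # node id -> class label
adjacency = build_adjacency(triples)
```

In this toy graph, a classifier for node 0 that only looks one step away sees nodes 1 and 4, while node 3 enters the neighborhood only at three steps. The resulting list of tuples can be handed to `numpy.array` or `torch.tensor` if an array representation is preferred.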

Created:

2020-12-21
