graphs-datasets/AIDS

Name: graphs-datasets/AIDS
Creator: graphs-datasets
Published: 2023-02-07 16:38:52
License: no description available

Hugging Face, updated 2023-02-07, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/graphs-datasets/AIDS

Resource description:

---
licence: unknown
task_categories:
- graph-ml
---

# Dataset Card for AIDS

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [External Use](#external-use)
  - [PyGeometric](#pygeometric)
- [Dataset Structure](#dataset-structure)
  - [Data Properties](#data-properties)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description
- **[Homepage](https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data)**
- **Paper:** (see citation)
- **Leaderboard:** [Papers with code leaderboard](https://paperswithcode.com/sota/graph-classification-on-aids)

### Dataset Summary
The `AIDS` dataset contains compounds checked for evidence of anti-HIV activity.

### Supported Tasks and Leaderboards
`AIDS` should be used for molecular classification, a binary classification task. The score used is accuracy, measured with cross-validation.

## External Use
### PyGeometric
To load in PyGeometric, do the following:

```python
import torch
from datasets import load_dataset
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader

dataset_hf = load_dataset("graphs-datasets/AIDS")

def to_data(graph):
    # Data takes tensors as keyword arguments, not a raw dict.
    return Data(x=torch.tensor(graph["node_feat"], dtype=torch.float),
                edge_index=torch.tensor(graph["edge_index"], dtype=torch.long),
                edge_attr=torch.tensor(graph["edge_attr"], dtype=torch.float),
                y=torch.tensor(graph["y"]), num_nodes=graph["num_nodes"])

# For the train set (replace by valid or test as needed)
dataset_pg_list = [to_data(graph) for graph in dataset_hf["train"]]
dataset_pg = DataLoader(dataset_pg_list)
```

## Dataset Structure

### Data Properties

| property | value |
|---|---|
| scale | medium |
| #graphs | 1999 |
| average #nodes | 15.5875 |
| average #edges | 32.39 |

### Data Fields

Each row of a given file is a graph, with:
- `node_feat` (list: #nodes x #node-features): node features
- `edge_index` (list: 2 x #edges): pairs of nodes constituting edges
- `edge_attr` (list: #edges x #edge-features): features of the aforementioned edges
- `y` (list: 1 x #labels): the labels to predict (here a single label, equal to zero or one)
- `num_nodes` (int): number of nodes of the graph

### Data Splits

This data is not split, and should be used with cross-validation. It comes from the PyGeometric version of the dataset.

## Additional Information

### Licensing Information
The license under which the dataset has been released is unknown.

### Citation Information
```
@inproceedings{Morris+2020,
  title={TUDataset: A collection of benchmark datasets for learning with graphs},
  author={Christopher Morris and Nils M. Kriege and Franka Bause and Kristian Kersting and Petra Mutzel and Marion Neumann},
  booktitle={ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020)},
  archivePrefix={arXiv},
  eprint={2007.08663},
  url={www.graphlearning.io},
  year={2020}
}
```

```
@InProceedings{10.1007/978-3-540-89689-0_33,
  author="Riesen, Kaspar and Bunke, Horst",
  editor="da Vitoria Lobo, Niels and Kasparis, Takis and Roli, Fabio and Kwok, James T. and Georgiopoulos, Michael and Anagnostopoulos, Georgios C. and Loog, Marco",
  title="IAM Graph Database Repository for Graph Based Pattern Recognition and Machine Learning",
  booktitle="Structural, Syntactic, and Statistical Pattern Recognition",
  year="2008",
  publisher="Springer Berlin Heidelberg",
  address="Berlin, Heidelberg",
  pages="287--297",
  abstract="In recent years the use of graph based representation has gained popularity in pattern recognition and machine learning. As a matter of fact, object representation by means of graphs has a number of advantages over feature vectors. Therefore, various algorithms for graph based machine learning have been proposed in the literature. However, in contrast with the emerging interest in graph based representation, a lack of standardized graph data sets for benchmarking can be observed. Common practice is that researchers use their own data sets, and this behavior cumbers the objective evaluation of the proposed methods. In order to make the different approaches in graph based machine learning better comparable, the present paper aims at introducing a repository of graph data sets and corresponding benchmarks, covering a wide spectrum of different applications.",
  isbn="978-3-540-89689-0"
}
```

Provider:

graphs-datasets

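Collected summary

Dataset introduction

Construction

The AIDS dataset is built from compounds screened for anti-HIV activity and is intended for molecular classification. It contains 1999 graphs, each representing one compound, with nodes standing for atoms and edges for chemical bonds. Graphs average 15.5875 nodes and 32.39 edges. The dataset gives researchers a standardized benchmark for evaluating and comparing graph classification algorithms.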
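
As a rough illustration of how the averages quoted above could be recomputed, the sketch below works on records shaped like the card's fields (`num_nodes`, `edge_index`). The helper name `graph_stats` and the two toy records are made up for this example and are not taken from the dataset; the code assumes `edge_index` is stored as a 2 x #edges list, as the card's field list states.

```python
from statistics import mean


def graph_stats(records):
    """Return (average #nodes, average #edges) over a list of graph records."""
    avg_nodes = mean(r["num_nodes"] for r in records)
    # edge_index is a 2 x #edges list, so the length of one row is the edge count.
    avg_edges = mean(len(r["edge_index"][0]) for r in records)
    return avg_nodes, avg_edges


# Hypothetical toy records, not real AIDS graphs.
toy_records = [
    {"num_nodes": 3, "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]},
    {"num_nodes": 2, "edge_index": [[0, 1], [1, 0]]},
]

avg_nodes, avg_edges = graph_stats(toy_records)
print(avg_nodes, avg_edges)
```

Each pair in the toy `edge_index` appears in both directions, so a single bond counts as two edges here; whether the published figure of 32.39 counts edges the same way is not stated in the card.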

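Features

The AIDS dataset targets a single molecular classification task: a binary problem of deciding whether a compound shows anti-HIV activity. Each graph carries node features, an edge index, edge features, and a label, which together form the input to models such as graph neural networks. The dataset ships without predefined training, validation, and test splits, so cross-validation is the recommended way to evaluate models.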
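
Since no split is provided, one possible way to build cross-validation folds is a stratified k-fold over the graph labels. The following is a minimal standard-library sketch; the helper names, the value of `k`, the seed, and the toy labels are assumptions for illustration and are not prescribed by the dataset card.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Partition indices into k folds with roughly equal class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's shuffled indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = stratified_folds(labels, k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical toy labels: 20 graphs of class 0 and 10 of class 1.
toy_labels = [0] * 20 + [1] * 10
splits = list(cross_validation_splits(toy_labels, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Stratifying keeps the ratio of active to inactive compounds similar in every fold, which matters when the two classes are not equally frequent.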

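Usage

The AIDS dataset can be loaded and processed with graph neural network frameworks such as PyGeometric. Each graph can be converted to the framework's format with the `torch_geometric.data.Data` class and then batched through `DataLoader`. The labels are used to train and evaluate models, while the node and edge features describe each molecule's structure.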
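
Before converting a record, it can help to check that its fields agree with the shapes listed in the dataset card. The sketch below is one way to do that with plain Python lists; `check_graph_record` and the toy record are hypothetical and only follow the field names given in the card (`node_feat`, `edge_index`, `edge_attr`, `num_nodes`).

```python
def check_graph_record(record):
    """Raise ValueError if a record's fields disagree on node or edge counts."""
    num_nodes = record["num_nodes"]
    sources, targets = record["edge_index"]  # two rows: source and target nodes
    if len(record["node_feat"]) != num_nodes:
        raise ValueError("node_feat must have one row per node")
    if len(sources) != len(targets):
        raise ValueError("edge_index rows must have equal length")
    if len(record["edge_attr"]) != len(sources):
        raise ValueError("edge_attr must have one row per edge")
    if any(not 0 <= node < num_nodes for node in sources + targets):
        raise ValueError("edge_index refers to a node outside the graph")
    return True


# Hypothetical toy record with 3 nodes and 4 directed edges.
toy_record = {
    "num_nodes": 3,
    "node_feat": [[1, 0], [0, 1], [1, 1]],
    "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]],
    "edge_attr": [[1], [1], [2], [2]],
    "y": [0],
}
print(check_graph_record(toy_record))
```

Running such a check first tends to surface malformed records with a clearer error than a later failure inside a tensor conversion or a batching step.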

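Background and challenges

Background overview

The AIDS dataset is a graph dataset for screening compounds with anti-HIV activity. It is distributed as part of the TUDataset collection (Morris et al., 2020), and the dataset card also cites the IAM Graph Database Repository (Riesen and Bunke, 2008). Its core problem is molecular classification, specifically the binary task of deciding whether a compound is active against HIV. The dataset grew out of the need for standardized graph datasets to support research on graph-based pattern recognition and machine learning. With 1999 graphs averaging 15.5875 nodes and 32.39 edges each, it serves as a benchmark for molecular classification with graph machine learning.

Current challenges

The main challenges the AIDS dataset poses for molecular classification concern its scale and complexity. Although it contains 1999 graphs, each graph has relatively few nodes and edges, which may limit how well models can learn complex molecular structure. The dataset is not pre-split, so researchers have to use cross-validation, which adds to the cost of training and evaluation. The license is unknown, which may restrict use in some research projects. Finally, because the dataset only supports binary classification, extending it to richer molecular property prediction remains an open question.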

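Common scenarios

Classic use case

In molecular classification, the AIDS dataset is widely used for binary classification of compounds by anti-HIV activity. It provides molecular graph structure, including node features, edge features, and labels, for training and evaluating graph neural networks (GNNs). Cross-validation lets researchers estimate a model's accuracy on the anti-HIV activity prediction task.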
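
To make the evaluation protocol concrete, the sketch below computes accuracy for one fold and then summarizes several folds by mean and sample standard deviation. The function names and all numbers are invented for illustration; they are not results reported for the AIDS dataset.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true binary labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def summarize_folds(fold_accuracies):
    """Return (mean, sample standard deviation) of per-fold accuracies."""
    return mean(fold_accuracies), stdev(fold_accuracies)


# Hypothetical labels and predictions for a single fold.
fold_accuracy = accuracy([0, 1, 1, 0], [0, 1, 0, 0])

# Hypothetical per-fold accuracies, purely illustrative.
mean_acc, std_acc = summarize_folds([0.5, 1.0])
print(fold_accuracy, mean_acc, std_acc)
```

Reporting the spread across folds alongside the mean makes it easier to judge whether a difference between two models exceeds fold-to-fold variation.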

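Derived related work

The AIDS dataset has served as a benchmark in research on graph neural networks. Researchers have used it to develop and compare graph classification algorithms for molecular classification, and to validate graph embedding techniques in graph representation learning. This work also supplies tools and methods relevant to molecular science and drug discovery.

Recent research on the dataset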