The SLC-TAHG (SLC-Tumor Textual Attribute Hyper-relational KG Dataset)
Source: DataCite Commons. Updated: 2026-02-02. Indexed: 2026-05-05.
Download link:
https://www.scidb.cn/detail?dataSetId=d2461f91de8c49e5b545241119d41f1c
Resource description:
The SLC-TAHG dataset is the first Hyper-relational Knowledge Graph (HKG) benchmark specifically designed for the SLC (Solute Carrier) protein-tumor domain. Its primary objective is to provide a rigorous platform for evaluating model performance in cascaded biological reasoning under few-shot constraints, leveraging both the topological structure of biological networks and the rich semantic information in biomedical literature.

Our research focuses on extracting and parsing the cascaded associations between SLC proteins, biological pathways, and diseases. A key feature of SLC-TAHG is its multi-step prediction task: predicting the relationship between an SLC protein and a pathway ($A \to B$), and subsequently predicting the pathway-disease association ($B \to C$) conditioned on the prior result. Preliminary benchmarks indicate that integrating textual attributes (titles and abstracts) provides critical semantic priors that mitigate the sparsity inherent in biomedical data. Furthermore, the dataset reveals that Hypergraph Neural Networks exhibit superior capabilities in capturing multi-dimensional auxiliary attributes, such as specific experimental conditions or tissue contexts, compared to traditional triple-based models.

Dataset Components

The SLC-TAHG dataset consists of three core textual evidence files (JSON) and one comprehensive graph database file (Neo4j dump):

1. Textual Evidence Data (MongoDB JSON format):
   SLCdb.papers.json: A foundational repository of core literature supporting SLC protein research. It contains a large collection of metadata, including titles and abstracts sourced from PubMed, providing deep semantic features for graph nodes.
   SLCdb.slc_pathway.json: Contains literature-based evidence documenting the associations between SLC proteins and biological pathways, serving as the ground truth for the first stage ($A \to B$) of the cascaded reasoning task.
   SLCdb.pathway_disease.json: Stores evidence-based data linking biological pathways to tumor diseases, used for validation and inference in the second stage ($B \to C$) of the cascaded pipeline.

2. Comprehensive Graph Database (Neo4j format):
   neo4j.dump: The primary output of this study, containing the complete Hyper-relational Knowledge Graph. The graph features 8 distinct node types (e.g., SLCProtein, Biological Pathway, Disease, Product, etc.) and 8 types of relational edges. This file can be imported directly into a Neo4j environment for multi-hop path querying, visualization, and the training of Graph Machine Learning models.
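The data flow of the two-stage task ($A \to B$, then $B \to C$ conditioned on the first result) can be sketched with plain lookups. This is only an illustration: it replaces the learned models with ground-truth lookup, and the records are placeholders rather than entries from the dataset. The field names `slc`, `pathway`, `disease`, and `pmid`, as well as the helpers `build_index` and `cascade`, are assumptions; the actual schema of SLCdb.slc_pathway.json and SLCdb.pathway_disease.json should be checked before adapting it.

```python
from collections import defaultdict

# Placeholder records standing in for SLCdb.slc_pathway.json (stage 1) and
# SLCdb.pathway_disease.json (stage 2). Field names are hypothetical.
slc_pathway = [
    {"slc": "SLC_A", "pathway": "PATHWAY_1", "pmid": "PMID_1"},
    {"slc": "SLC_A", "pathway": "PATHWAY_2", "pmid": "PMID_2"},
    {"slc": "SLC_B", "pathway": "PATHWAY_2", "pmid": "PMID_3"},
]
pathway_disease = [
    {"pathway": "PATHWAY_1", "disease": "TUMOR_X", "pmid": "PMID_4"},
    {"pathway": "PATHWAY_2", "disease": "TUMOR_Y", "pmid": "PMID_5"},
    {"pathway": "PATHWAY_2", "disease": "TUMOR_Z", "pmid": "PMID_6"},
]


def build_index(records, key, value):
    """Map each distinct records[key] to the set of records[value] seen with it."""
    index = defaultdict(set)
    for rec in records:
        index[rec[key]].add(rec[value])
    return index


def cascade(slc, stage1, stage2):
    """Chain stage 1 (SLC -> pathway) into stage 2 (pathway -> disease).

    Stage 2 is only evaluated for pathways returned by stage 1, which
    mirrors the "conditioned on the prior result" structure of the task.
    """
    chains = []
    for pathway in sorted(stage1.get(slc, ())):
        for disease in sorted(stage2.get(pathway, ())):
            chains.append((slc, pathway, disease))
    return chains


stage1 = build_index(slc_pathway, "slc", "pathway")
stage2 = build_index(pathway_disease, "pathway", "disease")
chains = cascade("SLC_A", stage1, stage2)
for chain in chains:
    print(" -> ".join(chain))
```

In a real evaluation, each lookup would be replaced by a model's ranked predictions, so errors in the first stage propagate into the second; the sketch only shows how the second stage consumes the first stage's output.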
Provider:
Science Data Bank
Created:
2026-02-02



