Graphcore/GTSQA
Source: Hugging Face · Updated: 2025-11-07 · Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/Graphcore/GTSQA
---
license:
- cc-by-4.0
language:
- en
multilinguality:
- monolingual
size_categories:
- 10k<n<100k
library_name: datasets
task_categories:
- question-answering
- graph-ml
task_ids:
- extractive-qa
pretty_name: GTSQA
configs:
- config_name: gtsqa
  default: true
  data_files:
  - split: train
    path: gtsqa/train.parquet
  - split: test
    path: gtsqa/test.parquet
- config_name: gtsqa-with-graphs
  data_files:
  - split: train
    path: gtsqa-with-graphs/train-*
  - split: test
    path: gtsqa-with-graphs/test-*
---
# Dataset card for GTSQA
## Dataset Summary
GTSQA is a synthetic Knowledge Graph Question Answering (KGQA) dataset constructed from Wikidata using the [SynthKGQA framework](https://github.com/graphcore-research/synth-kgqa). It offers a challenging benchmark for GraphRAG models and KG-augmented LLMs. Because each question comes with the set of ground-truth KG edges required to answer it, the dataset also enables the stand-alone evaluation of a KG retriever's performance. It is specifically designed to test how well KG retrievers generalize to unseen relation types and to unseen isomorphism types of the ground-truth answer subgraph.
## Dataset References
- **Paper:** [Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs](https://arxiv.org/abs/2511.04473)
- **Repository:** [https://github.com/graphcore-research/synth-kgqa](https://github.com/graphcore-research/synth-kgqa)
- **Point of Contact:** please open an issue in the GitHub repository, or use the Hugging Face Community tab
## Dataset Structure
### Data Instances
This is an example of a datapoint in GTSQA:
```json
{
"id": 40513,
"question": "Who directed the Italian film, originally in French, that is based on `The Vicomte of Bragelonne: Ten Years Later'?",
"paraphrased_question": "Who was the director of the Italian film, originally in French, inspired by `The Vicomte of Bragelonne: Ten Years Later'?",
"seed_entities": ["Italy (Q38)", "French (Q150)", "The Vicomte of Bragelonne: Ten Years Later (Q769001)"],
"answer_node": "Fernando Cerchio (Q503508)",
"answer_subgraph": [["Le Vicomte de Bragelonne (Q3228085)", "country of origin (P495)", "Italy (Q38)"],
["Le Vicomte de Bragelonne (Q3228085)", "original language of film or TV show (P364)", "French (Q150)"],
["Le Vicomte de Bragelonne (Q3228085)", "based on (P144)", "The Vicomte of Bragelonne: Ten Years Later (Q769001)"],
["Le Vicomte de Bragelonne (Q3228085)", "director (P57)", "Fernando Cerchio (Q503508)"]],
"sparql_query": "SELECT ?answer WHERE { ?film wdt:P495 wd:Q38; wdt:P364 wd:Q150; wdt:P144 wd:Q769001; wdt:P57 ?answer.}",
"all_answers_wikidata": ["Q503508", "Q679016"],
"full_answer_subgraph_wikidata": [["Q2260875", "P495", "Q38"],
["Q2260875", "P364", "Q150"],
["Q2260875", "P144", "Q769001"],
["Q2260875", "P57", "Q679016"],
["Q3228085", "P495", "Q38"],
["Q3228085", "P364", "Q150"],
["Q3228085", "P144", "Q769001"],
["Q3228085", "P57", "Q503508"]],
"all_answers_wikikg2": ["Q503508"],
"full_answer_subgraph_wikikg2": [["Q3228085", "P364", "Q150"],
["Q3228085", "P57", "Q503508"],
["Q3228085", "P144", "Q769001"],
["Q3228085", "P495", "Q38"]],
"n_hops": 2,
"graph_isomorphism": "((1)(1)(1))",
"redundant": true,
"minimal_graph_isomorphism": "((1)(1))",
"minimal_seeds_and_queries": "{'Q150-Q769001': 'SELECT ?answer WHERE { ?a wdt:P364 wd:Q150. ?a wdt:P57 ?answer. ?a wdt:P144 wd:Q769001.}'}",
"test_type": ["training"]
}
```
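The labelled strings in the example follow the pattern `label (Wikidata ID)`. The sketch below is not part of the SynthKGQA code; it splits such strings and recomputes `n_hops` from `answer_subgraph`. The helper names are hypothetical, and it assumes that hops are counted along shortest paths in the answer subgraph with edge direction ignored.

```python
import re
from collections import deque

# "some label (Q123)" or "some label (P45)": the ID is the final parenthesised token.
_LABELLED = re.compile(r"^(?P<label>.*)\s\((?P<id>[QP]\d+)\)$")


def split_label_and_id(text):
    """Split 'label (ID)' into (label, ID). Hypothetical helper."""
    match = _LABELLED.match(text.strip())
    if match is None:
        raise ValueError(f"not in 'label (ID)' form: {text!r}")
    return match.group("label"), match.group("id")


def entity_id(text):
    """Return only the Wikidata ID of a labelled string."""
    return split_label_and_id(text)[1]


def max_hops(answer_subgraph, seed_entities, answer_node):
    """Largest shortest-path distance between a seed and the answer.

    Assumption: edges of the answer subgraph are treated as undirected.
    """
    neighbours = {}
    for head, _relation, tail in answer_subgraph:
        h, t = entity_id(head), entity_id(tail)
        neighbours.setdefault(h, set()).add(t)
        neighbours.setdefault(t, set()).add(h)

    # Breadth-first search from the answer yields the distance to every node.
    start = entity_id(answer_node)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return max(distance[entity_id(seed)] for seed in seed_entities)


# Fields copied from the example datapoint above.
film = "Le Vicomte de Bragelonne (Q3228085)"
book = "The Vicomte of Bragelonne: Ten Years Later (Q769001)"
example = {
    "seed_entities": ["Italy (Q38)", "French (Q150)", book],
    "answer_node": "Fernando Cerchio (Q503508)",
    "answer_subgraph": [
        [film, "country of origin (P495)", "Italy (Q38)"],
        [film, "original language of film or TV show (P364)", "French (Q150)"],
        [film, "based on (P144)", book],
        [film, "director (P57)", "Fernando Cerchio (Q503508)"],
    ],
}

print(split_label_and_id(example["answer_node"]))  # ('Fernando Cerchio', 'Q503508')
print(max_hops(example["answer_subgraph"],
               example["seed_entities"],
               example["answer_node"]))  # 2
```

For this datapoint the result agrees with the stored value `"n_hops": 2`: every seed reaches the director through the film node.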
### Data Fields
- `id` (int): datapoint id in the dataset.
- `question` (string): question in natural language form.
- `paraphrased_question` (string): LLM-paraphrased question. Only provided for training questions; for test questions, only use the formulation in `question`, which has been curated.
- `seed_entities` (list[string]): Wikidata entities mentioned in the question, in the form "entity label (Wikidata QID)".
- `answer_node` (string): the answer entity used by the LLM to generate the datapoint.
- `answer_subgraph` (list[list[string]]): the subgraph of supporting facts needed to answer the question, used by the LLM to generate the datapoint. Each fact is a KG triple in Wikidata (entity, relation, entity); entities are expressed in the form "entity label (Wikidata QID)", relations in the form "relation label (Wikidata PID)".
- `sparql_query` (string): the [Wikidata SPARQL query](https://query.wikidata.org/) which encodes the natural language question in logical form.
- `all_answers_wikidata` (list[string]): set of all correct question answers, retrieved from Wikidata by running the SPARQL query. We only provide Wikidata QIDs.
- `full_answer_subgraph_wikidata` (list[list[string]]): the full answer subgraph in Wikidata, retrieved by running the SPARQL query in [CONSTRUCT form](https://www.w3.org/TR/rdf-sparql-query/#construct), i.e., the union of the sets of triples used in any valid realization of the query. We only provide Wikidata head/tail QIDs and relation PIDs, for each triple.
- `all_answers_wikikg2` (list[string]): set of all correct question answers in [ogbl-wikikg2](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2), retrieved by running the SPARQL query against it. We only provide Wikidata QIDs.
- `full_answer_subgraph_wikikg2` (list[list[string]]): the full answer subgraph in [ogbl-wikikg2](https://ogb.stanford.edu/docs/linkprop/#ogbl-wikikg2), retrieved by running the SPARQL query in [CONSTRUCT form](https://www.w3.org/TR/rdf-sparql-query/#construct). We only provide Wikidata head/tail QIDs and relation PIDs, for each triple.
- `n_hops` (int): maximum number of hops separating the seed entities from the answer entity.
- `graph_isomorphism` (string): classification, up to isomorphism, of the answer subgraph as a labelled graph (where nodes are labelled as seeds, answer or intermediate; see [paper](https://arxiv.org/abs/2511.04473)).
- `redundant` (bool): whether the question contains redundant information, i.e., if it can be answered with a subset of the seed entities.
- `minimal_graph_isomorphism` (string): isomorphism type of the answer subgraph when discarding paths leading to redundant seed nodes.
- `minimal_seeds_and_queries` (string): minimal subset(s) of seed entities that are sufficient to answer the question, with the corresponding SPARQL query.
- `test_type` (list[string]): generalization type(s) of the test question (`"in-distribution"`; `"unseen_relation_type"`; `"unseen_graph_type"`). Can be disregarded for training questions.
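Since `minimal_seeds_and_queries` is stored as a string, it needs decoding before use. The following is a minimal sketch, assuming every value looks like the one in the example datapoint: a Python dict literal whose keys are seed QIDs joined by `-`. The function names are hypothetical, and treating an empty value as "no minimal subsets" is a guess rather than documented behaviour.

```python
import ast


def parse_minimal_seeds(raw):
    """Decode the field into {frozenset of seed QIDs: SPARQL query}.

    Assumption: `raw` is a Python dict literal, as in the example datapoint.
    """
    if not raw:
        return {}  # guess: an empty value means there are no minimal subsets
    mapping = ast.literal_eval(raw)  # safe evaluation of literals only
    if not isinstance(mapping, dict):
        raise ValueError(f"expected a dict literal, got {type(mapping).__name__}")
    return {frozenset(key.split("-")): query for key, query in mapping.items()}


def droppable_seeds(seed_qids, minimal):
    """For each minimal subset, list the seeds the question does not need."""
    return {subset: set(seed_qids) - subset for subset in minimal}


# Value copied from the example datapoint.
raw = ("{'Q150-Q769001': 'SELECT ?answer WHERE { ?a wdt:P364 wd:Q150. "
       "?a wdt:P57 ?answer. ?a wdt:P144 wd:Q769001.}'}")

minimal = parse_minimal_seeds(raw)
for subset, extra in droppable_seeds(["Q38", "Q150", "Q769001"], minimal).items():
    print(sorted(subset), sorted(extra))  # ['Q150', 'Q769001'] ['Q38']
```

In the example, the seed `Italy (Q38)` is the redundant one, which matches `"redundant": true`.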
### Subsets
#### gtsqa
The full GTSQA dataset, containing 30,477 training questions and 1,622 test questions.
Size: 13 MB
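As a rough sketch, this subset can be loaded with the Hugging Face `datasets` library and the test questions grouped by generalization type. The repository id `Graphcore/GTSQA` is taken from the dataset page, and the config and split names from the card metadata; `load_gtsqa` and `select_by_test_type` are hypothetical helper names.

```python
def load_gtsqa(config="gtsqa", split="test"):
    """Load one split from the Hugging Face Hub (needs network access)."""
    from datasets import load_dataset  # pip install datasets

    return load_dataset("Graphcore/GTSQA", config, split=split)


def select_by_test_type(rows, wanted):
    """Keep the rows whose `test_type` list contains `wanted`."""
    return [row for row in rows if wanted in row["test_type"]]


# Offline illustration with made-up rows that only carry the relevant fields.
rows = [
    {"id": 1, "test_type": ["in-distribution"]},
    {"id": 2, "test_type": ["unseen_relation_type", "unseen_graph_type"]},
    {"id": 3, "test_type": ["unseen_graph_type"]},
]
print([row["id"] for row in select_by_test_type(rows, "unseen_graph_type")])  # [2, 3]

# With network access, the same filter can be applied to the real test split:
# test = load_gtsqa("gtsqa", "test")
# unseen = select_by_test_type(test, "unseen_relation_type")
```

Because `test_type` is a list, a question can fall into more than one generalization group, so the groups may overlap.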
#### gtsqa-with-graphs
This version of the dataset provides an extra data field, `graph`, containing question-specific graphs (extracted from ogbl-wikikg2 with Personalized PageRank, see [script](https://github.com/graphcore-research/synth-kgqa/blob/main/synth_kgqa/compute_neighs_and_sp.py)), each made of up to 30k edges around the seed entities. These are the official graphs to use for retrieval when retrieval from the full KG is not possible. For each triple in the graph, we provide entity/relation labels and Wikidata QIDs/PIDs, in the form "label (Wikidata ID)". If using these graphs for retrieval, use the data in the fields `all_answers_wikikg2` and `full_answer_subgraph_wikikg2` as ground truth.
Size: 14.5 GB
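A retriever that selects triples from `graph` can be scored against `full_answer_subgraph_wikikg2` once both sides are reduced to Wikidata IDs. The sketch below computes triple-level recall and precision; it is one plausible way to do this and not necessarily the metric used in the paper. It assumes that retrieved triples are `[head, relation, tail]` lists in the "label (Wikidata ID)" form, and the function names are hypothetical.

```python
import re

# Captures the trailing "(Q123)" or "(P45)" of a labelled string.
_ID_AT_END = re.compile(r"\(([QP]\d+)\)\s*$")


def to_id(text):
    """'label (Q123)' -> 'Q123'; strings that are already bare IDs pass through."""
    match = _ID_AT_END.search(text)
    return match.group(1) if match else text


def triple_recall_precision(retrieved, ground_truth):
    """Compare two collections of triples after mapping every element to its ID."""
    got = {tuple(to_id(x) for x in triple) for triple in retrieved}
    gold = {tuple(to_id(x) for x in triple) for triple in ground_truth}
    hits = len(got & gold)
    recall = hits / len(gold) if gold else 0.0
    precision = hits / len(got) if got else 0.0
    return recall, precision


# Ground truth copied from `full_answer_subgraph_wikikg2` in the example datapoint.
gold = [
    ["Q3228085", "P364", "Q150"],
    ["Q3228085", "P57", "Q503508"],
    ["Q3228085", "P144", "Q769001"],
    ["Q3228085", "P495", "Q38"],
]

# Hypothetical retriever output: two correct triples and one irrelevant triple.
retrieved = [
    ["Le Vicomte de Bragelonne (Q3228085)", "director (P57)", "Fernando Cerchio (Q503508)"],
    ["Le Vicomte de Bragelonne (Q3228085)", "country of origin (P495)", "Italy (Q38)"],
    ["Italy (Q38)", "capital (P36)", "Rome (Q220)"],
]

recall, precision = triple_recall_precision(retrieved, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.50 precision=0.67
```

Answer-level scores can be computed the same way by comparing predicted QIDs with `all_answers_wikikg2`.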
## Licensing Information
The dataset is released under the [Creative Commons Attribution 4.0](https://creativecommons.org/licenses/by/4.0/deed.en) license.
## Citation
When using the GTSQA dataset, please cite the paper.
```bibtex
@misc{cattaneo2025,
title={Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs},
author={Alberto Cattaneo and Carlo Luschi and Daniel Justus},
year={2025},
eprint={2511.04473},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2511.04473},
}
```
Provider: Graphcore



