
s-nlp/ShortPathQA

Hugging Face · Updated 2026-03-19 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/s-nlp/ShortPathQA
---
language:
- en
license: apache-2.0
task_categories:
- question-answering
task_ids:
- open-domain-qa
tags:
- knowledge-graph
- wikidata
- KGQA
- subgraph
- reasoning
pretty_name: ShortPathQA
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: sample_id
    dtype: string
  - name: question
    dtype: string
  - name: questionEntity
    dtype: string
  - name: answerEntity
    dtype: string
  - name: groundTruthAnswerEntity
    dtype: string
  - name: answerEntityId
    dtype: string
  - name: questionEntityId
    dtype: string
  - name: groundTruthAnswerEntityId
    dtype: string
  - name: correct
    dtype: string
  - name: graph
    dtype: string
  splits:
  - name: train
    num_examples: 49923
  - name: test
    num_examples: 10961
  - name: manual_test
    num_examples: 3818
---

# ShortPathQA

**ShortPathQA** is the first QA benchmark that pairs natural-language questions with **pre-computed shortest-path subgraphs from Wikidata**, providing a standardized test bed for *controllable fusion* of **large language models (LLMs) and knowledge graphs (KGs)**.

## Dataset Summary

Unlike existing KGQA datasets, ShortPathQA removes the heavy lifting of entity linking and path-finding: every sample already contains the ground-truth subgraph connecting the question entities to each answer candidate. This lets researchers focus on **how** a model reasons over graph structure rather than **how** it retrieves it, enabling direct comparison across studies.

- **12,526 questions** (from Mintaka + 350 hand-curated hard cases)
- **143,061 question–candidate pairs** with pre-computed Wikidata subgraphs
- Task: binary classification, *"Is candidate c the correct answer to question q?"*
- Apache-2.0 license

## Dataset Structure

### Splits

| Split | File | Rows | Description |
|---|---|---|---|
| `train` | `train_full.tsv` | 49,923 | Training set (from Mintaka train split) |
| `test` | `test.tsv` | 10,961 | Automatic test set (from Mintaka test split) |
| `manual_test` | `human_annotated_test.tsv` | 3,818 | Manual test set: 350 new questions curated by experts, not seen by any LLM |

### Fields

Each row represents one **question–candidate pair**:

| Column | Type | Description |
|---|---|---|
| `sample_id` | string | Unique pair identifier |
| `question` | string | Natural language question |
| `questionEntity` | string | Comma-separated labels of Wikidata entities mentioned in the question |
| `questionEntityId` | string | Comma-separated Wikidata IDs of question entities (e.g. `Q8093, Q9351`) |
| `answerEntity` | string | Label of the answer candidate entity |
| `answerEntityId` | string | Wikidata ID of the answer candidate (e.g. `Q864`) |
| `groundTruthAnswerEntity` | string | Label of the correct answer entity |
| `groundTruthAnswerEntityId` | string | Wikidata ID of the correct answer |
| `correct` | string | `True` if this candidate is the correct answer, `False` otherwise |
| `graph` | string | JSON-serialized Wikidata subgraph (union of shortest paths from question entities to the candidate) |

### Graph Format

The `graph` field is a JSON string with two keys:

- `nodes`: list of nodes, each with:
  - `name_`: Wikidata entity ID (e.g. `"Q864"`)
  - `label`: human-readable name
  - `type`: one of `QUESTIONS_ENTITY`, `ANSWER_CANDIDATE_ENTITY`, `INTERNAL`
  - `id`: integer index used in `links`
- `links`: list of edges, each with:
  - `source`, `target`: integer node indices
  - `name_`: Wikidata property ID (e.g. `"P31"`)
  - `label`: human-readable relation name

**Example entry:**

```json
{
  "question": "\"Pikachu\" comes from what famous Nintendo game?",
  "questionEntity": "Nintendo, Pikachu",
  "questionEntityId": "Q8093, Q9351",
  "answerEntity": "Pokémon",
  "answerEntityId": "Q864",
  "groundTruthAnswerEntity": "Pokémon",
  "groundTruthAnswerEntityId": "Q864",
  "correct": "True",
  "graph": {
    "nodes": [
      {"type": "QUESTIONS_ENTITY", "name_": "Q8093", "id": 0, "label": "Nintendo"},
      {"type": "ANSWER_CANDIDATE_ENTITY", "name_": "Q864", "id": 1, "label": "Pokémon"},
      {"type": "QUESTIONS_ENTITY", "name_": "Q9351", "id": 2, "label": "Pikachu"}
    ],
    "links": [
      {"name_": "P123", "source": 1, "target": 0, "label": "publisher"},
      {"name_": "P8345", "source": 2, "target": 1, "label": "media franchise"}
    ]
  }
}
```

## Usage

```python
from datasets import load_dataset
import json

ds = load_dataset("s-nlp/ShortPathQA")

# Access a training sample
sample = ds["train"][0]
graph = json.loads(sample["graph"].replace("'", '"'))  # parse graph JSON

print(sample["question"])
print("Correct answer:", sample["groundTruthAnswerEntity"])
print("This candidate:", sample["answerEntity"], "| Label:", sample["correct"])
```

## Dataset Creation

Questions are sourced from [Mintaka](https://github.com/amazon-science/mintaka) (English split, excluding *count*-type questions). Each question is annotated with Wikidata entities; answer candidates are generated by LLMs (T5-based and Mixtral/Mistral) and linked to Wikidata. Subgraphs are computed as the union of shortest paths between question entities and each candidate entity in a Wikidata graph built from an official Wikidata JSON dump.

The manual test set consists of 350 new questions written to mirror Mintaka structure but not exposed to any LLM during dataset construction.

## Citation

```bibtex
@inproceedings{salnikov2025shortpathqa,
  title={ShortPathQA: A Dataset for Controllable Fusion of Large Language Models with Knowledge Graphs},
  author={Salnikov, Mikhail and Sakhovskiy, Andrey and Nikishina, Irina and Usmanova, Aida and Kraft, Angelie and M{\"o}ller, Cedric and Banerjee, Debayan and Huang, Junbo and Jiang, Longquan and Abdullah, Rana and others},
  booktitle={International Conference on Applications of Natural Language to Information Systems},
  pages={95--110},
  year={2025},
  organization={Springer}
}
```

Paper: https://link.springer.com/chapter/10.1007/978-3-031-97141-9_7

GitHub: https://github.com/s-nlp/ShortPathQA
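
## Worked sketch: union of shortest paths

The Dataset Creation section describes each subgraph as the union of shortest paths between the question entities and a candidate, and the Graph Format section describes the `nodes`/`links` layout. The following is a minimal sketch of that idea on a toy graph, not the authors' pipeline. It assumes edges may be traversed in either direction during the path search, uses a hand-written triple list instead of a Wikidata dump, and introduces hypothetical helper names (`shortest_path_edges`, `build_subgraph`, `to_label_triples`) and a made-up distractor node `Q_TOY`.

```python
from collections import deque

# Toy knowledge graph as (subject, property, object) triples. The first two
# triples mirror the example entry; the third is a made-up distractor.
TRIPLES = [
    ("Q864", "P123", "Q8093"),    # Pokémon --publisher--> Nintendo
    ("Q9351", "P8345", "Q864"),   # Pikachu --media franchise--> Pokémon
    ("Q9351", "P31", "Q_TOY"),    # hypothetical node, not a real Wikidata ID
]
LABELS = {
    "Q864": "Pokémon", "Q8093": "Nintendo", "Q9351": "Pikachu",
    "Q_TOY": "toy node", "P123": "publisher",
    "P8345": "media franchise", "P31": "instance of",
}


def shortest_path_edges(triples, source, target):
    """Return every triple that lies on some shortest source-target path.

    Assumption of this sketch: edges are walked in both directions.
    """
    neighbours = {}
    for s, p, o in triples:
        neighbours.setdefault(s, []).append((o, (s, p, o)))
        neighbours.setdefault(o, []).append((s, (s, p, o)))

    # Breadth-first search that records all predecessors at distance - 1.
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt, triple in neighbours.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [(node, triple)]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append((node, triple))

    # Walk back from the target, collecting the edges of all shortest paths.
    edges, stack, seen = set(), [target], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for prev, triple in parents.get(node, []):
            edges.add(triple)
            stack.append(prev)
    return edges


def build_subgraph(triples, question_ids, candidate_id, labels=LABELS):
    """Union the shortest paths and emit the documented nodes/links layout."""
    edges = set()
    for qid in question_ids:
        edges |= shortest_path_edges(triples, qid, candidate_id)

    names = {candidate_id, *question_ids}
    names |= {n for s, _, o in edges for n in (s, o)}
    index = {name: i for i, name in enumerate(sorted(names))}

    def node_type(name):
        if name == candidate_id:
            return "ANSWER_CANDIDATE_ENTITY"
        return "QUESTIONS_ENTITY" if name in question_ids else "INTERNAL"

    nodes = [{"type": node_type(n), "name_": n, "id": i,
              "label": labels.get(n, n)} for n, i in index.items()]
    links = [{"name_": p, "source": index[s], "target": index[o],
              "label": labels.get(p, p)} for s, p, o in sorted(edges)]
    return {"nodes": nodes, "links": links}


def to_label_triples(graph):
    """Read a nodes/links graph back as (subject, relation, object) labels."""
    by_id = {n["id"]: n["label"] for n in graph["nodes"]}
    return [(by_id[e["source"]], e["label"], by_id[e["target"]])
            for e in graph["links"]]


graph = build_subgraph(TRIPLES, ["Q8093", "Q9351"], "Q864")
for triple in to_label_triples(graph):
    print(triple)
```

On this toy input the distractor edge to `Q_TOY` is dropped because it lies on no shortest path to the candidate. `to_label_triples` is one way to turn a parsed `graph` value into text that could be placed in an LLM prompt.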

Provider:
s-nlp