graphwalks

Name: graphwalks
Creator: maas
Published: 2026-05-08 15:30:49
License: No description available

ModelScope Community, updated 2026-05-08, indexed 2025-04-19

Download link:

https://modelscope.cn/datasets/openai-mirror/graphwalks


Description:

# GraphWalks: a multi-hop reasoning long-context benchmark

In GraphWalks, the model is given a graph represented by its edge list and asked to perform an operation.

Example prompt:

```
You will be given a graph as a list of directed edges. All nodes are at least degree 1.
You will also get a description of an operation to perform on the graph.
Your job is to execute the operation on the graph and return the set of nodes that the operation results in.
If asked for a breadth-first search (BFS), only return the nodes that are reachable at that depth, do not return the starting node.
If asked for the parents of a node, only return the nodes that have an edge leading to the given node, do not return the given node itself.

The graph has the following edges:
uvwx -> alke
abcd -> uvwx
abcd -> efgh
efgh -> uvwx

Example 1:
Operation:
Perform a BFS from node abcd with depth 1.
Final Answer: [uvwx, efgh]
```

## Data schema

|column|description|
|------|-----------|
|`prompt`| A 3-shot example followed by the graph and the operation to be performed. This is meant to be supplied to the model as a user message.|
|`answer`| A list of node ids that the model should respond with.|
|`prompt_chars`| The number of characters in the prompt.|
|`problem_type`| Either `bfs` or `parents` for the graph operation requested.|

## Extraction and Grading

We use the following code to extract answers from responses:

```python
import re


def get_list(self, response: str) -> tuple[list[str], bool]:
    # get the very last line of the response
    line = response.split("\n")[-1]
    # check if formatted correctly
    if "Final Answer:" not in line:
        return [], True
    list_part = re.search(r"Final Answer: ?\[.*\]", line)
    if list_part:
        result_list = list_part.group(0).strip("[]").split(",")
        # if the list was empty, then get [] not [""]
        result_list = [item.strip() for item in result_list if item.strip()]
        return result_list, False
    else:
        return [], True
```

We grade each example with the following:

```python
n_overlap = len(sampled_set & truth_set)
recall = n_overlap / n_golden if n_golden > 0 else 0
precision = n_overlap / n_sampled if n_sampled > 0 else 0
f1 = 2 * (recall * precision) / (recall + precision) if recall + precision > 0 else 0
```

## OpenAI results

Please refer to the [GPT 4.1 blog post](https://openai.com/index/gpt-4-1/).

## Changelog

- 4/12/2025: Initial dataset published
- 2/27/2026: Bugfix: A bug during generation led 24/400 parents samples in `graphwalks_128k_and_shorter.parquet` to contain incorrect ground truth - the root node was inadvertently included. Additionally, in BFS there was ambiguity in the prompt - during normal BFS, revisited nodes are not added to the frontier. However, as worded, the model is asked to "return the nodes that are reachable at that depth", which would imply including revisited nodes. The prompt has been modified to specify that only nodes at exactly the desired depth should be returned. Thank you to the Claude Opus 4.6 system card for pointing out the issue in the parent samples!
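
The two operations can be illustrated with a minimal reference sketch. It is not the dataset's generation code: the function names `parse_edges`, `parents`, and `bfs_at_depth` are hypothetical, and the BFS follows one reading of the exact-depth rule described in the changelog, namely that a node counts only at the depth where it is first reached (assuming `depth >= 1`).

```python
def parse_edges(edge_lines: list[str]) -> list[tuple[str, str]]:
    """Parse lines of the form 'src -> dst' into (src, dst) pairs."""
    edges = []
    for line in edge_lines:
        src, dst = line.split("->")
        edges.append((src.strip(), dst.strip()))
    return edges


def parents(edges: list[tuple[str, str]], node: str) -> set[str]:
    """Nodes with an edge leading to `node`, excluding `node` itself."""
    return {src for src, dst in edges if dst == node and src != node}


def bfs_at_depth(edges: list[tuple[str, str]], start: str, depth: int) -> set[str]:
    """Nodes first reached at exactly `depth` hops from `start` (depth >= 1)."""
    children: dict[str, set[str]] = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)

    visited = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier: set[str] = set()
        for node in frontier:
            # Revisited nodes are not added to the frontier again.
            next_frontier |= children.get(node, set()) - visited
        visited |= next_frontier
        frontier = next_frontier
    return frontier


# The graph from the example prompt.
edges = parse_edges([
    "uvwx -> alke",
    "abcd -> uvwx",
    "abcd -> efgh",
    "efgh -> uvwx",
])
print(sorted(bfs_at_depth(edges, "abcd", 1)))  # ['efgh', 'uvwx']
print(sorted(bfs_at_depth(edges, "abcd", 2)))  # ['alke']
print(sorted(parents(edges, "uvwx")))          # ['abcd', 'efgh']
```

At depth 2, `uvwx` is reachable again through `efgh`, but it was already visited at depth 1, so under this reading only `alke` is returned.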

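
The extraction and grading steps described under "Extraction and Grading" can be sketched as two standalone functions. This is an illustration under stated assumptions, not the official grader: `extract_answer` and `f1_score` are hypothetical names, `n_golden` and `n_sampled` are assumed to be the sizes of the deduplicated sets, and F1 is assumed to fall back to 0 when precision and recall are both 0. Note that in the extraction code as printed, `list_part.group(0)` includes the `Final Answer:` prefix, so `strip("[]")` leaves it attached to the first item; the sketch uses a capture group instead.

```python
import re


def extract_answer(response: str) -> tuple[list[str], bool]:
    """Return (node_ids, parse_failed) from the last line of a response."""
    last_line = response.split("\n")[-1]
    # Capture only the text between the brackets.
    match = re.search(r"Final Answer: ?\[(.*)\]", last_line)
    if match is None:
        return [], True
    items = [item.strip() for item in match.group(1).split(",")]
    # An empty list yields [] rather than [""].
    return [item for item in items if item], False


def f1_score(sampled: list[str], truth: list[str]) -> float:
    """Set-based F1 between predicted and ground-truth node ids."""
    sampled_set, truth_set = set(sampled), set(truth)
    n_overlap = len(sampled_set & truth_set)
    recall = n_overlap / len(truth_set) if truth_set else 0.0
    precision = n_overlap / len(sampled_set) if sampled_set else 0.0
    if recall + precision == 0:
        return 0.0  # assumed fallback when there is no overlap
    return 2 * recall * precision / (recall + precision)


response = "abcd has two outgoing edges.\nFinal Answer: [uvwx, efgh]"
nodes, failed = extract_answer(response)
print(nodes, failed)                       # ['uvwx', 'efgh'] False
print(f1_score(nodes, ["uvwx", "efgh"]))   # 1.0
```

A response whose last line lacks a well-formed `Final Answer: [...]` is reported as a parse failure with an empty prediction, which then scores 0 against any non-empty ground truth.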

Provider:

maas

Created:

2025-04-22

Summary

Dataset Introduction