graphwalks
Source: ModelScope community. Updated 2026-05-08; indexed 2025-04-19.
Download link: https://modelscope.cn/datasets/openai-mirror/graphwalks
Resource description:
# GraphWalks: a multi-hop reasoning long-context benchmark
In GraphWalks, the model is given a graph represented by its edge list and asked to perform an operation on it.
Example prompt:
```
You will be given a graph as a list of directed edges. All nodes are at least degree 1.
You will also get a description of an operation to perform on the graph.
Your job is to execute the operation on the graph and return the set of nodes that the operation results in.
If asked for a breadth-first search (BFS), only return the nodes that are reachable at that depth, do not return the starting node.
If asked for the parents of a node, only return the nodes that have an edge leading to the given node, do not return the given node itself.
The graph has the following edges:
uvwx -> alke
abcd -> uvwx
abcd -> efgh
efgh -> uvwx
Example 1:
Operation:
Perform a BFS from node abcd with depth 1.
Final Answer: [uvwx, efgh]
```
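
To make the two operations concrete, here is a minimal sketch (not the benchmark's own code) that parses the `src -> dst` edge lines from the example prompt and computes the parents of a node and the depth-1 BFS result. The helper names `parse_edges`, `parents_of`, and `successors_of` are hypothetical.

```python
def parse_edges(text: str) -> list[tuple[str, str]]:
    """Parse lines of the form 'src -> dst' into (src, dst) pairs."""
    edges = []
    for line in text.splitlines():
        if "->" in line:
            src, dst = line.split("->", 1)
            edges.append((src.strip(), dst.strip()))
    return edges


def parents_of(edges: list[tuple[str, str]], node: str) -> set[str]:
    """Nodes with an edge leading to `node`, excluding `node` itself."""
    return {src for src, dst in edges if dst == node and src != node}


def successors_of(edges: list[tuple[str, str]], node: str) -> set[str]:
    """Depth-1 BFS result: nodes one edge away from `node`, excluding `node`."""
    return {dst for src, dst in edges if src == node and dst != node}


# Edge list copied from the example prompt.
EXAMPLE_EDGES = """\
uvwx -> alke
abcd -> uvwx
abcd -> efgh
efgh -> uvwx
"""

edges = parse_edges(EXAMPLE_EDGES)
print(sorted(successors_of(edges, "abcd")))  # ['efgh', 'uvwx']
print(sorted(parents_of(edges, "uvwx")))     # ['abcd', 'efgh']
```

The depth-1 result matches the `Final Answer` in the example prompt; order does not matter because the answer is treated as a set.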
## Data schema
|column|description|
|------|-----------|
|`prompt`| A 3-shot example followed by the graph and the operation to be performed. This is meant to be supplied to the model as a user message.|
|`answer`| A list of node ids that the model should respond with.|
|`prompt_chars`| The number of characters in the prompt.|
|`problem_type`| Either `bfs` or `parents`, indicating the graph operation requested.|
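
As a rough illustration of the schema, a row can be modelled as a typed dictionary. The sketch below uses a made-up row (its values are illustrative, not taken from the dataset) and a hypothetical `check_row` helper. Treating `prompt_chars` as equal to `len(prompt)` is an assumption based on the column description.

```python
from typing import TypedDict


class GraphWalksRow(TypedDict):
    prompt: str
    answer: list[str]
    prompt_chars: int
    problem_type: str


def check_row(row: GraphWalksRow) -> list[str]:
    """Return a list of schema problems found in `row` (empty if none)."""
    problems = []
    if row["problem_type"] not in ("bfs", "parents"):
        problems.append("unknown problem_type")
    if row["prompt_chars"] != len(row["prompt"]):
        problems.append("prompt_chars does not match len(prompt)")
    if not all(isinstance(node, str) for node in row["answer"]):
        problems.append("answer must be a list of node id strings")
    return problems


# Illustrative row only; real prompts are far longer.
prompt = (
    "The graph has the following edges:\n"
    "abcd -> uvwx\n"
    "Operation:\n"
    "Find the parents of node uvwx."
)
row: GraphWalksRow = {
    "prompt": prompt,
    "answer": ["abcd"],
    "prompt_chars": len(prompt),
    "problem_type": "parents",
}
print(check_row(row))  # []
```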
## Extraction and Grading
We use the following code to extract answers from responses:
```python
import re


def get_list(self, response: str) -> tuple[list[str], bool]:
    """Return the parsed node ids and a flag that is True when the response is malformed."""
    # get the very last line of the response
    line = response.split("\n")[-1]
    # check if formatted correctly
    if "Final Answer:" not in line:
        return [], True
    # capture only the bracket contents so the "Final Answer:" prefix
    # does not end up in the first item
    list_part = re.search(r"Final Answer: ?\[(.*)\]", line)
    if list_part:
        result_list = list_part.group(1).split(",")
        # if the list was empty, then get [] not [""]
        result_list = [item.strip() for item in result_list if item.strip()]
        return result_list, False
    else:
        return [], True
```
We grade each example with the following:
```python
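# sampled_set: node ids extracted from the response; truth_set: node ids in `answer`
n_sampled = len(sampled_set)
n_golden = len(truth_set)
n_overlap = len(sampled_set & truth_set)
recall = n_overlap / n_golden if n_golden > 0 else 0
precision = n_overlap / n_sampled if n_sampled > 0 else 0
# with no overlap, recall and precision are both 0 and so is F1
f1 = 2 * (recall * precision) / (recall + precision) if recall + precision > 0 else 0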
```
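
For a worked example of the metric, the sketch below wraps the computation in a hypothetical `f1_score` function that takes the predicted and ground-truth node lists. It takes `n_sampled` and `n_golden` to be the sizes of the two sets and scores a response with no overlap as 0. Both choices are assumptions of this sketch, not confirmed details of the official grader.

```python
def f1_score(sampled: list[str], truth: list[str]) -> float:
    """Set-based F1 between predicted and ground-truth node ids."""
    sampled_set, truth_set = set(sampled), set(truth)
    n_overlap = len(sampled_set & truth_set)
    recall = n_overlap / len(truth_set) if truth_set else 0.0
    precision = n_overlap / len(sampled_set) if sampled_set else 0.0
    if recall + precision == 0:
        # Assumption: no overlap scores 0.
        return 0.0
    return 2 * recall * precision / (recall + precision)


# One spurious node: precision = 2/3, recall = 1, so F1 is about 0.8.
print(f1_score(["uvwx", "efgh", "alke"], ["uvwx", "efgh"]))
# Exact match, in any order, scores 1.0.
print(f1_score(["efgh", "uvwx"], ["uvwx", "efgh"]))
```

Because the comparison is set-based, duplicate node ids and ordering in the response do not affect the score.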
## OpenAI results
Please refer to the [GPT-4.1 blog post](https://openai.com/index/gpt-4-1/).
## Changelog
- 4/12/2025: Initial dataset published
- 2/27/2026: Bugfix: a bug during generation caused 24 of the 400 `parents` samples in `graphwalks_128k_and_shorter.parquet` to contain incorrect ground truth, because the root node was inadvertently included.
Additionally, the BFS prompt was ambiguous: in a normal BFS, revisited nodes are not added to the frontier.
However, as worded, the model was asked to "return the nodes that are reachable at that depth", which would imply including revisited nodes. The prompt has been modified to specify that only nodes at exactly the desired depth should be returned.
Thank you to the Claude Opus 4.6 system card for pointing out the issue in the parent samples!
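
The BFS ambiguity can be seen on the small example graph from the prompt. The sketch below (hypothetical function names, not the dataset's generation code) contrasts two readings of "depth 2 from `abcd`": nodes reached by any walk of exactly two edges, and nodes whose shortest distance from the start is exactly two. The second reading appears to be the one the revised prompt specifies.

```python
EDGES = [("uvwx", "alke"), ("abcd", "uvwx"), ("abcd", "efgh"), ("efgh", "uvwx")]


def build_adjacency(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each source node to the set of nodes it points to."""
    adj: dict[str, set[str]] = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
    return adj


def nodes_by_walk(edges, start: str, depth: int) -> set[str]:
    """Nodes reached by some walk of exactly `depth` edges (revisits allowed)."""
    adj = build_adjacency(edges)
    frontier = {start}
    for _ in range(depth):
        frontier = {nxt for node in frontier for nxt in adj.get(node, ())}
    return frontier - {start}


def nodes_at_exact_depth(edges, start: str, depth: int) -> set[str]:
    """Nodes whose shortest distance from `start` is exactly `depth`."""
    adj = build_adjacency(edges)
    visited = {start}
    frontier = {start}
    for _ in range(depth):
        # Standard BFS: already-visited nodes do not re-enter the frontier.
        frontier = {nxt for node in frontier for nxt in adj.get(node, ())} - visited
        visited |= frontier
    return frontier - {start}


print(sorted(nodes_by_walk(EDGES, "abcd", 2)))         # ['alke', 'uvwx']
print(sorted(nodes_at_exact_depth(EDGES, "abcd", 2)))  # ['alke']
```

Here `uvwx` is reachable in two steps via `efgh`, but it is already at depth 1, so a standard BFS reports only `alke` at depth 2. At depth 1 the two readings agree.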
Provider: maas
Created: 2025-04-22
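Dataset introduction

Background overview
GraphWalks is a benchmark dataset for testing multi-hop reasoning in long-context settings. Each sample contains a graph's edge list and an operation instruction; the model must perform an operation such as breadth-first search (BFS) or finding a node's parents and return the correct set of nodes. Model performance is measured with an F1 score over the returned node set, and errors found in earlier versions of the dataset have been fixed.
The summary above was collected and generated by 遇见数据集.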



