NasimBrz/SearchBench
收藏Hugging Face2024-06-13 更新2024-06-29 收录
下载链接:
https://hf-mirror.com/datasets/NasimBrz/SearchBench
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc
task_categories:
- question-answering
language:
- en
tags:
- code
- math
- reasoning
- combinatorial_problems
- optimization
pretty_name: SearchBench
size_categories:
- 1K<n<10K
---
# Dataset Card for SearchBench
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-instances)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Repository:** https://github.com/NasimBorazjani/Navigating_Labyrinth.git
- **Paper:**
- **Point of Contact:** nasimborazjani@berkeley.edu & nb11@williams.edu
### Dataset Summary
SearchBench is a benchmark designed to evaluate Language Models' (LLMs) ability to solve state-based problems that require combinatorial search and backtracking. SearchBench problems require a systematic exploration of action paths and backtracking to feasible states, which poses a significant challenge for LLMs to solve end-to-end, due to their autoregressive next-token prediction architecture.
The dataset is composed of five problem categories: puzzles, subset sum, sorting, pathfinding, and under-determined systems. These categories are further divided into 11 distinct problem types. Each problem type is adapted from NP-hard puzzles and combinatorial problems, with modifications made to the rules and constraints of each problem. The dataset contains approximately 100 instances of each problem type with varying levels of difficulty.
Automated pipelines are implemented for each problem type to generate an unlimited number of solvable instances and to assess the feasibility, correctness, and optimality of the solutions generated by the LLMs.
This benchmark provides insights into LLMs' capacity to implement new algorithms to solve complex problems. It also investigates the non-linear reasoning capability of LLMs to solve search problems end-to-end using only text.
### Languages
The text in the dataset is in English. The associated BCP-47 code is `en`.
## Dataset Structure
### Data Instances
Each instance contains a natural language description of a problem and the optimal solution for the problem calculated using an A* algorithm with provable admissible and consistent heuristic implementation.
```python
{
“problem_statement”: “In the game 'Sort the Chars', we are given a table of n by m dimensions. This table contains n words, each with m characters, except for the first word which has m - 1 characters. Each character is written on a separate tile. The objective of the game is to rearrange the characters such that row i spells the i-th word in the list, with the blank tile ('_') placed in the top left corner of the board in the end. We can rearrange the tiles by swapping the blank space with any of its 4 diagonal neighboring tiles. Given the list of words and initial state of the board below, where the black space is represented as '_', what is the shortest list of swap actions (reported in python syntax) that can sort the board into the given list of target words? The list must only include the 4 diagonal swap directions: up-right, down-right, up-left, or down-left, representing the direction in ehich the blank space was swpped in. Target words: cam, hill, pray, doer The initial board: [['i', 'c', 'a', 'm'], ['h', 'p', 'l', 'o'], ['_', 'r', 'a', 'y'], ['d', 'l', 'e', 'r']]”,
“opt_solution”: "['up-right', 'down-right', 'down-left', 'up-left', 'up-right', 'down-right', 'up-right', 'up-left', 'down-left', 'down-left', 'down-right', 'up-right', 'up-right', 'up-left', 'down-left', 'up-left']",
}
```
### Data Fields
Each instance in the dataset comes with several additional fields. These fields offer insights about the problem and allow for automatic evaluation of the solutions generated by the LLMs across multiple dimensions. For a more detailed description of the fields for each instance, please refer to the supplementary material of the paper.
- diff_sorted_id: A unique identifier for each problem, sorted by difficulty level within a specific problem type.
- problem_statement: The description of the problem to be solved, given to language models.
- problem_type: The type of problem, out of 11 possible types.
- problem_category: The category of the problem, out of five possible categories.
- relative_diff_score: A score indicating the problem's difficulty relative to other problems of the same type.
- opt_solution: A list of actions leading to the goal state with the minimum cost.
- opt_solution_cost: The cost of the optimal solution for the problem.
- opt_solution_compute_t: The time taken by the A* implementation to solve the problem.
- solution_depth: The number of actions required to reach the goal state with minimum cost.
- max_successor_states: The maximum number of successor states that can be reached from any state in the problem.
- num_vars_per_state: An upper limit on the number of variables in each state of the problem.
- is_feasible_args: Variables that must be passed to the 'is_feasible' function to check if a solution adheres to the problem's rules and constraints.
- is_correct_args: Variables that must be passed to the 'is_correct' function to evaluate the correctness of a solution.
- A*_args: Variables that must be passed to the A* implementation to obtain the optimal solution for the problem.
## Additional Information
### Licensing Information
The SearchBench dataset is licensed under the [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/).
### Citation Information
```bibtex
```
提供机构:
NasimBrz
原始信息汇总
SearchBench 数据集概述
数据集描述
数据集摘要
SearchBench 是一个用于评估语言模型(LLMs)解决基于状态问题的基准测试,这些问题需要组合搜索和回溯。数据集包含五个问题类别:谜题、子集和、排序、路径查找和欠定系统,进一步细分为11种不同的问题类型。每种问题类型都改编自NP难的谜题和组合问题,并进行了规则和约束的修改。数据集包含约100个每种问题类型的实例,难度各异。
语言
数据集中的文本为英语,对应的BCP-47代码为 en。
数据集结构
数据实例
每个实例包含一个自然语言描述的问题及其使用A*算法计算的最优解。
数据字段
每个实例包含以下字段:
diff_sorted_id: 每个问题的唯一标识符,按难度级别排序。problem_statement: 问题的描述。problem_type: 问题类型,共11种。problem_category: 问题类别,共5种。relative_diff_score: 问题的相对难度评分。opt_solution: 达到目标状态的最小成本动作列表。opt_solution_cost: 最优解的成本。opt_solution_compute_t: A*算法解决问题的时间。solution_depth: 达到目标状态所需的最小动作数。max_successor_states: 从任何状态可以到达的最大后继状态数。num_vars_per_state: 每个状态中的变量上限。is_feasible_args: 传递给is_feasible函数以检查解是否符合问题规则和约束的变量。is_correct_args: 传递给is_correct函数以评估解正确性的变量。A*_args: 传递给A*实现以获得问题最优解的变量。
附加信息
许可信息
SearchBench 数据集基于 CC BY-SA 许可。



