JetBrains-Research/REval
Hugging Face; updated 2026-03-02; indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/JetBrains-Research/REval
Description:
---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
- code
tags:
- code
- program-analysis
- runtime-behavior
- execution-traces
- code-reasoning
- python
pretty_name: "REval: Reasoning Evaluation"
size_categories:
- n<1K
dataset_info:
- config_name: problems
  features:
  - name: task_id
    dtype: string
  - name: code
    dtype: string
  - name: entry_point
    dtype: string
  - name: test
    dtype: string
  - name: inputs
    sequence: string
  - name: outputs
    sequence: string
  splits:
  - name: test
    num_examples: 154
- config_name: tasks
  features:
  - name: task_id
    dtype: string
  - name: idx
    dtype: int32
  - name: tasks
    dtype: string
  splits:
  - name: test
    num_examples: 154
- config_name: executions
  features:
  - name: task_id
    dtype: string
  - name: idx
    dtype: int32
  - name: input_idx
    dtype: int32
  - name: problem_type
    dtype: string
  - name: input
    dtype: string
  - name: expected_output
    dtype: string
  - name: actual_output
    dtype: string
  - name: status
    dtype: string
  - name: trace
    sequence: int32
  - name: coverage
    sequence: int32
  - name: num_states
    dtype: int32
  - name: code_hash
    dtype: string
  - name: error
    dtype: string
  splits:
  - name: test
    num_examples: 694
- config_name: states
  features:
  - name: task_id
    dtype: string
  - name: idx
    dtype: int32
  - name: input_idx
    dtype: int32
  - name: states
    dtype: string
  splits:
  - name: test
    num_examples: 694
configs:
- config_name: problems
  data_files:
  - split: test
    path: "data/problems.jsonl"
  default: true
- config_name: tasks
  data_files:
  - split: test
    path: "data/tasks.jsonl"
- config_name: executions
  data_files:
  - split: test
    path: "data/executions.jsonl"
- config_name: states
  data_files:
  - split: test
    path: "data/states.jsonl"
---
# REval: Reasoning Runtime Behavior of a Program with LLM
> **Disclaimer:** We are not the authors of the REval benchmark. This upload is a convenience repackaging of the original dataset with precomputed execution traces, variable states, and ground truth answers to make the benchmark easier to use programmatically. The original benchmark was created by Junkai Chen et al. and is available at [github.com/r-eval/REval](https://github.com/r-eval/REval/). Please cite the original paper if you use this data.
REval is a benchmark for evaluating Large Language Models' ability to reason about the **runtime behavior** of Python programs.
> **Reasoning Runtime Behavior of a Program with LLM: How Far Are We?**
> Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia
> ICSE 2025
> [Paper](https://doi.org/10.1109/ICSE55347.2025.00087) | [GitHub](https://github.com/r-eval/REval/)
## Dataset Summary
| | |
|---|---|
| **Problems** | 154 (85 HumanEval + 69 ClassEval) |
| **Test-case executions** | 694 (with full execution traces) |
| **Reasoning tasks** | Coverage, Path, State, Output, Consistency |
| **Ground truth** | Line coverage, execution traces, variable states at every step |
| **License** | MIT |
## Reasoning Tasks
1. **Coverage** -- Predict whether a specific line of code is executed for a given input
2. **Path** -- Determine the next line that will be executed after a given line
3. **State** -- Infer variable values at specific execution points
4. **Output** -- Complete test code based on expected execution behavior
5. **Consistency** -- Combined score measuring consistency across all four tasks
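The Coverage and Path questions can both be answered from an execution trace alone. The sketch below illustrates that idea under assumptions and is not the benchmark's official scoring code: the helper names `is_covered` and `next_lines` are hypothetical, and the trace is a made-up list of line numbers.

```python
from typing import List, Set


def is_covered(trace: List[int], lineno: int) -> bool:
    """Coverage: was `lineno` executed at least once in this trace?"""
    return lineno in set(trace)


def next_lines(trace: List[int], lineno: int) -> Set[int]:
    """Path: every line that was executed immediately after `lineno`."""
    return {nxt for cur, nxt in zip(trace, trace[1:]) if cur == lineno}


# Hypothetical trace: a loop body that runs twice before falling through.
example_trace = [11, 12, 13, 12, 13, 14]

print(is_covered(example_trace, 13))          # True
print(is_covered(example_trace, 15))          # False
print(sorted(next_lines(example_trace, 13)))  # [12, 14]
```

A line inside a loop can have more than one successor, which is why `next_lines` returns a set rather than a single line number.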
## Configurations
| Config | Records | Description |
|--------|---------|-------------|
| `problems` (default) | 154 | Problem definitions: code, inputs, expected outputs |
| `tasks` | 154 | Task specifications: which lines/variables to query per input |
| `executions` | 694 | Execution traces and line coverage per (problem, input) pair |
| `states` | 694 | Variable states at each executed line |
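Because `executions` and `states` each hold one record per (problem, input) pair, they can be joined on `(task_id, input_idx)`. This is a minimal sketch, assuming that key is unique within each config. The rows are hand-written stand-ins rather than real dataset records, and `join_by_input` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple

Key = Tuple[str, int]


def join_by_input(
    executions: Iterable[Dict[str, Any]],
    states: Iterable[Dict[str, Any]],
) -> Dict[Key, Dict[str, Any]]:
    """Pair each execution row with the parsed states of the same input."""
    parsed_states: Dict[Key, List[Dict[str, Any]]] = {
        (row["task_id"], row["input_idx"]): json.loads(row["states"])
        for row in states
    }
    joined: Dict[Key, Dict[str, Any]] = {}
    for row in executions:
        key = (row["task_id"], row["input_idx"])
        # Fall back to an empty list if no states row exists for this key.
        joined[key] = {"execution": row, "states": parsed_states.get(key, [])}
    return joined


# Hand-written stand-in rows (not real dataset content).
fake_executions = [{"task_id": "DREval/0", "input_idx": 0, "trace": [0, 1]}]
fake_states = [{
    "task_id": "DREval/0",
    "input_idx": 0,
    "states": json.dumps([{"lineno": 0, "locals": {}}]),
}]

joined = join_by_input(fake_executions, fake_states)
print(joined[("DREval/0", 0)]["states"][0]["lineno"])  # 0
```

The same function should accept the objects returned by `load_dataset`, since iterating over a split yields one dict per row.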
## Usage
```python
import json

from datasets import load_dataset

# This repackaged dataset is hosted under JetBrains-Research on Hugging Face.
REPO_ID = "JetBrains-Research/REval"

# Load problem definitions (default config)
problems = load_dataset(REPO_ID, "problems", split="test")
print(f"{len(problems)} problems")
print(problems[0]["task_id"])      # "DREval/0"
print(problems[0]["entry_point"])  # "has_close_elements"

# Load execution traces
executions = load_dataset(REPO_ID, "executions", split="test")
print(executions[0]["trace"])     # [11, 12, 13, 12, 13, 14, 15, ...]
print(executions[0]["coverage"])  # [11, 12, 13, 14, 15, 16]

# Load variable states (states field is a JSON string -- parse it)
states = load_dataset(REPO_ID, "states", split="test")
state_list = json.loads(states[0]["states"])
# Each state: {"lineno": 0, "locals": {"var": {"__type__": "int", "__value__": 1}}}

# Load task specifications (tasks field is a JSON string)
tasks = load_dataset(REPO_ID, "tasks", split="test")
task_list = json.loads(tasks[0]["tasks"])
# Each task: {"input_idx": 0, "task": [{"lineno": 17, "var": "distance"}, ...], "output_pred": "..."}
```
## Data Fields
### `problems` config
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Unique identifier, e.g. `"DREval/0"` |
| `code` | string | Complete Python source code (signature + docstring + solution) |
| `entry_point` | string | Function or class name |
| `test` | string or null | Unittest code for ClassEval problems; null for HumanEval |
| `inputs` | list[string] | Test inputs as Python expressions |
| `outputs` | list[string] | Expected outputs as strings |
**Problem types:**
- **HumanEval** (idx 0--84): Standalone functions. `test` is null, `outputs` is non-empty.
- **ClassEval** (idx 85--153): OOP classes. `test` contains unittest code, `outputs` is empty.
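The rule above can be turned into a small classifier for `problems` rows. This is a sketch under one assumption: a missing `test` may surface either as `None` or as an empty string, depending on how nulls are materialized on load. The helper name `problem_kind` is hypothetical and the rows are invented stand-ins.

```python
from typing import Any, Dict


def problem_kind(row: Dict[str, Any]) -> str:
    """Classify a `problems` row as 'humaneval' or 'classeval'."""
    # Assumption: both None and "" mean "no unittest code".
    return "classeval" if row.get("test") else "humaneval"


# Invented stand-in rows (not real dataset content).
function_row = {"task_id": "DREval/0", "test": None, "outputs": ["True"]}
class_row = {"task_id": "DREval/85", "test": "import unittest", "outputs": []}

print(problem_kind(function_row))  # humaneval
print(problem_kind(class_row))     # classeval
```

In the `executions` config the same information is available directly in the `problem_type` field.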
### `tasks` config
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Unique identifier |
| `idx` | int | Problem index |
| `tasks` | string (JSON) | JSON-encoded list of per-input task definitions |
Each entry in the parsed `tasks` list contains:
- `input_idx` (int): Index into the problem's inputs/outputs arrays
- `task` (list[object]): Variable queries -- each has `lineno` (1-indexed) and `var` (variable name)
- `output_pred` (string): Output prediction template (e.g. `"assert func(args) == ??"`)
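As an illustration of the structure described above, the following sketch flattens one `tasks` row into individual variable queries. The helper name `iter_state_queries` is hypothetical, and the row is a hand-written stand-in that only mirrors the documented field names.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_state_queries(row: Dict[str, Any]) -> Iterator[Tuple[int, int, str]]:
    """Yield (input_idx, lineno, var) for every variable query in a row."""
    for entry in json.loads(row["tasks"]):
        for query in entry["task"]:
            # `lineno` is 1-indexed here, as documented for the tasks config.
            yield entry["input_idx"], query["lineno"], query["var"]


# Hand-written stand-in row (not real dataset content).
fake_row = {
    "task_id": "DREval/0",
    "idx": 0,
    "tasks": json.dumps([
        {
            "input_idx": 0,
            "task": [{"lineno": 17, "var": "distance"}],
            "output_pred": "assert func(args) == ??",
        }
    ]),
}

print(list(iter_state_queries(fake_row)))  # [(0, 17, 'distance')]
```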
### `executions` config
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Problem identifier |
| `idx` | int | Problem index |
| `input_idx` | int | Which test input was used |
| `problem_type` | string | `"humaneval"` or `"classeval"` |
| `input` | string | The specific input expression |
| `expected_output` | string | Expected output |
| `actual_output` | string | Actual output from execution |
| `status` | string | `"ok"` or `"error"` |
| `trace` | list[int] | 0-indexed line execution sequence |
| `coverage` | list[int] | Sorted unique executed lines (0-indexed) |
| `num_states` | int | Number of state snapshots captured |
| `code_hash` | string | SHA-256 of the source code |
| `error` | string or null | Error message if `status="error"`, null otherwise |
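The field descriptions imply a few invariants that can be checked per row. The sketch below is illustrative only. It assumes `code_hash` is a hex digest over the UTF-8 encoded source, which the card does not state explicitly, and it uses an invented row and a hypothetical helper `check_execution`.

```python
import hashlib
from typing import Any, Dict, List


def check_execution(row: Dict[str, Any], code: str) -> List[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    issues = []
    if row["coverage"] != sorted(set(row["trace"])):
        issues.append("coverage is not the sorted unique set of trace lines")
    # Assumption: hex digest of the UTF-8 encoded source code.
    if row["code_hash"] != hashlib.sha256(code.encode("utf-8")).hexdigest():
        issues.append("code_hash does not match the given source")
    if row["status"] == "error" and row.get("error") is None:
        issues.append("status is 'error' but no error message is present")
    return issues


# Invented stand-in data (not real dataset content).
fake_code = "def f(x):\n    return x + 1\n"
fake_row = {
    "trace": [0, 1, 0, 1],
    "coverage": [0, 1],
    "code_hash": hashlib.sha256(fake_code.encode("utf-8")).hexdigest(),
    "status": "ok",
    "error": None,
}

print(check_execution(fake_row, fake_code))  # []
```

If the hash check fails on real rows, the encoding or digest format assumed here is probably not the one used to build the dataset.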
### `states` config
| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Problem identifier |
| `idx` | int | Problem index |
| `input_idx` | int | Which test input was used |
| `states` | string (JSON) | JSON-encoded list of state objects |
Each state object in the parsed list:
- `lineno` (int): 0-indexed line number
- `locals` (dict): Variable name to typed value envelope (`{"__type__": "int", "__value__": 42}`)
- `return` (optional): Return value in the same envelope format
- `exception` (optional): Exception info if one was raised
**Supported value types in envelopes:** `int`, `float`, `bool`, `str`, `NoneType`, `list`, `tuple`, `set`, `dict`, `Nil` (uninitialized), `numpy.ndarray`, `datetime.datetime`, and custom objects. Special float values: `"nan"`, `"inf"`, `"-inf"`.
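A typed envelope can be turned back into a Python value for the scalar types. The sketch below handles only `int`, `float`, `bool`, `str`, and `NoneType`, and it assumes the special float values appear as strings in `__value__`. Containers, `Nil`, and custom objects are returned unchanged because their nested layout is not specified here. The helper name `decode_scalar` is hypothetical.

```python
import math
from typing import Any, Dict


def decode_scalar(envelope: Dict[str, Any]) -> Any:
    """Decode a scalar envelope; return other envelopes untouched."""
    type_name = envelope.get("__type__")
    value = envelope.get("__value__")
    if type_name == "NoneType":
        return None
    if type_name in ("bool", "str"):
        return value  # assumed to be stored as native JSON values
    if type_name == "int":
        return int(value)
    if type_name == "float":
        # float() also parses the strings "nan", "inf" and "-inf".
        return float(value)
    return envelope  # containers and custom objects are left as-is


print(decode_scalar({"__type__": "int", "__value__": 42}))        # 42
print(decode_scalar({"__type__": "float", "__value__": "-inf"}))  # -inf
print(math.isnan(decode_scalar({"__type__": "float", "__value__": "nan"})))  # True
```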
## Line Number Conventions
- **`executions` config** (`trace`, `coverage`): **0-indexed** line numbers
- **`states` config** (`lineno`): **0-indexed** line numbers
- **`tasks` config** (`lineno` in task queries): **1-indexed** line numbers (for use in prompts)
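Mixing the two conventions is an easy source of off-by-one errors, so it may help to convert in one place. The sketch below looks up a variable from a task query in a parsed state list. The helper names and the state list are hypothetical, and whether a snapshot reflects the state before or after the line runs is not specified here, so treat the result as raw material rather than a ground-truth answer.

```python
from typing import Any, Dict, List


def task_to_state_lineno(lineno_1_indexed: int) -> int:
    """Convert a 1-indexed task line number to the 0-indexed convention."""
    return lineno_1_indexed - 1


def lookup_var(
    state_list: List[Dict[str, Any]], task_lineno: int, var: str
) -> List[Dict[str, Any]]:
    """Return the envelope of `var` at every snapshot on the queried line."""
    target = task_to_state_lineno(task_lineno)
    return [
        state["locals"][var]
        for state in state_list
        if state["lineno"] == target and var in state.get("locals", {})
    ]


# Invented stand-in state list (not real dataset content).
fake_states = [
    {"lineno": 15, "locals": {}},
    {"lineno": 16, "locals": {"distance": {"__type__": "float", "__value__": 1.0}}},
]

print(lookup_var(fake_states, 17, "distance"))
# [{'__type__': 'float', '__value__': 1.0}]
```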
## Known Issues
- Problems `DREval/117` and `DREval/149` import `gensim` (not included in dependencies). Their ground truth records have `status="error"` with empty traces.
- For ClassEval problems, the arrows to the next executed line do not take the test code into account.
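Rows with `status="error"` have empty traces, so it is usually convenient to separate them before computing anything from `trace` or `coverage`. A minimal sketch with a hypothetical helper `split_by_status` and invented rows (the error message shown is made up):

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_by_status(rows: Iterable[Row]) -> Tuple[List[Row], List[Row]]:
    """Split execution rows into (usable, failed) by their status field."""
    usable: List[Row] = []
    failed: List[Row] = []
    for row in rows:
        (usable if row["status"] == "ok" else failed).append(row)
    return usable, failed


# Invented stand-in rows (not real dataset content).
fake_rows = [
    {"task_id": "DREval/0", "status": "ok", "trace": [11, 12]},
    {"task_id": "DREval/117", "status": "error", "trace": [],
     "error": "hypothetical import failure"},
]

usable, failed = split_by_status(fake_rows)
print([r["task_id"] for r in usable])  # ['DREval/0']
print([r["task_id"] for r in failed])  # ['DREval/117']
```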
## Citation
If you use this dataset, please cite:
```bibtex
@inproceedings{chen2025reval,
title = {Reasoning Runtime Behavior of a Program with LLM: How Far Are We?},
author = {Junkai Chen and Zhiyuan Pan and Xing Hu and Zhenhao Li and Ge Li and Xin Xia},
booktitle = {Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE)},
year = {2025},
doi = {10.1109/ICSE55347.2025.00087}
}
```
## Source
- **Repository:** [github.com/r-eval/REval](https://github.com/r-eval/REval/)
- **Paper:** [Reasoning Runtime Behavior of a Program with LLM: How Far Are We?](https://doi.org/10.1109/ICSE55347.2025.00087) (ICSE 2025)
- **License:** MIT
Provided by:
JetBrains-Research



