JetBrains-Research/REval

Name: JetBrains-Research/REval
Creator: JetBrains-Research
Published: 2026-03-02 14:44:35
License: MIT

Source: Hugging Face. Updated: 2026-03-02. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/JetBrains-Research/REval

Resource description:

---
license: mit
task_categories:
- text-generation
- question-answering
language:
- en
- code
tags:
- code
- program-analysis
- runtime-behavior
- execution-traces
- code-reasoning
- python
pretty_name: "REval: Reasoning Evaluation"
size_categories:
- n<1K
dataset_info:
- config_name: problems
  features:
  - name: task_id
    dtype: string
  - name: code
    dtype: string
  - name: entry_point
    dtype: string
  - name: test
    dtype: string
  - name: inputs
    sequence: string
  - name: outputs
    sequence: string
  splits:
  - name: test
    num_examples: 154
- config_name: tasks
  features:
  - name: task_id
    dtype: string
  - name: idx
    dtype: int32
  - name: tasks
    dtype: string
  splits:
  - name: test
    num_examples: 154
- config_name: executions
  features:
  - name: task_id
    dtype: string
  - name: idx
    dtype: int32
  - name: input_idx
    dtype: int32
  - name: problem_type
    dtype: string
  - name: input
    dtype: string
  - name: expected_output
    dtype: string
  - name: actual_output
    dtype: string
  - name: status
    dtype: string
  - name: trace
    sequence: int32
  - name: coverage
    sequence: int32
  - name: num_states
    dtype: int32
  - name: code_hash
    dtype: string
  - name: error
    dtype: string
  splits:
  - name: test
    num_examples: 694
- config_name: states
  features:
  - name: task_id
    dtype: string
  - name: idx
    dtype: int32
  - name: input_idx
    dtype: int32
  - name: states
    dtype: string
  splits:
  - name: test
    num_examples: 694
configs:
- config_name: problems
  data_files:
  - split: test
    path: "data/problems.jsonl"
  default: true
- config_name: tasks
  data_files:
  - split: test
    path: "data/tasks.jsonl"
- config_name: executions
  data_files:
  - split: test
    path: "data/executions.jsonl"
- config_name: states
  data_files:
  - split: test
    path: "data/states.jsonl"
---

# REval: Reasoning Runtime Behavior of a Program with LLM

> **Disclaimer:** We are not the authors of the REval benchmark.
> This upload is a convenience repackaging of the original dataset with precomputed execution traces, variable states, and ground truth answers to make the benchmark easier to use programmatically. The original benchmark was created by Junkai Chen et al. and is available at [github.com/r-eval/REval](https://github.com/r-eval/REval/). Please cite the original paper if you use this data.

REval is a benchmark for evaluating Large Language Models' ability to reason about the **runtime behavior** of Python programs.

> **Reasoning Runtime Behavior of a Program with LLM: How Far Are We?**
> Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia
> ICSE 2025
> [Paper](https://doi.org/10.1109/ICSE55347.2025.00087) | [GitHub](https://github.com/r-eval/REval/)

## Dataset Summary

| | |
|---|---|
| **Problems** | 154 (85 HumanEval + 69 ClassEval) |
| **Test-case executions** | 694 (with full execution traces) |
| **Reasoning tasks** | Coverage, Path, State, Output, Consistency |
| **Ground truth** | Line coverage, execution traces, variable states at every step |
| **License** | MIT |

## Reasoning Tasks

1. **Coverage** -- Predict whether a specific line of code is executed for a given input
2. **Path** -- Determine the next line that will be executed after a given line
3. **State** -- Infer variable values at specific execution points
4. **Output** -- Complete test code based on expected execution behavior
5. **Consistency** -- Combined score measuring consistency across all four tasks

## Configurations

| Config | Records | Description |
|--------|---------|-------------|
| `problems` (default) | 154 | Problem definitions: code, inputs, expected outputs |
| `tasks` | 154 | Task specifications: which lines/variables to query per input |
| `executions` | 694 | Execution traces and line coverage per (problem, input) pair |
| `states` | 694 | Variable states at each executed line |

## Usage

```python
import json

from datasets import load_dataset

# Load problem definitions (default config)
problems = load_dataset("JetBrains-Research/REval", "problems", split="test")
print(f"{len(problems)} problems")
print(problems[0]["task_id"])      # "DREval/0"
print(problems[0]["entry_point"])  # "has_close_elements"

# Load execution traces
executions = load_dataset("JetBrains-Research/REval", "executions", split="test")
print(executions[0]["trace"])     # [11, 12, 13, 12, 13, 14, 15, ...]
print(executions[0]["coverage"])  # [11, 12, 13, 14, 15, 16]

# Load variable states (states field is a JSON string -- parse it)
states = load_dataset("JetBrains-Research/REval", "states", split="test")
state_list = json.loads(states[0]["states"])
# Each state: {"lineno": 0, "locals": {"var": {"__type__": "int", "__value__": 1}}}

# Load task specifications (tasks field is a JSON string)
tasks = load_dataset("JetBrains-Research/REval", "tasks", split="test")
task_list = json.loads(tasks[0]["tasks"])
# Each task: {"input_idx": 0, "task": [{"lineno": 17, "var": "distance"}, ...], "output_pred": "..."}
```

## Data Fields

### `problems` config

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Unique identifier, e.g. `"DREval/0"` |
| `code` | string | Complete Python source code (signature + docstring + solution) |
| `entry_point` | string | Function or class name |
| `test` | string or null | Unittest code for ClassEval problems; null for HumanEval |
| `inputs` | list[string] | Test inputs as Python expressions |
| `outputs` | list[string] | Expected outputs as strings |

**Problem types:**

- **HumanEval** (idx 0--84): Standalone functions. `test` is null, `outputs` is non-empty.
- **ClassEval** (idx 85--153): OOP classes. `test` contains unittest code, `outputs` is empty.

### `tasks` config

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Unique identifier |
| `idx` | int | Problem index |
| `tasks` | string (JSON) | JSON-encoded list of per-input task definitions |

Each entry in the parsed `tasks` list contains:

- `input_idx` (int): Index into the problem's inputs/outputs arrays
- `task` (list[object]): Variable queries -- each has `lineno` (1-indexed) and `var` (variable name)
- `output_pred` (string): Output prediction template (e.g. `"assert func(args) == ??"`)

### `executions` config

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Problem identifier |
| `idx` | int | Problem index |
| `input_idx` | int | Which test input was used |
| `problem_type` | string | `"humaneval"` or `"classeval"` |
| `input` | string | The specific input expression |
| `expected_output` | string | Expected output |
| `actual_output` | string | Actual output from execution |
| `status` | string | `"ok"` or `"error"` |
| `trace` | list[int] | 0-indexed line execution sequence |
| `coverage` | list[int] | Sorted unique executed lines (0-indexed) |
| `num_states` | int | Number of state snapshots captured |
| `code_hash` | string | SHA-256 of the source code |
| `error` | string or null | Error message if `status="error"`, null otherwise |

### `states` config

| Field | Type | Description |
|-------|------|-------------|
| `task_id` | string | Problem identifier |
| `idx` | int | Problem index |
| `input_idx` | int | Which test input was used |
| `states` | string (JSON) | JSON-encoded list of state objects |

Each state object in the parsed list:

- `lineno` (int): 0-indexed line number
- `locals` (dict): Variable name to typed value envelope (`{"__type__": "int", "__value__": 42}`)
- `return` (optional): Return value in the same envelope format
- `exception` (optional): Exception info if one was raised

**Supported value types in envelopes:** `int`, `float`, `bool`, `str`, `NoneType`, `list`, `tuple`, `set`, `dict`, `Nil` (uninitialized), `numpy.ndarray`, `datetime.datetime`, and custom objects. Special float values: `"nan"`, `"inf"`, `"-inf"`.
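The typed value envelopes described for the `states` config can be unwrapped with a small helper. The following is a minimal sketch, not part of the dataset tooling: `decode_envelope`, `decode_states`, and `sample_states` are hypothetical names, the sample is hand-written rather than taken from the data, and the handling of `list`/`tuple` payloads assumes their `__value__` is a JSON list of envelopes or raw values, which this card does not confirm.

```python
import json


def decode_envelope(env):
    """Best-effort decoding of one typed value envelope into a Python value."""
    # Anything that is not an envelope is returned unchanged.
    if not (isinstance(env, dict) and "__type__" in env):
        return env
    kind, value = env["__type__"], env.get("__value__")
    if kind == "float":
        # float() also accepts the special strings "nan", "inf" and "-inf".
        return float(value)
    if kind == "NoneType":
        return None
    if kind in ("list", "tuple"):
        # ASSUMPTION: container payloads are JSON lists of envelopes or raw values.
        items = [decode_envelope(item) for item in value]
        return tuple(items) if kind == "tuple" else items
    # int, bool, str and every other type: return the payload as stored.
    return value


def decode_states(states_json):
    """Parse the JSON string from the `states` field and unwrap all locals."""
    return [
        {
            "lineno": state["lineno"],
            "locals": {name: decode_envelope(env) for name, env in state["locals"].items()},
        }
        for state in json.loads(states_json)
    ]


# Hand-written sample in the documented shape (not real dataset content).
sample_states = json.dumps([
    {"lineno": 0, "locals": {"x": {"__type__": "int", "__value__": 1}}},
    {"lineno": 1, "locals": {"x": {"__type__": "int", "__value__": 1},
                             "ratio": {"__type__": "float", "__value__": "inf"}}},
])

decoded = decode_states(sample_states)
print(decoded[0]["locals"])  # {'x': 1}
print(decoded[1]["locals"])  # {'x': 1, 'ratio': inf}
```

Types such as `set`, `dict`, `Nil`, `numpy.ndarray`, and custom objects are deliberately left as stored here, since their payload encoding is not specified above; inspect a few real records before extending the helper.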
## Line Number Conventions

- **`executions` config** (`trace`, `coverage`): **0-indexed** line numbers
- **`states` config** (`lineno`): **0-indexed** line numbers
- **`tasks` config** (`lineno` in task queries): **1-indexed** line numbers (for use in prompts)

## Known Issues

- Problems `DREval/117` and `DREval/149` import `gensim` (not included in dependencies). Their ground truth records have `status="error"` with empty traces.
- Arrows to the next executed lines for ClassEval do not take into account test code.

## Citation

If you use this dataset, please cite:

```bibtex
@inproceedings{chen2025reval,
  title     = {Reasoning Runtime Behavior of a Program with LLM: How Far Are We?},
  author    = {Junkai Chen and Zhiyuan Pan and Xing Hu and Zhenhao Li and Ge Li and Xin Xia},
  booktitle = {Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE)},
  year      = {2025},
  doi       = {10.1109/ICSE55347.2025.00087}
}
```

## Source

- **Repository:** [github.com/r-eval/REval](https://github.com/r-eval/REval/)
- **Paper:** [Reasoning Runtime Behavior of a Program with LLM: How Far Are We?](https://doi.org/10.1109/ICSE55347.2025.00087) (ICSE 2025)
- **License:** MIT
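As an appendix to the line number conventions in this card, the sketch below shows one way to move between the 0-indexed numbers in `trace`/`coverage` and the 1-indexed numbers used in task queries, and to derive Coverage-style and Path-style answers from a trace. All function names are hypothetical, the trace is a hand-written illustration, and treating "next line" as the set of observed successors is only one plausible reading of the Path task, not the benchmark's official scoring.

```python
def to_prompt_lineno(lineno0):
    """Convert a 0-indexed trace/coverage line to the 1-indexed prompt form."""
    return lineno0 + 1


def to_trace_lineno(lineno1):
    """Convert a 1-indexed task-query line to the 0-indexed trace form."""
    return lineno1 - 1


def coverage_from_trace(trace):
    """Sorted unique executed lines (0-indexed), as in the `coverage` field."""
    return sorted(set(trace))


def was_executed(trace, query_lineno1):
    """Coverage-style check for a 1-indexed query line against a 0-indexed trace."""
    return to_trace_lineno(query_lineno1) in set(trace)


def next_lines(trace, lineno0):
    """0-indexed lines observed immediately after `lineno0` anywhere in the trace."""
    return sorted({nxt for cur, nxt in zip(trace, trace[1:]) if cur == lineno0})


# Hand-written illustrative trace (0-indexed), not copied from the dataset.
trace = [11, 12, 13, 12, 13, 14, 15, 16]

print(coverage_from_trace(trace))                   # [11, 12, 13, 14, 15, 16]
print(was_executed(trace, 17))                      # True (17 is line 16 when 0-indexed)
print(next_lines(trace, 13))                        # [12, 14]
print([to_prompt_lineno(n) for n in next_lines(trace, 13)])  # [13, 15]
```

Keeping the conversion in two tiny named functions makes off-by-one mistakes easier to spot when joining `tasks` records (1-indexed) with `executions` or `states` records (0-indexed).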

Provider:

JetBrains-Research
