Felix6326727/beyond-accuracy-code-comprehension

Name: Felix6326727/beyond-accuracy-code-comprehension
Creator: Felix6326727
Published: 2026-03-24 12:24:14
License: no description provided

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Felix6326727/beyond-accuracy-code-comprehension

Description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- code
tags:
- code-comprehension
- llm-evaluation
- software-metrics
- input-output-prediction
- code-understanding
- benchmark
pretty_name: "Beyond Accuracy: Code Comprehension"
size_categories:
- 10K<n<100K
dataset_info:
  features:
  - name: sample_id
    dtype: string
  - name: code
    dtype: string
  - name: genome
    dtype: string
  - name: io_pairs
    dtype: string
  - name: correct_io_input
    dtype: string
  - name: correct_io_output
    dtype: string
  - name: incorrect_io_input
    dtype: string
  - name: incorrect_io_output
    dtype: string
  splits:
  - name: test
    num_examples: 12584
---

# Beyond Accuracy: Code Comprehension Dataset

Dataset for the paper **"Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models"** by Machtle, Serr, Loose & Eisenbarth (University of Luebeck).

[[Paper]](https://arxiv.org/abs/2601.12951) | [[Code]](https://github.com/UzL-ITS/code-comprehension-capabilities-llms/)

## Task

**Binary I/O consistency**: given a Python program `p`, an input `x`, and a candidate output `y`, determine whether `y` is the correct output of running `p(x)`.

Each sample contains a **correct** I/O pair (label=1) and an **incorrect** I/O pair (label=0). Incorrect pairs are generated via *in-program shuffling*: an input is paired with the output of a *different* input to the same program. The wrong output therefore keeps the lexical and stylistic characteristics of a real output while being semantically incorrect.
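The in-program shuffling idea can be illustrated with a minimal sketch. This is not the authors' `export_io.py`; the helper name `shuffle_in_program` and the selection strategy (take the first two pairs whose outputs differ) are assumptions made for illustration. It also assumes that `io_pairs` decodes to a list of two-element `[input, output]` lists, as the column description suggests.

```python
import json


def shuffle_in_program(io_pairs):
    """Build one correct and one incorrect example from a program's I/O pairs.

    Hypothetical helper for illustration only. Returns None when every
    input produces the same output, since no wrong pairing exists then.
    """
    for i, (x_i, y_i) in enumerate(io_pairs):
        for j, (x_j, y_j) in enumerate(io_pairs):
            # Need two different pairs whose outputs actually differ.
            if i != j and y_i != y_j:
                correct = (x_i, y_i, 1)    # true pairing, label 1
                incorrect = (x_j, y_i, 0)  # x_j really maps to y_j, label 0
                return correct, incorrect
    return None


# Toy stand-in for the JSON string stored in the `io_pairs` column.
raw = json.dumps([["1\n", "2\n"], ["3\n", "6\n"]])
result = shuffle_in_program(json.loads(raw))
print(result)
```

Because the wrong output is a genuine output of the same program, a model cannot reject it on formatting alone and has to reason about what the program does with the given input.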

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("Felix6326727/beyond-accuracy-code-comprehension", split="test")

sample = ds[0]
print(sample["code"][:200])
print(f"Correct: {sample['correct_io_input']!r} -> {sample['correct_io_output']!r}")
print(f"Incorrect: {sample['incorrect_io_input']!r} -> {sample['incorrect_io_output']!r}")
print(f"GPT-OSS 120B success: {sample['llm_gpt_oss_120b_success']}")
print(f"Cyclomatic complexity: {sample['metric_cyclomatic_complexity']}")
```

## Dataset Summary

| Property | Value |
|---|---|
| **Columns** | 249 |
| **Source** | Python subset of [Project CodeNet](https://github.com/IBM/Project_CodeNet) |
| **I/O generation** | Type-aware fuzzing with hill-climbing type inference |
| **Models evaluated** | 5 LLMs |
| **Code metrics** | 224 static analysis features |

## Column Groups

### Core Columns (10)

| Column | Type | Description |
|---|---|---|
| `sample_id` | string | Unique identifier (`{problem_id}.{solution_id}`) |
| `code` | string | Python source code |
| `source_file` | string | Original CodeNet file path |
| `genome` | string | Inferred input type signature (e.g. `"is"` = integer + string) |
| `io_pairs` | string | JSON array of all generated `[input, output]` pairs |
| `num_io_pairs` | int | Number of I/O pairs generated |
| `correct_io_input` | string | Input for the correct I/O test case |
| `correct_io_output` | string | Expected output (ground truth) |
| `incorrect_io_input` | string | Input for the incorrect I/O test case |
| `incorrect_io_output` | string | Shuffled (wrong) output |

### LLM Evaluation Columns (15)

Per-model results from the binary I/O consistency evaluation.
Each model has 3 columns:

| Column pattern | Type | Description |
|---|---|---|
| `llm_{model}_success` | bool | `True` if the model answered all test cases correctly for this sample |
| `llm_{model}_num_correct` | int | Number of test cases answered correctly (out of `num_total`) |
| `llm_{model}_num_total` | int | Total test cases for this sample (typically 2: one correct, one incorrect) |

### Code Metric Columns (224)

All prefixed with `metric_`. Values are floats (or null if unavailable).

**Size & Complexity (67 columns)**: includes `cyclomatic_complexity`, `loc`, `sloc`, `lloc`, `maintainability_index`, `code_length`, Halstead metrics (`h1`, `h2`, `N1`, `N2`, `vocabulary`, `length`, `volume`, `difficulty`, `effort`, `bugs`), `num_branches`, `num_loops`, `num_identifiers`, `num_literals`, `num_data_flows`, `parameter_count`, `variable_count`, and more.

**AST / Graph Structure (118 columns)**: `metric_graph_nodes_*` columns counting occurrences of each AST node type: `if_statement`, `for_statement`, `call`, `assignment`, `binary_operator`, `identifier`, `block`, etc. Also includes graph-level metrics: `num_nodes`, `num_edges`, `density`, `diameter`, `average_shortest_path_length`, `average_clustering`.

**Opcode Statistics (39 columns)**: Python bytecode features: `num_opcodes`, `sum_opcodes`, `avg_opcode_count`, `min_opcode_count`, `max_opcode_count`, individual opcode counts (`opcode_1`, `opcode_83`, ...), `opcodes_used0` through `opcodes_used3`, and `top_0_opcode_name` through `top_19_opcode_name`.

## Data Generation Pipeline

```
Python files (CodeNet)
        |
        v
hill_climb.py ─── infer input types ("genome") via coverage-guided search
        |
        v
fuzzer.py ──────── generate & shrink minimal I/O pairs
        |
        v
export_io.py ───── create correct + incorrect (shuffled) I/O pairs
        |
        v
   This dataset
```

See the [GitHub repository](https://github.com/UzL-ITS/code-comprehension-capabilities-llms/) for the full pipeline code.
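
Once the dataset is loaded, the `llm_{model}_*` columns described under Column Groups can be aggregated into per-model scores. The following is a minimal sketch, not part of the published pipeline: the helper name `per_model_accuracy` is hypothetical, model names are discovered from the column names instead of being hard-coded, and the three LLM columns are assumed to be non-null for every row. It accepts any iterable of dict-like rows, which should include the `ds` object from Quick Start.

```python
import re
from collections import defaultdict

# Matches e.g. "llm_gpt_oss_120b_success" and captures "gpt_oss_120b".
SUCCESS_COLUMN = re.compile(r"^llm_(?P<model>.+)_success$")


def per_model_accuracy(rows):
    """Return {model: (sample_success_rate, test_case_accuracy)}.

    Hypothetical helper; assumes the llm_* columns contain no nulls.
    """
    solved = defaultdict(int)   # samples where every test case was answered correctly
    correct = defaultdict(int)  # test cases answered correctly
    total = defaultdict(int)    # test cases asked
    n_rows = 0
    for row in rows:
        n_rows += 1
        for column in row:
            match = SUCCESS_COLUMN.match(column)
            if match is None:
                continue
            model = match.group("model")
            solved[model] += bool(row[column])
            correct[model] += row[f"llm_{model}_num_correct"]
            total[model] += row[f"llm_{model}_num_total"]
    # Skip models with zero recorded test cases to avoid dividing by zero.
    return {
        model: (solved[model] / n_rows, correct[model] / total[model])
        for model in total
        if total[model] > 0
    }


# Toy rows with a made-up model name, standing in for real dataset rows.
toy_rows = [
    {"llm_toy_success": True, "llm_toy_num_correct": 2, "llm_toy_num_total": 2},
    {"llm_toy_success": False, "llm_toy_num_correct": 1, "llm_toy_num_total": 2},
]
scores = per_model_accuracy(toy_rows)
print(scores)
```

The two numbers differ by design: the sample-level rate counts a sample only when both the correct and the incorrect pair are judged correctly, while the test-case accuracy gives partial credit.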

## Citation

```bibtex
@article{machtle2025beyond,
  title={Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models},
  author={Machtle, Felix and Serr, Jan-Niclas and Loose, Nils and Eisenbarth, Thomas},
  journal={arXiv preprint arXiv:2601.12951},
  year={2025}
}
```

## License

Derived from [Project CodeNet](https://github.com/IBM/Project_CodeNet) (Apache 2.0).
