Turbulence_Benchmark-v2

Name: Turbulence_Benchmark-v2
Creator: Shahin Honarvar; Alastair Donaldson
License: No description available

Source: IEEE; indexed 2026-04-17

Download link:

https://ieee-dataport.org/documents/turbulencebenchmark-v2

Description:

Turbulence is a new benchmark and automated testing framework, based on the question neighbourhood approach, for systematically evaluating instruction-tuned large language models (LLMs) for code generation. It measures accuracy (the overall rate of correctness across all generated outputs), correctness potential (whether the LLM produces at least one correct output for a given input), and consistent correctness (the model's ability to consistently produce correct outputs for the same input across successive generations).

Turbulence consists of a large set of natural language question templates, each a parameterised programming problem that can be instantiated in many different forms. Each template is paired with a test oracle that determines whether a code solution returned by an LLM is correct. This allows the generation of a neighbourhood of closely related questions from a single template, enabling fine-grained assessment of model behaviour across similar tasks.

Turbulence systematically identifies cases where an LLM can solve some instances within a neighbourhood but fails to generalise across the entire set. By employing accuracy, correctness potential, and consistent correctness as core metrics, Turbulence provides a structured methodology to reveal model weaknesses and offers a more nuanced characterisation of LLM behaviour in structured problem spaces.

This version (v2) is a direct update to version 1. Version 2 adds two new metrics and rigorous statistical analysis to the source code.
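The relationship between a question template, its test oracle, and the three metrics can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the template text, the names `instantiate`, `oracle`, `accuracy`, `correctness_potential`, and `consistent_correctness`, and the stand-in "LLM outputs" are hypothetical and are not taken from the Turbulence source code, whose exact metric formulas may differ.

```python
from typing import Callable, Dict, List

# Hypothetical question template; {n} is the parameter that spans the neighbourhood.
TEMPLATE = ("Write a function f(xs) that returns the {n} largest values "
            "of the list xs in descending order.")

def instantiate(n: int) -> str:
    """Produce one concrete question from the template."""
    return TEMPLATE.format(n=n)

def oracle(n: int, candidate: Callable[[List[int]], List[int]]) -> bool:
    """Hypothetical test oracle: compare a candidate with a reference on fixed inputs."""
    cases = [[3, 1, 2], [5, 5, 1, 9], list(range(10))]
    try:
        return all(candidate(list(xs)) == sorted(xs, reverse=True)[:n] for xs in cases)
    except Exception:
        return False  # a crashing solution counts as incorrect

def accuracy(results: Dict[str, List[bool]]) -> float:
    """Fraction of correct outputs over all generations of all questions."""
    flat = [ok for outcomes in results.values() for ok in outcomes]
    return sum(flat) / len(flat)

def correctness_potential(results: Dict[str, List[bool]]) -> Dict[str, bool]:
    """Per question: was at least one generation correct?"""
    return {q: any(outcomes) for q, outcomes in results.items()}

def consistent_correctness(results: Dict[str, List[bool]]) -> Dict[str, bool]:
    """Per question: were all generations correct?"""
    return {q: all(outcomes) for q, outcomes in results.items()}

# Stand-ins for two successive LLM generations per question (not real model output).
def correct_solution(n: int) -> Callable[[List[int]], List[int]]:
    return lambda xs: sorted(xs, reverse=True)[:n]

def buggy_solution(n: int) -> Callable[[List[int]], List[int]]:
    return lambda xs: [max(xs)]  # ignores n, so it is only right when n == 1

results = {
    instantiate(n): [oracle(n, correct_solution(n)), oracle(n, buggy_solution(n))]
    for n in (1, 2, 3)
}

print(accuracy(results))
print(correctness_potential(results))
print(consistent_correctness(results))
```

In this toy neighbourhood, every question has correctness potential, but only the `n = 1` instance is solved consistently. That is the kind of gap between "can solve some instances" and "generalises across the neighbourhood" that the description above refers to.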

Provided by:

Shahin Honarvar; Alastair Donaldson
