ASSERT-KTH/agentic-evals-artifacts

Name: ASSERT-KTH/agentic-evals-artifacts
Creator: ASSERT-KTH
Published: 2026-03-20 16:35:56
License: cc-by-4.0

Source: Hugging Face. Updated: 2026-03-20. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/ASSERT-KTH/agentic-evals-artifacts

Description:

---
license: cc-by-4.0
language:
- en
- code
tags:
- swe-bench
- code-repair
- agents
- evaluation
pretty_name: "On Randomness in Agentic Evals — Artifacts"
arxiv: 2602.07150
---

# On Randomness in Agentic Evals: Results

This dataset contains the trajectory and evaluation results from the paper [On Randomness in Agentic Evals](https://arxiv.org/abs/2602.07150).

Agents are benchmarked on [SWE-bench Verified](https://www.swebench.com/) across different scaffolds, models, and temperatures, with 10 independent runs per setting to enable pass@k and variance analysis.

## Downloading the Data

**Option 1 (HuggingFace CLI):**

```bash
pip install huggingface-hub
huggingface-cli download ASSERT-KTH/agentic-evals-artifacts --repo-type dataset --local-dir .
```

**Option 2 (Python):**

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ASSERT-KTH/agentic-evals-artifacts",
    repo_type="dataset",
    local_dir=".",
)
```

**Option 3 (Git, requires [git-lfs](https://git-lfs.com)):**

```bash
git lfs install
git clone https://huggingface.co/datasets/ASSERT-KTH/agentic-evals-artifacts
```

## Directory Structure

```
{scaffold}-{model}/                # e.g. nano-agent-Qwen_Qwen3-32B
{scaffold}-{model}__temp0/         # same model, temperature=0 (deterministic)
    {run_dir}/                     # e.g. run_0, run_1, ... (10 runs per setting)
        <trajectories>             # scaffold-specific JSONL (see below)
        <results>.json             # SWE-bench evaluation results
```

**Top-level naming convention:**

- `{scaffold}`: the agent framework, either `nano-agent` or `r2e-gym`
- `{model}`: HuggingFace model ID with `/` replaced by `_` (e.g. `Qwen_Qwen3-32B`)
- `__temp0` suffix: runs at temperature 0 (greedy decoding); absent means temperature 0.6

## File Formats

### nano-agent runs

Each `run_N/` directory contains:

| File | Description |
|------|-------------|
| `detailed_predictions.jsonl` | One record per instance. Contains full prompt/completion messages, the generated patch, exit reason, and token usage. |
| `preds.jsonl` | Lightweight predictions file (instance_id + patch). |
| `*.json` | SWE-bench evaluation results (see below). |

### r2e-gym runs

Each run directory (named `traj_{model}_run_N/`) contains:

| File | Description |
|------|-------------|
| `*.jsonl` (trajectories) | One record per instance. Contains `trajectory_steps` (thought, action, observation, token counts), `output_patch`, and `reward`. |
| `*_swebv_eval_*.json` | SWE-bench evaluation results (see below). |
| `*.json` (predictions) | Raw patch predictions (`instance_id`, `model_patch`). |

### SWE-bench results JSON

The `*_swebv_eval_*.json` files follow the standard SWE-bench harness output format:

```json
{
  "resolved_ids": ["django__django-10880", ...],
  "unresolved_ids": [...],
  "resolved_instances": 42,
  "total_instances": 500,
  ...
}
```

## Models and Scaffolds

| Directory prefix | Scaffold | Model |
|-----------------|----------|-------|
| `nano-agent-Qwen_Qwen3-32B` | nano-agent | Qwen/Qwen3-32B |
| `nano-agent-mistral_devstral-2512` | nano-agent | mistral/devstral-2512 |
| `nano-agent-agentica-org_DeepSWE-Preview` | nano-agent | agentica-org/DeepSWE-Preview |
| `r2e-gym-Qwen_Qwen3-32B` | r2e-gym | Qwen/Qwen3-32B |
| `r2e-gym-mistral_devstral-2512` | r2e-gym | mistral/devstral-2512 |
| `r2e-gym-agentica-org__DeepSWE-preview` | r2e-gym | agentica-org/DeepSWE-Preview |

## Citation

If you use this data, please cite:

```bibtex
@article{bjarnason2026randomness,
  title={On Randomness in Agentic Evals},
  author={Bjarnason, Bjarni Haukur and Silva, Andr{\'e} and Monperrus, Martin},
  journal={arXiv preprint arXiv:2602.07150},
  year={2026}
}
```
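
## Example: Parsing Settings and Estimating pass@k

The directory naming convention and the results JSON described above could be consumed roughly as follows. This is a minimal sketch, not the authors' analysis code. The helper names `parse_setting`, `pass_at_k`, and `per_instance_pass_at_k` are hypothetical. The pass@k formula is the commonly used unbiased estimator `1 - C(n-c, k) / C(n, k)`, which is an assumption here and may differ from what the paper computes. The sketch also assumes that an instance missing from a run's `resolved_ids` counts as unsolved in that run.

```python
from collections import Counter
from math import comb

# Scaffold names as listed in the dataset card.
SCAFFOLDS = ("nano-agent", "r2e-gym")


def parse_setting(dirname: str) -> dict:
    """Split a top-level directory name into scaffold, model part, and temperature."""
    temperature = 0.6  # the card states 0.6 when the __temp0 suffix is absent
    if dirname.endswith("__temp0"):
        dirname = dirname[: -len("__temp0")]
        temperature = 0.0
    for scaffold in SCAFFOLDS:
        prefix = scaffold + "-"
        if dirname.startswith(prefix):
            # The model part is kept verbatim: mapping "_" back to "/" is
            # ambiguous, so no attempt is made to recover the model ID.
            return {
                "scaffold": scaffold,
                "model_dir": dirname[len(prefix):],
                "temperature": temperature,
            }
    raise ValueError(f"unrecognized scaffold in {dirname!r}")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n runs of which c succeeded (assumed estimator)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def per_instance_pass_at_k(run_results: list[dict], k: int) -> dict[str, float]:
    """Compute pass@k per instance from one parsed results JSON per run."""
    n = len(run_results)
    solved = Counter()
    instances = set()
    for result in run_results:
        resolved = set(result["resolved_ids"])
        instances |= resolved | set(result["unresolved_ids"])
        solved.update(resolved)
    return {iid: pass_at_k(n, solved[iid], k) for iid in sorted(instances)}
```

In practice, each element of `run_results` would come from `json.load` on the results file found in one run directory. The mean of the returned values gives a benchmark-level pass@k for that setting.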
