
ASSERT-KTH/agentic-evals-artifacts

Source: Hugging Face | Updated 2026-03-20 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/ASSERT-KTH/agentic-evals-artifacts
Resource description:
---
license: cc-by-4.0
language:
- en
- code
tags:
- swe-bench
- code-repair
- agents
- evaluation
pretty_name: "On Randomness in Agentic Evals — Artifacts"
arxiv: 2602.07150
---

# On Randomness in Agentic Evals: Results

This dataset contains the trajectories and evaluation results from the paper [On Randomness in Agentic Evals](https://arxiv.org/abs/2602.07150).

Agents are benchmarked on [SWE-bench Verified](https://www.swebench.com/) across different scaffolds, models, and temperatures, with 10 independent runs per setting to enable pass@k and variance analysis.

## Downloading the Data

**Option 1: HuggingFace CLI**

```bash
pip install huggingface-hub
huggingface-cli download ASSERT-KTH/agentic-evals-artifacts --repo-type dataset --local-dir .
```

**Option 2: Python**

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ASSERT-KTH/agentic-evals-artifacts",
    repo_type="dataset",
    local_dir=".",
)
```

**Option 3: Git (requires [git-lfs](https://git-lfs.com))**

```bash
git lfs install
git clone https://huggingface.co/datasets/ASSERT-KTH/agentic-evals-artifacts
```

## Directory Structure

```
{scaffold}-{model}/                # e.g. nano-agent-Qwen_Qwen3-32B
{scaffold}-{model}__temp0/         # same model, temperature=0 (deterministic)
  {run_dir}/                       # e.g. run_0, run_1, ... (10 runs per setting)
    <trajectories>                 # scaffold-specific JSONL (see below)
    <results>.json                 # SWE-bench evaluation results
```

**Top-level naming convention:**

- `{scaffold}`: the agent framework, either `nano-agent` or `r2e-gym`
- `{model}`: HuggingFace model ID with `/` replaced by `_` (e.g. `Qwen_Qwen3-32B`)
- `__temp0` suffix: runs at temperature 0 (greedy decoding); absent means temperature 0.6

## File Formats

### nano-agent runs

Each `run_N/` directory contains:

| File | Description |
|------|-------------|
| `detailed_predictions.jsonl` | One record per instance. Contains full prompt/completion messages, the generated patch, exit reason, and token usage. |
| `preds.jsonl` | Lightweight predictions file (instance_id + patch). |
| `*.json` | SWE-bench evaluation results (see below). |

### r2e-gym runs

Each run directory (named `traj_{model}_run_N/`) contains:

| File | Description |
|------|-------------|
| `*.jsonl` (trajectories) | One record per instance. Contains `trajectory_steps` (thought, action, observation, token counts), `output_patch`, and `reward`. |
| `*_swebv_eval_*.json` | SWE-bench evaluation results (see below). |
| `*.json` (predictions) | Raw patch predictions (`instance_id`, `model_patch`). |

### SWE-bench results JSON

The `*_swebv_eval_*.json` files follow the standard SWE-bench harness output format:

```json
{
  "resolved_ids": ["django__django-10880", ...],
  "unresolved_ids": [...],
  "resolved_instances": 42,
  "total_instances": 500,
  ...
}
```

## Models and Scaffolds

| Directory prefix | Scaffold | Model |
|-----------------|----------|-------|
| `nano-agent-Qwen_Qwen3-32B` | nano-agent | Qwen/Qwen3-32B |
| `nano-agent-mistral_devstral-2512` | nano-agent | mistral/devstral-2512 |
| `nano-agent-agentica-org_DeepSWE-Preview` | nano-agent | agentica-org/DeepSWE-Preview |
| `r2e-gym-Qwen_Qwen3-32B` | r2e-gym | Qwen/Qwen3-32B |
| `r2e-gym-mistral_devstral-2512` | r2e-gym | mistral/devstral-2512 |
| `r2e-gym-agentica-org__DeepSWE-preview` | r2e-gym | agentica-org/DeepSWE-Preview |

## Citation

If you use this data, please cite:

```bibtex
@article{bjarnason2026randomness,
  title={On Randomness in Agentic Evals},
  author={Bjarnason, Bjarni Haukur and Silva, Andr{\'e} and Monperrus, Martin},
  journal={arXiv preprint arXiv:2602.07150},
  year={2026}
}
```
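
## Example: Per-Run Resolve Rate and pass@k

The results files described above can be combined across the runs of one setting. The following is a minimal sketch, not the paper's analysis code. It assumes that each run directory holds exactly one JSON file with a `resolved_ids` key (for nano-agent runs this is an assumption, since only the `*_swebv_eval_*.json` files are documented as using that format), and it uses the commonly cited unbiased pass@k estimator, `1 - C(n-c, k) / C(n, k)`, which may differ from the estimator used in the paper. The function names `load_resolved_sets`, `pass_at_k`, and `summarize` are hypothetical helpers introduced for illustration.

```python
import json
from math import comb
from pathlib import Path


def load_resolved_sets(setting_dir):
    """Return one set of resolved instance IDs per run directory.

    Assumption: the evaluation report is the only JSON file in a run
    directory that is an object with a "resolved_ids" key.
    """
    runs = []
    for run_dir in sorted(p for p in Path(setting_dir).iterdir() if p.is_dir()):
        for json_path in sorted(run_dir.glob("*.json")):
            with open(json_path) as f:
                data = json.load(f)
            # Prediction files may also match *.json; skip anything
            # that is not a harness report.
            if isinstance(data, dict) and "resolved_ids" in data:
                runs.append(set(data["resolved_ids"]))
                break
    return runs


def pass_at_k(n, c, k):
    """Unbiased pass@k for one instance: n runs, c of them successful."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(runs, total_instances, k):
    """Return (per-run resolve rates, mean pass@k over all instances)."""
    n = len(runs)
    if n == 0 or k > n:
        raise ValueError("need at least k runs")
    # Count in how many runs each instance was resolved.
    counts = {}
    for resolved in runs:
        for instance_id in resolved:
            counts[instance_id] = counts.get(instance_id, 0) + 1
    # Instances never resolved contribute 0 to the sum.
    mean_pass_k = sum(pass_at_k(n, c, k) for c in counts.values()) / total_instances
    rates = [len(resolved) / total_instances for resolved in runs]
    return rates, mean_pass_k
```

After downloading the dataset into the current directory, a call such as `summarize(load_resolved_sets("nano-agent-Qwen_Qwen3-32B"), total_instances=500, k=5)` would return the resolve rate of each run and the mean pass@5 for that setting. The value 500 is taken from the `total_instances` field in the example report above. The spread of the per-run rates gives a first look at run-to-run variance.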
Provider:
ASSERT-KTH