
KikoCis/real-world-agent-benchmark

Source: Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/KikoCis/real-world-agent-benchmark
Resource description:
# Real-World Agent Benchmark (RAB)

**4 autonomous Docker challenges that test what benchmarks can't: does your model actually work as an agent?**

Standard benchmarks (BFCL, HumanEval, SWE-bench) test narrow skills in isolation. RAB tests whether a model can autonomously solve complex, multi-step real-world tasks inside a Docker container: downloading data, writing scripts, debugging errors, and producing deliverables.

## The Benchmark Trap

We discovered that a model scoring **95% on BFCL** (Berkeley Function Calling Leaderboard) scores **0/10** on real autonomous tasks: it formats JSON perfectly but can't recover from errors, enters infinite loops, and never completes the task. Meanwhile, the same model's **unfine-tuned base** (80% BFCL) scores **6/10**.

This benchmark exists because standard metrics hide these critical failures.

## Challenges

| # | Domain | Task | Time Limit |
|---|--------|------|-----------|
| 1 | **Bioinformatics** | Download P53_HUMAN from UniProt, parse JSON, extract secondary structure + cancer mutations, generate HTML report | 30 min |
| 2 | **Security CTF** | Deploy DVWA, exploit SQL injection + XSS + command injection, document with proof | 30 min |
| 3 | **Data Engineering** | Download NYC taxi data (3M rows), clean, compute analytics, generate interactive Chart.js dashboard | 30 min |
| 4 | **DevOps** | Set up Flask app + Nginx reverse proxy + Prometheus monitoring + health check + status dashboard | 30 min |

## Results

| Model | Size | CH1 | CH2 | CH3 | CH4 | Total | Cost | Notes |
|-------|------|:---:|:---:|:---:|:---:|:-----:|------|-------|
| **Claude Opus 4.6** | Cloud | 9 | 0 | 9 | 10 | **28/40** | $4.72 | Pro-quality output, 3 services running in CH4 |
| **Gemma4 31B v6** | 16 GB | 10 | 1 | 6 | 10 | **27/40** | $0.00 | Local, 75% fewer tokens, 5x slower |
| Gemma4 31B Base q4 | 16 GB | 9 | - | - | - | - | $0.00 | No fine-tune, already capable |
| Gemma4 31B IQ2_M | 10 GB | 2 | - | - | - | - | $0.00 | Quantization destroys agent capability |
| Gemma4 E4B v5 | 4 GB | 7 | - | - | - | - | $0.00 | Reasoning fine-tune helps but attention too shallow |
| Gemma4 E4B Base | 3.8 GB | 6 | - | - | - | - | $0.00 | Better than fine-tuned E4B v3 |
| **Gemma4 E4B v3** | **4.5 GB** | **0** | - | - | - | - | $0.00 | **95% BFCL but 0/10 agent: The Benchmark Trap** |

### Quality-Adjusted Scores (manual review)

Automated scoring inflates results by detecting keywords without verifying correctness:

| Model | Auto Score | Real Score | Gap | Issue |
|-------|:---------:|:---------:|:---:|-------|
| Opus 4.6 | 28/40 | ~24/40 | -4 | CH1: data parsing bug, CH2: container permissions |
| 31B v6 | 27/40 | ~18/40 | -9 | CH1: same parsing bug, CH4: services not actually running |

## How It Works

Each model runs inside the same Docker container (`Ubuntu 24.04` with Python, Node.js, nginx, build tools). The model receives a task description and must:

1. Plan the approach
2. Execute commands (bash, write files, install packages)
3. Handle errors and adapt
4. Produce deliverables in `/results/`

### Automated Scoring

Each challenge has criteria scored 0-10 with anti-pattern detection:

- **Completion criteria**: files generated, data present, keywords found
- **Anti-patterns**: command repetition loops (-1 per loop detected), error blindness
- **Quality**: HTML report size, data correctness, visualization presence

### Running the Benchmark

```bash
# Build the Docker image
docker build -t agent-test-base -f agent_test/Dockerfile.claude agent_test/

# Run with any OpenAI-compatible API
python3 agent_test/agent_runner.py \
  --api-url http://localhost:8080/v1 \
  --model your-model \
  --prompt "..." \
  --results-dir /results \
  --timeout 1800

# Score results
python3 agent_test/score_challenge.py --challenge 1 --results-dir /tmp/results --model-name "your-model"
```

## Key Findings

1. **Benchmarks ≠ Agent capability.** 95% BFCL = 0% agent. Standard metrics are necessary but not sufficient.
2. **Model size matters for agency.** 4.5B params (42 layers, 8 attention heads) can't sustain multi-turn reasoning. 31B (60 layers, 16 heads) can.
3. **Fine-tuning can destroy capability.** BFCL-focused training eliminated error recovery and anti-repetition in E4B.
4. **Fine-tuning can also build capability.** The v6 dataset (43% multi-turn trajectories, 11% error recovery) improved 31B from 2/10 to 10/10 on CH1.
5. **Local 16 GB model achieves ~75% of Opus quality at $0 cost.**

## Citation

```
@misc{cisneros2026rab,
  title={Real-World Agent Benchmark: Why 95\% BFCL Produces a 0\% Agent},
  author={Cisneros, Kiko and Claude Opus 4.6},
  year={2026},
  publisher={Utopia IA},
  url={https://github.com/KikoCisBot/gemma4-31b-study}
}
```

## License

Apache 2.0
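
## Appendix: Anti-Pattern Penalty Sketch

The Automated Scoring section describes a penalty of -1 per detected command repetition loop, applied to a 0-10 score. The sketch below shows one way such a penalty could be computed. It is not taken from `score_challenge.py`: the function names, the `min_repeats` threshold, and the definition of a "loop" as the same command issued several times in a row are all assumptions made for illustration.

```python
from itertools import groupby


def count_repetition_loops(commands, min_repeats=3):
    """Count runs in which the same command is issued back to back.

    Assumption: a "loop" is `min_repeats` or more identical consecutive
    commands, compared after collapsing whitespace. The benchmark's real
    detector may use a different definition.
    """
    normalized = (" ".join(cmd.split()) for cmd in commands)
    return sum(
        1 for _, run in groupby(normalized) if len(list(run)) >= min_repeats
    )


def apply_loop_penalty(base_score, commands, min_repeats=3):
    """Subtract 1 point per detected loop and clamp the result to 0-10."""
    loops = count_repetition_loops(commands, min_repeats)
    return max(0, min(10, base_score - loops))


if __name__ == "__main__":
    # Hypothetical command log from one agent run.
    log = [
        "pip install requests",
        "python fetch.py",
        "python fetch.py",
        "python fetch.py",
        "cat out.json",
        "python  fetch.py",
    ]
    print(count_repetition_loops(log))  # one run of three identical commands
    print(apply_loop_penalty(8, log))   # 8 minus one loop penalty
```

Counting only consecutive repeats keeps the check simple and avoids penalizing an agent that legitimately reruns a script after editing it. A stricter detector could also look for repeating cycles of two or more commands.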
Provider:
KikoCis