KikoCis/real-world-agent-benchmark
收藏Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/KikoCis/real-world-agent-benchmark
下载链接
链接失效反馈官方服务:
资源简介:
# Real-World Agent Benchmark (RAB)
**4 autonomous Docker challenges that test what benchmarks can't: does your model actually work as an agent?**
Standard benchmarks (BFCL, HumanEval, SWE-bench) test narrow skills in isolation. RAB tests whether a model can autonomously solve complex, multi-step real-world tasks inside a Docker container — downloading data, writing scripts, debugging errors, and producing deliverables.
## The Benchmark Trap
We discovered that a model scoring **95% on BFCL** (Berkeley Function Calling Leaderboard) scores **0/10** on real autonomous tasks — it formats JSON perfectly but can't recover from errors, enters infinite loops, and never completes the task. Meanwhile, the same model's **unfine-tuned base** (80% BFCL) scores **6/10**.
This benchmark exists because standard metrics hide these critical failures.
## Challenges
| # | Domain | Task | Time Limit |
|---|--------|------|-----------|
| 1 | **Bioinformatics** | Download P53_HUMAN from UniProt, parse JSON, extract secondary structure + cancer mutations, generate HTML report | 30 min |
| 2 | **Security CTF** | Deploy DVWA, exploit SQL injection + XSS + command injection, document with proof | 30 min |
| 3 | **Data Engineering** | Download NYC taxi data (3M rows), clean, compute analytics, generate interactive Chart.js dashboard | 30 min |
| 4 | **DevOps** | Set up Flask app + Nginx reverse proxy + Prometheus monitoring + health check + status dashboard | 30 min |
## Results
| Model | Size | CH1 | CH2 | CH3 | CH4 | Total | Cost | Notes |
|-------|------|:---:|:---:|:---:|:---:|:-----:|------|-------|
| **Claude Opus 4.6** | Cloud | 9 | 0 | 9 | 10 | **28/40** | $4.72 | Pro-quality output, 3 services running in CH4 |
| **Gemma4 31B v6** | 16 GB | 10 | 1 | 6 | 10 | **27/40** | $0.00 | Local, 75% fewer tokens, 5x slower |
| Gemma4 31B Base q4 | 16 GB | 9 | — | — | — | — | $0.00 | No fine-tune, already capable |
| Gemma4 31B IQ2_M | 10 GB | 2 | — | — | — | — | $0.00 | Quantization destroys agent capability |
| Gemma4 E4B v5 | 4 GB | 7 | — | — | — | — | $0.00 | Reasoning fine-tune helps but attention too shallow |
| Gemma4 E4B Base | 3.8 GB | 6 | — | — | — | — | $0.00 | Better than fine-tuned E4B v3 |
| **Gemma4 E4B v3** | **4.5 GB** | **0** | — | — | — | — | $0.00 | **95% BFCL but 0/10 agent — The Benchmark Trap** |
### Quality-Adjusted Scores (manual review)
Automated scoring inflates results by detecting keywords without verifying correctness:
| Model | Auto Score | Real Score | Gap | Issue |
|-------|:---------:|:---------:|:---:|-------|
| Opus 4.6 | 28/40 | ~24/40 | -4 | CH1: data parsing bug, CH2: container permissions |
| 31B v6 | 27/40 | ~18/40 | -9 | CH1: same parsing bug, CH4: services not actually running |
## How It Works
Each model runs inside the same Docker container (`Ubuntu 24.04` with Python, Node.js, nginx, build tools). The model receives a task description and must:
1. Plan the approach
2. Execute commands (bash, write files, install packages)
3. Handle errors and adapt
4. Produce deliverables in `/results/`
### Automated Scoring
Each challenge has criteria scored 0-10 with anti-pattern detection:
- **Completion criteria**: files generated, data present, keywords found
- **Anti-patterns**: command repetition loops (-1 per loop detected), error blindness
- **Quality**: HTML report size, data correctness, visualization presence
### Running the Benchmark
```bash
# Build the Docker image
docker build -t agent-test-base -f agent_test/Dockerfile.claude agent_test/
# Run with any OpenAI-compatible API
python3 agent_test/agent_runner.py \
--api-url http://localhost:8080/v1 \
--model your-model \
--prompt "..." \
--results-dir /results \
--timeout 1800
# Score results
python3 agent_test/score_challenge.py --challenge 1 --results-dir /tmp/results --model-name "your-model"
```
## Key Findings
1. **Benchmarks ≠ Agent capability.** 95% BFCL = 0% agent. Standard metrics are necessary but not sufficient.
2. **Model size matters for agency.** 4.5B params (42 layers, 8 attention heads) can't sustain multi-turn reasoning. 31B (60 layers, 16 heads) can.
3. **Fine-tuning can destroy capability.** BFCL-focused training eliminated error recovery and anti-repetition in E4B.
4. **Fine-tuning can also build capability.** The v6 dataset (43% multi-turn trajectories, 11% error recovery) improved 31B from 2/10 to 10/10 on CH1.
5. **Local 16 GB model achieves ~75% of Opus quality at $0 cost.**
## Citation
```
@misc{cisneros2026rab,
title={Real-World Agent Benchmark: Why 95\% BFCL Produces a 0\% Agent},
author={Cisneros, Kiko and Claude Opus 4.6},
year={2026},
publisher={Utopia IA},
url={https://github.com/KikoCisBot/gemma4-31b-study}
}
```
## License
Apache 2.0
提供机构:
KikoCis



