sripad17/agentopt-benchmark-cache
收藏Hugging Face2026-04-01 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/sripad17/agentopt-benchmark-cache
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
tags:
- benchmark
- model-selection
- bedrock
- llm-evaluation
size_categories:
- 10K<n<100K
---
# AgentOpt Benchmark Cache
SQLite cache of all AWS Bedrock API calls from the AgentOpt benchmark evaluation suite. Enables full replay of benchmark results with zero API calls.
## What's Inside
~70,000 cached API responses across 4 benchmarks and 9 models, plus thinking ablation runs.
| Benchmark | Samples | Model Combos | Total Entries | Description |
|---|---|---|---|---|
| GPQA Diamond | 198 | 9 (1-tuple) | ~1,782 | Graduate-level science QA (A/B/C/D) |
| BFCL | 200 | 9 (1-tuple) | ~1,800 | Multi-turn function calling |
| HotpotQA | 200 | 81 (2-tuple: planner × solver) | ~16,200 | Multi-hop QA with planning |
| MathQA | 200 | 81 (2-tuple: answer × critic) | ~16,200 | Self-reflective math QA |
| GPQA Thinking Ablation | 198 | 8 configs (Opus + Haiku 4.5) | ~1,584 | Thinking effort impact study |
## Models Evaluated
All models accessed via AWS Bedrock Application Inference Profiles:
| Model | Provider | Input $/MTok | Output $/MTok |
|---|---|---|---|
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 |
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 |
| gpt-oss-20b | OpenAI | $0.22 | $0.88 |
| gpt-oss-120b | OpenAI | $1.20 | $4.80 |
| Kimi K2.5 | MoonshotAI | $0.35 | $1.40 |
| Ministral 3 8B | Mistral | $0.04 | $0.04 |
| Qwen3 32B | Qwen | $0.17 | $0.85 |
| Qwen3 Next 80B A3B | Qwen | $0.25 | $1.25 |
## How to Use
1. Download `cache.db` and place it at `agentopt/.agentopt_cache/cache.db`
2. Run benchmarks with `LLMTracker(cache=True)` — all API calls will replay from cache instantly
```python
from agentopt import LLMTracker
tracker = LLMTracker(cache=True, cache_dir="agentopt/.agentopt_cache")
tracker.start()
# Run any benchmark — all Bedrock calls will be served from cache
# No AWS credentials or API calls needed
```
Or use the cache selector simulator directly:
```bash
python cache_selector_sim.py --benchmark gpqa --selectors all --seeds 50
```
## Schema
Single table `cache` with two columns:
| Column | Type | Description |
|---|---|---|
| `key` | TEXT (PRIMARY KEY) | SHA-256 hash of the canonical request body |
| `data_json` | TEXT | JSON containing: `response_bytes_b64` (base64-encoded full Bedrock response), `response_headers`, `latency_seconds` (original wall time), `request_body` (full request including messages, model ARN, inference config) |
Each cached response includes:
- Full model output (text + reasoning/thinking content blocks)
- `usage` (inputTokens, outputTokens, totalTokens)
- `metrics.latencyMs` (server-side processing time)
- `stopReason` (end_turn, max_tokens, content_filtered)
## Date Collected
March 2026, using AWS Bedrock on-demand inference in us-east-1.
## Associated Repository
[github.com/AgentOptimizer/agentopt](https://github.com/AgentOptimizer/agentopt)
提供机构:
sripad17



