
sripad17/agentopt-benchmark-cache

Source: Hugging Face. Updated 2026-04-01. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/sripad17/agentopt-benchmark-cache
Resource description:
---
license: apache-2.0
task_categories:
- question-answering
- text-generation
tags:
- benchmark
- model-selection
- bedrock
- llm-evaluation
size_categories:
- 10K<n<100K
---

# AgentOpt Benchmark Cache

SQLite cache of all AWS Bedrock API calls from the AgentOpt benchmark evaluation suite. Enables full replay of benchmark results with zero API calls.

## What's Inside

~70,000 cached API responses across 4 benchmarks and 9 models, plus thinking ablation runs.

| Benchmark | Samples | Model Combos | Total Entries | Description |
|---|---|---|---|---|
| GPQA Diamond | 198 | 9 (1-tuple) | ~1,782 | Graduate-level science QA (A/B/C/D) |
| BFCL | 200 | 9 (1-tuple) | ~1,800 | Multi-turn function calling |
| HotpotQA | 200 | 81 (2-tuple: planner × solver) | ~16,200 | Multi-hop QA with planning |
| MathQA | 200 | 81 (2-tuple: answer × critic) | ~16,200 | Self-reflective math QA |
| GPQA Thinking Ablation | 198 | 8 configs (Opus + Haiku 4.5) | ~1,584 | Thinking effort impact study |

## Models Evaluated

All models accessed via AWS Bedrock Application Inference Profiles:

| Model | Provider | Input $/MTok | Output $/MTok |
|---|---|---|---|
| Claude 3 Haiku | Anthropic | $0.25 | $1.25 |
| Claude Haiku 4.5 | Anthropic | $0.80 | $4.00 |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 |
| gpt-oss-20b | OpenAI | $0.22 | $0.88 |
| gpt-oss-120b | OpenAI | $1.20 | $4.80 |
| Kimi K2.5 | MoonshotAI | $0.35 | $1.40 |
| Ministral 3 8B | Mistral | $0.04 | $0.04 |
| Qwen3 32B | Qwen | $0.17 | $0.85 |
| Qwen3 Next 80B A3B | Qwen | $0.25 | $1.25 |

## How to Use

1. Download `cache.db` and place it at `agentopt/.agentopt_cache/cache.db`
2. Run benchmarks with `LLMTracker(cache=True)`; all API calls will replay from cache instantly

```python
from agentopt import LLMTracker

tracker = LLMTracker(cache=True, cache_dir="agentopt/.agentopt_cache")
tracker.start()

# Run any benchmark: all Bedrock calls will be served from cache
# No AWS credentials or API calls needed
```

Or use the cache selector simulator directly:

```bash
python cache_selector_sim.py --benchmark gpqa --selectors all --seeds 50
```

## Schema

Single table `cache` with two columns:

| Column | Type | Description |
|---|---|---|
| `key` | TEXT (PRIMARY KEY) | SHA-256 hash of the canonical request body |
| `data_json` | TEXT | JSON containing: `response_bytes_b64` (base64-encoded full Bedrock response), `response_headers`, `latency_seconds` (original wall time), `request_body` (full request including messages, model ARN, inference config) |

Each cached response includes:

- Full model output (text + reasoning/thinking content blocks)
- `usage` (inputTokens, outputTokens, totalTokens)
- `metrics.latencyMs` (server-side processing time)
- `stopReason` (end_turn, max_tokens, content_filtered)

## Date Collected

March 2026, using AWS Bedrock on-demand inference in us-east-1.

## Associated Repository

[github.com/AgentOptimizer/agentopt](https://github.com/AgentOptimizer/agentopt)
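
## Example: Reading the Cache Directly

The schema described above can also be inspected without `agentopt`, using only the Python standard library. The following is a minimal sketch, not part of the AgentOpt codebase: the helper names `decode_entry`, `iter_entries`, and `estimate_cost` are hypothetical, and the code assumes that `response_bytes_b64` decodes to a UTF-8 JSON document carrying the `usage` field listed in the schema section. Prices passed to `estimate_cost` are the per-million-token figures from the models table.

```python
import base64
import json
import sqlite3
from pathlib import Path


def decode_entry(data_json: str) -> dict:
    """Decode one `data_json` value from the `cache` table.

    Assumption: the base64 payload is a UTF-8 JSON document
    (the Bedrock response body). The dataset card does not confirm this.
    """
    record = json.loads(data_json)
    raw = base64.b64decode(record["response_bytes_b64"])
    return {
        "latency_seconds": record.get("latency_seconds"),
        "request_body": record.get("request_body"),
        "response": json.loads(raw.decode("utf-8")),
    }


def iter_entries(db_path: str, limit: int = 5):
    """Yield (key, decoded entry) pairs for up to `limit` rows."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT key, data_json FROM cache LIMIT ?", (limit,)
        )
        for key, data_json in rows:
            yield key, decode_entry(data_json)
    finally:
        conn.close()


def estimate_cost(usage: dict, input_price: float, output_price: float) -> float:
    """Cost in USD for one call, given prices in USD per million tokens."""
    return (
        usage["inputTokens"] * input_price
        + usage["outputTokens"] * output_price
    ) / 1_000_000


def main(db_path: str = "agentopt/.agentopt_cache/cache.db") -> None:
    # Skip quietly if the cache file has not been downloaded yet.
    if not Path(db_path).exists():
        return
    for key, entry in iter_entries(db_path):
        usage = entry["response"].get("usage", {})
        print(key[:12], entry["latency_seconds"], usage)


if __name__ == "__main__":
    main()
```

Reading rows lazily through a generator keeps memory use low, which matters for a cache of this size. To price a call, pass the decoded `usage` dictionary together with the matching row of the models table, for example `estimate_cost(usage, 0.25, 1.25)` for Claude 3 Haiku.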
Provider:
sripad17