akenginorhun/neurips-2026-evals
Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/akenginorhun/neurips-2026-evals
Dataset description:
---
dataset_info:
- config_name: claude-opus
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearch
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: taubench_telecom
- name: toolcall15
- config_name: gemini-31-pro
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearch
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: taubench_telecom
- name: toolcall15
- config_name: gemma4-e4b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: gpt-54
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearch
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: taubench_telecom
- name: toolcall15
- config_name: kimi-k25
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: gaia
- name: livecodebench
- name: pinchbench
- name: toolcall15
- config_name: lfm-12b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: minimax-m25
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: nemotron-nano-30b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: nemotron-nano-4b-fp8
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: qwen-27b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: qwen-2b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: qwen-397b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: qwen-4b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: qwen-9b
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
- config_name: trinity-large
features:
- name: record_id
dtype: string
- name: benchmark
dtype: string
- name: model
dtype: string
- name: backend
dtype: string
- name: problem
dtype: string
- name: reference
dtype: string
- name: model_answer
dtype: string
- name: is_correct
dtype: bool
- name: score
dtype: float64
- name: latency_seconds
dtype: float64
- name: prompt_tokens
dtype: int64
- name: completion_tokens
dtype: int64
- name: cost_usd
dtype: float64
- name: throughput_tok_per_sec
dtype: float64
- name: energy_joules
dtype: float64
- name: power_watts
dtype: float64
- name: gpu_utilization_pct
dtype: float64
- name: estimated_flops
dtype: float64
- name: scoring_metadata
dtype: string
- name: error
dtype: string
splits:
- name: deepresearch
- name: gaia
- name: livecodebench
- name: liveresearchbench
- name: pinchbench
- name: taubench
- name: toolcall15
configs:
- config_name: claude-opus
data_files:
- split: deepresearch
path: claude-opus/deepresearch/*
- split: gaia
path: claude-opus/gaia/*
- split: livecodebench
path: claude-opus/livecodebench/*
- split: liveresearch
path: claude-opus/liveresearch/*
- split: liveresearchbench
path: claude-opus/liveresearchbench/*
- split: pinchbench
path: claude-opus/pinchbench/*
- split: taubench
path: claude-opus/taubench/*
- split: taubench_telecom
path: claude-opus/taubench_telecom/*
- split: toolcall15
path: claude-opus/toolcall15/*
- config_name: gemini-31-pro
data_files:
- split: deepresearch
path: gemini-31-pro/deepresearch/*
- split: gaia
path: gemini-31-pro/gaia/*
- split: livecodebench
path: gemini-31-pro/livecodebench/*
- split: liveresearch
path: gemini-31-pro/liveresearch/*
- split: liveresearchbench
path: gemini-31-pro/liveresearchbench/*
- split: pinchbench
path: gemini-31-pro/pinchbench/*
- split: taubench
path: gemini-31-pro/taubench/*
- split: taubench_telecom
path: gemini-31-pro/taubench_telecom/*
- split: toolcall15
path: gemini-31-pro/toolcall15/*
- config_name: gemma4-e4b
data_files:
- split: deepresearch
path: gemma4-e4b/deepresearch/*
- split: gaia
path: gemma4-e4b/gaia/*
- split: livecodebench
path: gemma4-e4b/livecodebench/*
- split: liveresearchbench
path: gemma4-e4b/liveresearchbench/*
- split: pinchbench
path: gemma4-e4b/pinchbench/*
- split: taubench
path: gemma4-e4b/taubench/*
- split: toolcall15
path: gemma4-e4b/toolcall15/*
- config_name: gpt-54
data_files:
- split: deepresearch
path: gpt-54/deepresearch/*
- split: gaia
path: gpt-54/gaia/*
- split: livecodebench
path: gpt-54/livecodebench/*
- split: liveresearch
path: gpt-54/liveresearch/*
- split: liveresearchbench
path: gpt-54/liveresearchbench/*
- split: pinchbench
path: gpt-54/pinchbench/*
- split: taubench
path: gpt-54/taubench/*
- split: taubench_telecom
path: gpt-54/taubench_telecom/*
- split: toolcall15
path: gpt-54/toolcall15/*
- config_name: kimi-k25
data_files:
- split: gaia
path: kimi-k25/gaia/*
- split: livecodebench
path: kimi-k25/livecodebench/*
- split: pinchbench
path: kimi-k25/pinchbench/*
- split: toolcall15
path: kimi-k25/toolcall15/*
- config_name: lfm-12b
data_files:
- split: deepresearch
path: lfm-12b/deepresearch/*
- split: gaia
path: lfm-12b/gaia/*
- split: livecodebench
path: lfm-12b/livecodebench/*
- split: liveresearchbench
path: lfm-12b/liveresearchbench/*
- split: pinchbench
path: lfm-12b/pinchbench/*
- split: taubench
path: lfm-12b/taubench/*
- split: toolcall15
path: lfm-12b/toolcall15/*
- config_name: minimax-m25
data_files:
- split: deepresearch
path: minimax-m25/deepresearch/*
- split: gaia
path: minimax-m25/gaia/*
- split: livecodebench
path: minimax-m25/livecodebench/*
- split: liveresearchbench
path: minimax-m25/liveresearchbench/*
- split: pinchbench
path: minimax-m25/pinchbench/*
- split: taubench
path: minimax-m25/taubench/*
- split: toolcall15
path: minimax-m25/toolcall15/*
- config_name: nemotron-nano-30b
data_files:
- split: deepresearch
path: nemotron-nano-30b/deepresearch/*
- split: gaia
path: nemotron-nano-30b/gaia/*
- split: livecodebench
path: nemotron-nano-30b/livecodebench/*
- split: liveresearchbench
path: nemotron-nano-30b/liveresearchbench/*
- split: pinchbench
path: nemotron-nano-30b/pinchbench/*
- split: taubench
path: nemotron-nano-30b/taubench/*
- split: toolcall15
path: nemotron-nano-30b/toolcall15/*
- config_name: nemotron-nano-4b-fp8
data_files:
- split: deepresearch
path: nemotron-nano-4b-fp8/deepresearch/*
- split: gaia
path: nemotron-nano-4b-fp8/gaia/*
- split: livecodebench
path: nemotron-nano-4b-fp8/livecodebench/*
- split: liveresearchbench
path: nemotron-nano-4b-fp8/liveresearchbench/*
- split: pinchbench
path: nemotron-nano-4b-fp8/pinchbench/*
- split: taubench
path: nemotron-nano-4b-fp8/taubench/*
- split: toolcall15
path: nemotron-nano-4b-fp8/toolcall15/*
- config_name: qwen-27b
data_files:
- split: deepresearch
path: qwen-27b/deepresearch/*
- split: gaia
path: qwen-27b/gaia/*
- split: livecodebench
path: qwen-27b/livecodebench/*
- split: pinchbench
path: qwen-27b/pinchbench/*
- split: taubench
path: qwen-27b/taubench/*
- split: toolcall15
path: qwen-27b/toolcall15/*
- config_name: qwen-2b
data_files:
- split: deepresearch
path: qwen-2b/deepresearch/*
- split: gaia
path: qwen-2b/gaia/*
- split: livecodebench
path: qwen-2b/livecodebench/*
- split: liveresearchbench
path: qwen-2b/liveresearchbench/*
- split: pinchbench
path: qwen-2b/pinchbench/*
- split: taubench
path: qwen-2b/taubench/*
- split: toolcall15
path: qwen-2b/toolcall15/*
- config_name: qwen-397b
data_files:
- split: deepresearch
path: qwen-397b/deepresearch/*
- split: gaia
path: qwen-397b/gaia/*
- split: livecodebench
path: qwen-397b/livecodebench/*
- split: liveresearchbench
path: qwen-397b/liveresearchbench/*
- split: pinchbench
path: qwen-397b/pinchbench/*
- split: taubench
path: qwen-397b/taubench/*
- split: toolcall15
path: qwen-397b/toolcall15/*
- config_name: qwen-4b
data_files:
- split: deepresearch
path: qwen-4b/deepresearch/*
- split: gaia
path: qwen-4b/gaia/*
- split: liveresearchbench
path: qwen-4b/liveresearchbench/*
- split: pinchbench
path: qwen-4b/pinchbench/*
- split: taubench
path: qwen-4b/taubench/*
- split: toolcall15
path: qwen-4b/toolcall15/*
- config_name: qwen-9b
data_files:
- split: deepresearch
path: qwen-9b/deepresearch/*
- split: gaia
path: qwen-9b/gaia/*
- split: livecodebench
path: qwen-9b/livecodebench/*
- split: liveresearchbench
path: qwen-9b/liveresearchbench/*
- split: pinchbench
path: qwen-9b/pinchbench/*
- split: taubench
path: qwen-9b/taubench/*
- split: toolcall15
path: qwen-9b/toolcall15/*
- config_name: trinity-large
data_files:
- split: deepresearch
path: trinity-large/deepresearch/*
- split: gaia
path: trinity-large/gaia/*
- split: livecodebench
path: trinity-large/livecodebench/*
- split: liveresearchbench
path: trinity-large/liveresearchbench/*
- split: pinchbench
path: trinity-large/pinchbench/*
- split: taubench
path: trinity-large/taubench/*
- split: toolcall15
path: trinity-large/toolcall15/*
---
# NeurIPS 2026 Agent Evaluation Dataset
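This dataset contains per-problem evaluation records for 15 AI agents (one config per model), each evaluated on up to nine benchmarks. Every record stores the problem, the reference, the model's answer, a correctness flag and score, and efficiency measurements such as latency, token counts, cost, energy, and estimated FLOPs.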
## Dataset Structure
The dataset is organized by model: each model is a config, and each benchmark that model was evaluated on is a split of that config.
Each model/benchmark folder contains:
- Main results file (`.jsonl` or `.parquet` format)
- Summary statistics (`.summary.json`) - for models with metadata
- Configuration file (`.toml`) - for models with metadata
- Traces folder with execution traces (`traces/traces.jsonl`) - for models with metadata
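
The `configs` metadata maps each split to the glob `<model>/<benchmark>/*`, so repository paths can be grouped by their first two components. The following is a minimal sketch under that assumption. The file names `results.jsonl` and `results.parquet` are hypothetical placeholders (only `traces/traces.jsonl` is named in this card), and `list_remote_files` assumes the `huggingface_hub` package is installed and network access is available.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

REPO_ID = "akenginorhun/neurips-2026-evals"


def split_glob(model: str, benchmark: str) -> str:
    """Return the data_files glob the card metadata uses for one split."""
    return f"{model}/{benchmark}/*"


def group_by_split(paths: Iterable[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group repo-relative paths by their (model, benchmark) folder."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for path in paths:
        parts = path.split("/")
        if len(parts) < 3:
            continue  # skip top-level files such as README.md
        model, benchmark = parts[0], parts[1]
        groups[(model, benchmark)].append("/".join(parts[2:]))
    return dict(groups)


def list_remote_files() -> List[str]:
    """List every file in the dataset repo (requires network access)."""
    from huggingface_hub import list_repo_files  # lazy import: optional dependency
    return list_repo_files(REPO_ID, repo_type="dataset")


# Hypothetical paths for illustration; real file names may differ.
example_paths = [
    "README.md",
    "qwen-27b/gaia/results.jsonl",
    "qwen-27b/gaia/traces/traces.jsonl",
    "claude-opus/taubench/results.parquet",
]
grouped = group_by_split(example_paths)
```

Calling `group_by_split(list_remote_files())` would apply the same grouping to the actual repository contents.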
## Models
- `claude-opus`: 9 benchmarks
- `gemini-31-pro`: 9 benchmarks
- `gemma4-e4b`: 7 benchmarks
- `gpt-54`: 9 benchmarks
- `kimi-k25`: 4 benchmarks
- `lfm-12b`: 7 benchmarks
- `minimax-m25`: 7 benchmarks
- `nemotron-nano-30b`: 7 benchmarks
- `nemotron-nano-4b-fp8`: 7 benchmarks
- `qwen-27b`: 6 benchmarks
- `qwen-2b`: 7 benchmarks
- `qwen-397b`: 7 benchmarks
- `qwen-4b`: 6 benchmarks
- `qwen-9b`: 7 benchmarks
- `trinity-large`: 7 benchmarks
## Benchmarks
- `deepresearch`
- `gaia`
- `livecodebench`
- `liveresearch`
- `liveresearchbench`
- `pinchbench`
- `taubench`
- `taubench_telecom`
- `terminalbench` (listed here, but no config in the metadata above defines a split with this name)
- `toolcall15`
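
Because models cover different benchmarks, a cross-model comparison should be restricted to the benchmarks the chosen models share. The sketch below hard-codes the split lists from this card's metadata; the helper names `models_with` and `shared_benchmarks` are illustrative and not part of the dataset.

```python
from typing import Dict, Iterable, List, Optional, Set

# Split coverage transcribed from the dataset card's `dataset_info` section.
BASE: Set[str] = {
    "deepresearch", "gaia", "livecodebench", "liveresearchbench",
    "pinchbench", "taubench", "toolcall15",
}
EXTRA: Set[str] = {"liveresearch", "taubench_telecom"}

SPLITS: Dict[str, Set[str]] = {
    "claude-opus": BASE | EXTRA,
    "gemini-31-pro": BASE | EXTRA,
    "gemma4-e4b": set(BASE),
    "gpt-54": BASE | EXTRA,
    "kimi-k25": {"gaia", "livecodebench", "pinchbench", "toolcall15"},
    "lfm-12b": set(BASE),
    "minimax-m25": set(BASE),
    "nemotron-nano-30b": set(BASE),
    "nemotron-nano-4b-fp8": set(BASE),
    "qwen-27b": BASE - {"liveresearchbench"},
    "qwen-2b": set(BASE),
    "qwen-397b": set(BASE),
    "qwen-4b": BASE - {"livecodebench"},
    "qwen-9b": set(BASE),
    "trinity-large": set(BASE),
}


def models_with(benchmark: str) -> List[str]:
    """Return the models whose config defines a split for `benchmark`."""
    return sorted(m for m, splits in SPLITS.items() if benchmark in splits)


def shared_benchmarks(models: Optional[Iterable[str]] = None) -> List[str]:
    """Return the benchmarks covered by every listed model (default: all models)."""
    chosen = list(models) if models is not None else list(SPLITS)
    if not chosen:
        return []
    return sorted(set.intersection(*(SPLITS[m] for m in chosen)))


common_to_all = shared_benchmarks()
telecom_models = models_with("taubench_telecom")
```

With the coverage above, only `gaia`, `pinchbench`, and `toolcall15` are available for all 15 models, and `models_with("terminalbench")` returns an empty list.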
## Usage
```python
from datasets import load_dataset

# Load every benchmark split for one model; returns a DatasetDict keyed by split name
claude_results = load_dataset('akenginorhun/neurips-2026-evals', name='claude-opus')

# Load a single benchmark split for a model; returns a Dataset
qwen_gaia = load_dataset('akenginorhun/neurips-2026-evals', name='qwen-27b', split='gaia')
```
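
Once a split is loaded, the per-record fields listed in the metadata (`is_correct`, `score`, `latency_seconds`, `cost_usd`, `error`) can be aggregated into a small summary. The sketch below is one possible way to do that, not the authors' scoring code. It assumes that numeric fields may be null for some records, that an empty or null `error` means the run succeeded, and that a missing `is_correct` counts as incorrect. The demo rows are synthetic and are not taken from the dataset.

```python
from statistics import mean
from typing import Any, Dict, Iterable, Optional

REPO_ID = "akenginorhun/neurips-2026-evals"


def load_rows(model: str, benchmark: str):
    """Load one model/benchmark split (requires `datasets` and network access)."""
    from datasets import load_dataset  # lazy import so the helpers below work offline
    return load_dataset(REPO_ID, name=model, split=benchmark)


def _mean_present(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average the non-null values, or return None if there are none."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


def summarize(rows: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate accuracy, score, latency, cost, and error count over records."""
    rows = list(rows)
    if not rows:
        return {"n": 0}
    return {
        "n": len(rows),
        # Assumption: a missing is_correct value counts as incorrect.
        "accuracy": sum(1 for r in rows if r.get("is_correct") is True) / len(rows),
        "mean_score": _mean_present(r.get("score") for r in rows),
        "mean_latency_seconds": _mean_present(r.get("latency_seconds") for r in rows),
        "total_cost_usd": sum(r["cost_usd"] for r in rows if r.get("cost_usd") is not None),
        # Assumption: a non-empty error string marks a failed run.
        "n_errors": sum(1 for r in rows if r.get("error")),
    }


# Synthetic records for illustration only.
synthetic_rows = [
    {"is_correct": True, "score": 1.0, "latency_seconds": 2.0, "cost_usd": 0.25, "error": None},
    {"is_correct": False, "score": 0.0, "latency_seconds": 4.0, "cost_usd": 0.5, "error": "timeout"},
    {"is_correct": True, "score": 0.5, "latency_seconds": None, "cost_usd": None, "error": ""},
]
summary = summarize(synthetic_rows)
```

On real data the same helper could be called as `summarize(load_rows("qwen-27b", "gaia"))`, since iterating a loaded split yields one dict per record.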
Provided by: akenginorhun