
akenginorhun/neurips-2026-evals

Source: Hugging Face · Updated 2026-04-22 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/akenginorhun/neurips-2026-evals
Resource description:
--- dataset_info: - config_name: claude-opus features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearch - name: liveresearchbench - name: pinchbench - name: taubench - name: taubench_telecom - name: toolcall15 - config_name: gemini-31-pro features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearch - name: liveresearchbench - name: pinchbench - name: taubench - name: taubench_telecom - name: toolcall15 - config_name: gemma4-e4b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: 
backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: gpt-54 features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearch - name: liveresearchbench - name: pinchbench - name: taubench - name: taubench_telecom - name: toolcall15 - config_name: kimi-k25 features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: 
float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: gaia - name: livecodebench - name: pinchbench - name: toolcall15 - config_name: lfm-12b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: minimax-m25 features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: 
scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: nemotron-nano-30b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: nemotron-nano-4b-fp8 features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: qwen-27b features: 
- name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: pinchbench - name: taubench - name: toolcall15 - config_name: qwen-2b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: qwen-397b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - 
name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: qwen-4b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: qwen-9b features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: 
gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 - config_name: trinity-large features: - name: record_id dtype: string - name: benchmark dtype: string - name: model dtype: string - name: backend dtype: string - name: problem dtype: string - name: reference dtype: string - name: model_answer dtype: string - name: is_correct dtype: bool - name: score dtype: float64 - name: latency_seconds dtype: float64 - name: prompt_tokens dtype: int64 - name: completion_tokens dtype: int64 - name: cost_usd dtype: float64 - name: throughput_tok_per_sec dtype: float64 - name: energy_joules dtype: float64 - name: power_watts dtype: float64 - name: gpu_utilization_pct dtype: float64 - name: estimated_flops dtype: float64 - name: scoring_metadata dtype: string - name: error dtype: string splits: - name: deepresearch - name: gaia - name: livecodebench - name: liveresearchbench - name: pinchbench - name: taubench - name: toolcall15 configs: - config_name: claude-opus data_files: - split: deepresearch path: claude-opus/deepresearch/* - split: gaia path: claude-opus/gaia/* - split: livecodebench path: claude-opus/livecodebench/* - split: liveresearch path: claude-opus/liveresearch/* - split: liveresearchbench path: claude-opus/liveresearchbench/* - split: pinchbench path: claude-opus/pinchbench/* - split: taubench path: claude-opus/taubench/* - split: taubench_telecom path: claude-opus/taubench_telecom/* - split: toolcall15 path: claude-opus/toolcall15/* - config_name: gemini-31-pro data_files: - split: deepresearch path: gemini-31-pro/deepresearch/* - split: gaia path: gemini-31-pro/gaia/* - split: livecodebench path: gemini-31-pro/livecodebench/* - split: liveresearch path: gemini-31-pro/liveresearch/* - split: liveresearchbench path: 
gemini-31-pro/liveresearchbench/* - split: pinchbench path: gemini-31-pro/pinchbench/* - split: taubench path: gemini-31-pro/taubench/* - split: taubench_telecom path: gemini-31-pro/taubench_telecom/* - split: toolcall15 path: gemini-31-pro/toolcall15/* - config_name: gemma4-e4b data_files: - split: deepresearch path: gemma4-e4b/deepresearch/* - split: gaia path: gemma4-e4b/gaia/* - split: livecodebench path: gemma4-e4b/livecodebench/* - split: liveresearchbench path: gemma4-e4b/liveresearchbench/* - split: pinchbench path: gemma4-e4b/pinchbench/* - split: taubench path: gemma4-e4b/taubench/* - split: toolcall15 path: gemma4-e4b/toolcall15/* - config_name: gpt-54 data_files: - split: deepresearch path: gpt-54/deepresearch/* - split: gaia path: gpt-54/gaia/* - split: livecodebench path: gpt-54/livecodebench/* - split: liveresearch path: gpt-54/liveresearch/* - split: liveresearchbench path: gpt-54/liveresearchbench/* - split: pinchbench path: gpt-54/pinchbench/* - split: taubench path: gpt-54/taubench/* - split: taubench_telecom path: gpt-54/taubench_telecom/* - split: toolcall15 path: gpt-54/toolcall15/* - config_name: kimi-k25 data_files: - split: gaia path: kimi-k25/gaia/* - split: livecodebench path: kimi-k25/livecodebench/* - split: pinchbench path: kimi-k25/pinchbench/* - split: toolcall15 path: kimi-k25/toolcall15/* - config_name: lfm-12b data_files: - split: deepresearch path: lfm-12b/deepresearch/* - split: gaia path: lfm-12b/gaia/* - split: livecodebench path: lfm-12b/livecodebench/* - split: liveresearchbench path: lfm-12b/liveresearchbench/* - split: pinchbench path: lfm-12b/pinchbench/* - split: taubench path: lfm-12b/taubench/* - split: toolcall15 path: lfm-12b/toolcall15/* - config_name: minimax-m25 data_files: - split: deepresearch path: minimax-m25/deepresearch/* - split: gaia path: minimax-m25/gaia/* - split: livecodebench path: minimax-m25/livecodebench/* - split: liveresearchbench path: minimax-m25/liveresearchbench/* - split: pinchbench path: 
minimax-m25/pinchbench/* - split: taubench path: minimax-m25/taubench/* - split: toolcall15 path: minimax-m25/toolcall15/* - config_name: nemotron-nano-30b data_files: - split: deepresearch path: nemotron-nano-30b/deepresearch/* - split: gaia path: nemotron-nano-30b/gaia/* - split: livecodebench path: nemotron-nano-30b/livecodebench/* - split: liveresearchbench path: nemotron-nano-30b/liveresearchbench/* - split: pinchbench path: nemotron-nano-30b/pinchbench/* - split: taubench path: nemotron-nano-30b/taubench/* - split: toolcall15 path: nemotron-nano-30b/toolcall15/* - config_name: nemotron-nano-4b-fp8 data_files: - split: deepresearch path: nemotron-nano-4b-fp8/deepresearch/* - split: gaia path: nemotron-nano-4b-fp8/gaia/* - split: livecodebench path: nemotron-nano-4b-fp8/livecodebench/* - split: liveresearchbench path: nemotron-nano-4b-fp8/liveresearchbench/* - split: pinchbench path: nemotron-nano-4b-fp8/pinchbench/* - split: taubench path: nemotron-nano-4b-fp8/taubench/* - split: toolcall15 path: nemotron-nano-4b-fp8/toolcall15/* - config_name: qwen-27b data_files: - split: deepresearch path: qwen-27b/deepresearch/* - split: gaia path: qwen-27b/gaia/* - split: livecodebench path: qwen-27b/livecodebench/* - split: pinchbench path: qwen-27b/pinchbench/* - split: taubench path: qwen-27b/taubench/* - split: toolcall15 path: qwen-27b/toolcall15/* - config_name: qwen-2b data_files: - split: deepresearch path: qwen-2b/deepresearch/* - split: gaia path: qwen-2b/gaia/* - split: livecodebench path: qwen-2b/livecodebench/* - split: liveresearchbench path: qwen-2b/liveresearchbench/* - split: pinchbench path: qwen-2b/pinchbench/* - split: taubench path: qwen-2b/taubench/* - split: toolcall15 path: qwen-2b/toolcall15/* - config_name: qwen-397b data_files: - split: deepresearch path: qwen-397b/deepresearch/* - split: gaia path: qwen-397b/gaia/* - split: livecodebench path: qwen-397b/livecodebench/* - split: liveresearchbench path: qwen-397b/liveresearchbench/* - split: 
pinchbench path: qwen-397b/pinchbench/* - split: taubench path: qwen-397b/taubench/* - split: toolcall15 path: qwen-397b/toolcall15/* - config_name: qwen-4b data_files: - split: deepresearch path: qwen-4b/deepresearch/* - split: gaia path: qwen-4b/gaia/* - split: liveresearchbench path: qwen-4b/liveresearchbench/* - split: pinchbench path: qwen-4b/pinchbench/* - split: taubench path: qwen-4b/taubench/* - split: toolcall15 path: qwen-4b/toolcall15/* - config_name: qwen-9b data_files: - split: deepresearch path: qwen-9b/deepresearch/* - split: gaia path: qwen-9b/gaia/* - split: livecodebench path: qwen-9b/livecodebench/* - split: liveresearchbench path: qwen-9b/liveresearchbench/* - split: pinchbench path: qwen-9b/pinchbench/* - split: taubench path: qwen-9b/taubench/* - split: toolcall15 path: qwen-9b/toolcall15/* - config_name: trinity-large data_files: - split: deepresearch path: trinity-large/deepresearch/* - split: gaia path: trinity-large/gaia/* - split: livecodebench path: trinity-large/livecodebench/* - split: liveresearchbench path: trinity-large/liveresearchbench/* - split: pinchbench path: trinity-large/pinchbench/* - split: taubench path: trinity-large/taubench/* - split: toolcall15 path: trinity-large/toolcall15/*
---

# NeurIPS 2026 Agent Evaluation Dataset

This dataset contains per-record evaluation results for 15 AI agents (one config per model) across multiple benchmarks.

## Dataset Structure

The dataset is organized by model: each model is a config, and each benchmark that model was evaluated on is a split of that config.
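Because not every model was run on every benchmark, it can help to invert the config/split layout into a benchmark-to-models map before comparing results. The sketch below does that for a three-config subset copied from the YAML header of this card. `LAYOUT_SUBSET`, `fetch_layout`, and `models_by_benchmark` are illustrative names, not part of the dataset; `fetch_layout` relies on `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library and needs network access, so it is defined here but not called.

```python
from typing import Dict, List

REPO_ID = "akenginorhun/neurips-2026-evals"

# Subset of the config -> splits layout, copied from the YAML header of this card.
LAYOUT_SUBSET: Dict[str, List[str]] = {
    "kimi-k25": ["gaia", "livecodebench", "pinchbench", "toolcall15"],
    "qwen-4b": ["deepresearch", "gaia", "liveresearchbench",
                "pinchbench", "taubench", "toolcall15"],
    "qwen-27b": ["deepresearch", "gaia", "livecodebench",
                 "pinchbench", "taubench", "toolcall15"],
}


def fetch_layout(repo_id: str = REPO_ID) -> Dict[str, List[str]]:
    """Query the Hub for every config and its splits (requires network access)."""
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import get_dataset_config_names, get_dataset_split_names

    return {
        config: get_dataset_split_names(repo_id, config)
        for config in get_dataset_config_names(repo_id)
    }


def models_by_benchmark(layout: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert config -> splits into benchmark -> sorted list of models."""
    inverted: Dict[str, List[str]] = {}
    for model, splits in layout.items():
        for split in splits:
            inverted.setdefault(split, []).append(model)
    return {bench: sorted(models) for bench, models in sorted(inverted.items())}


coverage = models_by_benchmark(LAYOUT_SUBSET)
print(coverage["livecodebench"])  # ['kimi-k25', 'qwen-27b']
```

Passing the result of `fetch_layout()` instead of `LAYOUT_SUBSET` should give the same view for all configs, assuming the Hub repository matches the header shown here.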
Each model/benchmark folder contains:

- Main results file (`.jsonl` or `.parquet` format)
- Summary statistics (`.summary.json`), for models with metadata
- Configuration file (`.toml`), for models with metadata
- Traces folder with execution traces (`traces/traces.jsonl`), for models with metadata

## Models

- `claude-opus`: 9 benchmarks
- `gemini-31-pro`: 9 benchmarks
- `gemma4-e4b`: 7 benchmarks
- `gpt-54`: 9 benchmarks
- `kimi-k25`: 4 benchmarks
- `lfm-12b`: 7 benchmarks
- `minimax-m25`: 7 benchmarks
- `nemotron-nano-30b`: 7 benchmarks
- `nemotron-nano-4b-fp8`: 7 benchmarks
- `qwen-27b`: 6 benchmarks
- `qwen-2b`: 7 benchmarks
- `qwen-397b`: 7 benchmarks
- `qwen-4b`: 6 benchmarks
- `qwen-9b`: 7 benchmarks
- `trinity-large`: 7 benchmarks

## Benchmarks

- `deepresearch`
- `gaia`
- `livecodebench`
- `liveresearch`
- `liveresearchbench`
- `pinchbench`
- `taubench`
- `taubench_telecom`
- `terminalbench` (listed here, though no config in the YAML header declares it as a split)
- `toolcall15`

## Usage

```python
from datasets import load_dataset

# Load every benchmark split for one model (returns a DatasetDict keyed by split name)
dataset = load_dataset('akenginorhun/neurips-2026-evals', name='claude-opus')

# Load a single benchmark split for one model
dataset = load_dataset('akenginorhun/neurips-2026-evals', name='qwen-27b', split='gaia')
```
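Once a split is loaded, the per-record fields declared in the schema (`is_correct`, `latency_seconds`, `cost_usd`, `error`) can be rolled up into a few headline numbers. The following is a minimal sketch, not the authors' scoring code. It assumes that any of these fields may be null and that a non-empty `error` string marks a failed run; neither assumption is documented in the card. `summarize` is a hypothetical helper, and the demo rows are made up for illustration rather than taken from the dataset.

```python
from typing import Any, Dict, Iterable, Optional


def _mean(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average the non-null values; return None when there are none."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else None


def summarize(rows: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate accuracy, latency, cost, and error count over result records."""
    rows = list(rows)
    costs = [r["cost_usd"] for r in rows if r.get("cost_usd") is not None]
    return {
        "n": len(rows),
        # Records with a null is_correct are excluded from accuracy (assumption).
        "accuracy": _mean(
            float(r["is_correct"]) for r in rows if r.get("is_correct") is not None
        ),
        "mean_latency_seconds": _mean(r.get("latency_seconds") for r in rows),
        "total_cost_usd": sum(costs),
        # A non-empty error string is treated as a failed run (assumption).
        "n_errors": sum(1 for r in rows if r.get("error")),
    }


# Made-up records using the schema's field names; not real dataset rows.
demo_rows = [
    {"is_correct": True, "latency_seconds": 2.0, "cost_usd": 0.25, "error": None},
    {"is_correct": False, "latency_seconds": 4.0, "cost_usd": 0.5, "error": ""},
    {"is_correct": True, "latency_seconds": 6.0, "cost_usd": None, "error": "timeout"},
    {"is_correct": True, "latency_seconds": None, "cost_usd": 0.25, "error": None},
]

summary = summarize(demo_rows)
print(summary)
```

Iterating over a `datasets.Dataset` yields one dict per record, so the same function can be applied directly to a loaded split, for example `summarize(load_dataset('akenginorhun/neurips-2026-evals', name='qwen-27b', split='gaia'))`.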
Provided by:
akenginorhun