xbench/ScienceQA
Source: Hugging Face. Updated: 2025-06-18. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/xbench/ScienceQA
Description:
xbench is an evergreen (continuously updated), contamination-free, real-world, domain-specific AI evaluation framework designed to measure both the intelligence frontier and the real-world utility of AI systems. It has two complementary tracks: AGI Tracking, which measures core model capabilities such as reasoning, tool use, and memory; and Profession Aligned, a new class of evaluations grounded in workflows, environments, and business KPIs and co-designed with domain experts. This dataset open-sources the source data and evaluation code for two AGI Tracking benchmarks: ScienceQA and DeepSearch.
Provided by:
xbench



