ZeroSumEval Games
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/facebookresearch/ZeroSumEval
Description:
This dataset is a diverse collection of games designed to evaluate the capabilities of large language models (LLMs) through dynamic benchmarking. The games span safety challenges, classic games, knowledge tests, and persuasion challenges. Models compete against one another in each game under a round-robin format, and their results are used to assess performance and capabilities. In terms of scale, the dataset covers more than 7,000 simulations across 7 games and 13 models. The task is to evaluate AI capabilities such as strategic reasoning, planning, knowledge application, and creativity through competitive gameplay.
Provider:
Facebook Research



