SOS-Bench
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/penfever/sos-bench
Description:
This dataset is the largest standardized, reproducible LLM meta-benchmark to date. It evaluates the alignment of large language models (LLMs) using multiple evaluation metrics, covering dimensions that include safety, world knowledge, and instruction following. It also highlights how data scaling and prompt diversity affect model performance during supervised fine-tuning (SFT) in the post-training phase.



