Hard2Verify
ModelScope community · Updated 2025-11-27 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/Salesforce/Hard2Verify
Description:
# Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
> Large language model (LLM)-based reasoning systems have recently achieved gold medal-level performance in the IMO 2025 competition, writing mathematical proofs where, to receive full credit, each step must be not only correct but also sufficiently supported. To train LLM-based reasoners in such challenging, open-ended settings, strong verifiers capable of catching step-level mistakes are necessary prerequisites. We introduce Hard2Verify, a human-annotated, step-level verification benchmark produced with over 500 hours of human labor. Hard2Verify is designed to rigorously assess step-level verifiers at the frontier: Verifiers must provide step-level annotations or identify the first error in responses generated by frontier LLMs for very recent, challenging, and open-ended math questions. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag behind closed-source models. We subsequently analyze what drives poor performance in step-level verification, the impacts of scaling verifier compute, as well as fundamental questions such as self-verification and verification-generation dynamics.
*Authors: Shrey Pandit*, Austin Xu*, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, Shafiq Joty*
This repository contains the Hard2Verify evaluation data.
- Paper: https://arxiv.org/abs/2510.13744
- Evaluation code: https://github.com/SalesforceAIResearch/Hard2Verify
Note: This dataset was generated using GPT, Gemini, and Claude and should not be used to develop competing products.
## Sample Usage
To load and decrypt the dataset:
```python
# git clone https://github.com/SalesforceAIResearch/Hard2Verify
from utils import decrypt_sample  # utils.py from the GitHub repo
from datasets import load_dataset
ds = load_dataset("Salesforce/Hard2Verify", split="test")
ds = ds.map(decrypt_sample)
# You can then access the decrypted samples, for example:
# for sample in ds:
#     print(f"Question: {sample['question']}")
#     print(f"Model Response Steps: {sample['model_response_by_step']}")
#     print(f"Human Labels: {sample['human_labels']}")
#     print(f"First Error Index: {sample['human_labels_first_error_idx']}")
```
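Once the samples are decrypted, a verifier's step-level predictions can be compared against `human_labels`. The following is a minimal sketch of one way to do that, pooling steps across samples and computing recall per label class. The function name `pooled_step_recall` and the toy predictions are hypothetical, and this is not necessarily the metric used in the paper or the evaluation code. It assumes labels are the integers 0 and 1, as described in the sample format.

```python
from typing import Dict, Iterable, List, Tuple


def pooled_step_recall(
    pairs: Iterable[Tuple[List[int], List[int]]],
) -> Dict[int, float]:
    """Pool steps across samples and return recall per human label class.

    Each pair is (human_labels, predicted_labels) for one sample, where
    0 = incorrect step and 1 = correct step. Classes that never occur in
    the human labels are left out of the result.
    """
    totals = {0: 0, 1: 0}
    hits = {0: 0, 1: 0}
    for human_labels, predicted_labels in pairs:
        if len(human_labels) != len(predicted_labels):
            raise ValueError("expected exactly one prediction per step")
        for human, predicted in zip(human_labels, predicted_labels):
            totals[human] += 1
            hits[human] += int(human == predicted)
    return {cls: hits[cls] / totals[cls] for cls in totals if totals[cls] > 0}


# Made-up labels for two samples (not taken from the dataset):
# (human_labels, predictions from a hypothetical verifier).
toy_pairs = [
    ([1, 1, 0, 1], [1, 1, 0, 0]),
    ([1, 0], [1, 1]),
]
recall = pooled_step_recall(toy_pairs)
print(recall)
```

Reporting recall separately for correct and incorrect steps keeps a verifier that labels every step as correct from looking strong when most steps are correct.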
## Dataset sample format
Once decrypted, each row in the dataset will contain the following:
```
{
'unique_id': str: Contains information about source olympiad and generator model,
'question': str: Original math problem,
'model_response_by_step': list: Model solution, broken down by step,
'human_labels': list: A human correctness label corresponding to each step (0 = incorrect, 1 = correct),
'human_labels_first_error_idx': int: index of first error, as determined by human_labels (0-indexed); -1 means no error,
}
```
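The relationship between `human_labels` and `human_labels_first_error_idx` described above can be sketched as follows. The helper `first_error_index` and the example row are hypothetical and only illustrate the stated convention (0-indexed, -1 when no step is labeled incorrect). The same helper could be applied to a verifier's predicted labels to obtain its predicted first error.

```python
from typing import List


def first_error_index(labels: List[int]) -> int:
    """Return the 0-based index of the first step labeled 0 (incorrect).

    Returns -1 when every step is labeled 1 (correct).
    """
    for idx, label in enumerate(labels):
        if label == 0:
            return idx
    return -1


# Made-up row for illustration (not taken from the dataset).
example_row = {
    "human_labels": [1, 1, 0, 1],
    "human_labels_first_error_idx": 2,
}
derived_idx = first_error_index(example_row["human_labels"])
assert derived_idx == example_row["human_labels_first_error_idx"]
```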
## Ethics disclaimer for Salesforce AI models, data, code
This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our standard [AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ExternalFacing_Services_Policy.pdf) and [AI AUP](https://www.salesforce.com/content/dam/web/en_us/www/documents/legal/Agreements/policies/ai-acceptable-use-policy.pdf).
## Citation
```
@misc{pandit2025hard,
title={Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math},
author={Pandit, Shrey and Xu, Austin and Nguyen, Xuan-Phi and Ming, Yifei and Xiong, Caiming and Joty, Shafiq},
year={2025},
journal={arXiv preprint arXiv:2510.13744},
}
```
Provider: maas
Created: 2025-10-16



