T-math
ModelScope community · updated 2025-12-05 · indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/t-tech/T-math
Description:
# 🧮 T-Math
**T-Math** is a dataset of Russian math olympiad problems created to assess the reasoning capabilities of large language models (LLMs) in mathematics.
It includes 331 problems from the [All-Russian School Olympiad](https://vos.olimpiada.ru/) and the [Moscow Olympiad](https://mos.olimpiada.ru) for high school students, covering the period from 1998 to 2025.
The tasks and their ground-truth answers were extracted automatically and subsequently verified by human assessors.
Key features:
- Challenging problems that require multi-step reasoning (median completion length for Qwen3-32B is 16K tokens), sourced from top-tier Russian olympiads
- Easily verifiable: answers are numeric-only and checked using the `math_verify` library to compare mathematical expressions
- Not yet saturated, even by frontier reasoning models such as Gemini 2.5 Pro and DeepSeek R1
- Contains 331 samples, making it the largest Russian olympiad-level math benchmark and more statistically robust than smaller datasets such as the 30-sample AIME benchmark
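
To make the robustness point concrete, here is a minimal sketch of how the uncertainty of an accuracy estimate shrinks with benchmark size. It assumes each problem is an independent pass/fail trial and uses a hypothetical true accuracy of 0.70; neither assumption comes from the dataset authors.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical true accuracy, chosen only for illustration.
p = 0.70

se_tmath = standard_error(p, 331)  # benchmark with 331 problems
se_small = standard_error(p, 30)   # benchmark with 30 problems

# Approximate 95% confidence half-widths (normal approximation).
print(f"n=331: +/- {1.96 * se_tmath:.3f}")
print(f"n=30:  +/- {1.96 * se_small:.3f}")
```

Under these assumptions the interval on 331 problems is more than three times narrower than on 30 problems, since the standard error scales with 1/sqrt(n).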
## 📊 Evaluation Results
|Model|pass@1|
|--|--|
|o4-mini-high|**0.73**|
|DeepSeek-R1-0528|<ins>0.71</ins>|
|Gemini-2.5-Pro|0.70|
|Claude Sonnet 4|0.56|
|T-pro-it-2.0|0.54|
|Qwen3-32B|0.53|
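
The card does not say how many completions were sampled per problem for these scores. As a sketch under that caveat, pass@1 can be estimated by averaging, over problems, the fraction of sampled completions that are correct. The `results` mapping and its problem IDs below are hypothetical.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean over problems of the fraction of correct samples for each problem."""
    return mean(sum(samples) / len(samples) for samples in results.values())


# Hypothetical per-problem correctness flags, one flag per sampled completion.
results = {
    "problem_a": [True, True, False, True],
    "problem_b": [False, False, False, False],
    "problem_c": [True, True, True, True],
}

score = pass_at_1(results)
print(f"pass@1 = {score:.2f}")
```

With a single completion per problem this reduces to plain accuracy.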
## 🗂️ Filtering procedure
The text was extracted from PDFs using [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct). Tasks, along with their ground-truth and verifiable (numeric) answers, were extracted via LLM calls.
We filtered out invalid questions using an LLM based on the following criteria:
- Tasks requiring multiple answers
- Tasks without a single correct answer
- Theorem-like tasks where the main goal is proving a statement, making automatic verification non-trivial
- Tasks with non-numeric answers, to simplify answer comparison
- Tasks that cannot be solved without access to an accompanying image
Next, we removed tasks of moderate difficulty where Qwen3-8B achieved a 100% pass@16 rate, as they offer limited value for benchmarking reasoning.
Finally, both the questions and the verifiable answers were manually reviewed by assessors to ensure consistency with the original sources.
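
The difficulty filter can be sketched as follows. This is an illustration and not the authors' pipeline: it reads "100% pass@16" as "all 16 sampled attempts were correct", which is an assumption, and the `attempts` mapping and task IDs are hypothetical.

```python
def keep_task(sample_correct: list[bool]) -> bool:
    """Keep a task unless every sampled attempt solved it (assumed reading of 100% pass@16)."""
    if not sample_correct:
        return True  # no attempts recorded, so there is no evidence the task is easy
    return not all(sample_correct)


# Hypothetical correctness flags for 16 attempts per task by the filter model.
attempts = {
    "t1": [True] * 16,            # solved every time: dropped
    "t2": [True] * 15 + [False],  # missed once: kept
    "t3": [False] * 16,           # never solved: kept
}

kept = [task_id for task_id, flags in attempts.items() if keep_task(flags)]
print(kept)
```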
## 🛠️ How to use
Add the following system prompt so that the model returns its final answer inside `\boxed{}`, which makes the answer easier to parse:
```
Решите следующую математическую задачу эффективно и ясно. Последняя строка вашего ответа должна иметь следующий формат:
'Таким образом, окончательный ответ: $\boxed{ОТВЕТ}$.' (без кавычек), где ОТВЕТ - это просто окончательное число или выражение, решающее задачу.
Думайте шаг за шагом перед ответом.
```
You can then use the following code snippet with the math_verify library to compare mathematical expressions:
```python
from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig


def accuracy_reward(completion: str, solution: str) -> float:
    """Reward function that checks if the completion matches the ground truth."""
    # parse the gold solution (assumed to always succeed)
    gold_parsed = parse(solution, extraction_mode="first_match")

    # parse the model's completion with explicit LaTeX extraction settings
    answer_parsed = parse(
        completion,
        extraction_config=[
            LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                    nits=False,
                    malformed_operators=False,
                    basic_latex=True,
                    equations=True,
                    boxed="all",
                    units=True,
                )
            )
        ],
        extraction_mode="first_match",
    )

    # verify and return binary reward; on error, print and give 0.0
    try:
        return float(verify(gold_parsed, answer_parsed))
    except Exception as e:
        print(f"verify failed: {e}, answer: {answer_parsed}, gold: {gold_parsed}")
        return 0.0
```
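
For debugging it can help to see which string the system prompt asks the model to box. The standard-library sketch below pulls out the contents of the last `\boxed{...}` by counting braces. It is not how `math_verify` extracts answers, and the helper name `last_boxed` is hypothetical.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for j in range(begin, len(text)):
        ch = text[j]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[begin:j]
    return None  # braces never closed


completion = r"Таким образом, окончательный ответ: $\boxed{\frac{3}{4}}$."
print(last_boxed(completion))
```

Brace counting is used instead of a regular expression because answers such as `\frac{3}{4}` contain nested braces.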
Provider: maas
Created: 2025-07-19



