T-math

Name: T-math
Creator: maas
Published: 2025-12-05 16:42:32
License: No description available

Source: ModelScope; updated 2025-12-05; indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/t-tech/T-math


Description:

# 🧮 T-Math

**T-Math** is a dataset of Russian math olympiad problems created to assess the reasoning capabilities of large language models (LLMs) in mathematics. It includes 331 problems from the [All-Russian School Olympiad](https://vos.olimpiada.ru/) and the [Moscow Olympiad](https://mos.olimpiada.ru) for high school students, covering the period from 1998 to 2025. The tasks and their ground-truth answers were extracted automatically and subsequently verified by human assessors.

Key features:

- Challenging problems that require multi-step reasoning (median completion length for Qwen3-32B is 16K tokens), sourced from top-tier Russian olympiads
- Easily verifiable: answers are numeric-only and checked using the `math_verify` library to compare mathematical expressions
- Not yet saturated, even by frontier reasoning models such as Gemini 2.5 Pro and DeepSeek R1
- Contains 331 samples, making it the largest Russian olympiad-level math benchmark and more statistically robust than smaller datasets such as the 30-sample AIME benchmark

## 📊 Evaluation Results

|Model|pass@1|
|--|--|
|o4-mini-high|**0.73**|
|DeepSeek-R1-0528|<ins>0.71</ins>|
|Gemini-2.5-Pro|0.70|
|Claude Sonnet 4|0.56|
|T-pro-it-2.0|0.54|
|Qwen3-32B|0.53|

## 🗂️ Filtering procedure

The text was extracted from PDFs using [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct). Tasks, along with their ground-truth and verifiable (numeric) answers, were extracted via LLM calls.

We filtered out invalid questions using an LLM based on the following criteria:

- Tasks requiring multiple answers
- Tasks without a single correct answer
- Theorem-like tasks where the main goal is proving a statement, making automatic verification non-trivial
- Tasks with non-numeric answers, to simplify answer comparison
- Tasks that cannot be solved without access to an accompanying image

Next, we removed tasks of moderate difficulty where Qwen3-8B achieved a 100% pass@16 rate, as they offer limited value for benchmarking reasoning.

Finally, both the questions and the verifiable answers were manually reviewed by assessors to ensure consistency with the original sources.

## 🛠️ How to use

Add the following system prompt to guide the model to return the final answer in a `\boxed{}` tag, making it easier to parse:

```
Решите следующую математическую задачу эффективно и ясно. Последняя строка вашего ответа должна иметь следующий формат: 'Таким образом, окончательный ответ: $\boxed{ОТВЕТ}$.' (без кавычек), где ОТВЕТ - это просто окончательное число или выражение, решающее задачу. Думайте шаг за шагом перед ответом.
```

You can then use the following code snippet with the `math_verify` library to compare mathematical expressions:

```python
from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig


def accuracy_reward(completion: str, solution: str) -> float:
    """Reward function that checks if the completion matches the ground truth."""
    # parse the gold solution (assumed to always succeed)
    gold_parsed = parse(solution, extraction_mode="first_match")

    # parse the model's completion with the same LaTeX extraction settings
    answer_parsed = parse(
        completion,
        extraction_config=[
            LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                    nits=False,
                    malformed_operators=False,
                    basic_latex=True,
                    equations=True,
                    boxed="all",
                    units=True,
                )
            )
        ],
        extraction_mode="first_match",
    )

    # verify and return binary reward; on error, print and give 0.0
    try:
        return float(verify(gold_parsed, answer_parsed))
    except Exception as e:
        print(f"verify failed: {e}, answer: {answer_parsed}, gold: {gold_parsed}")
        return 0.0
```
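The card reports pass@1 scores and a pass@16 filter but does not state how these rates were estimated. The sketch below is one plausible way to aggregate binary rewards such as those returned by `accuracy_reward`; it assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n sampled completions of which c are correct, and a simple binomial standard error to illustrate the sample-size point. The helper names `pass_at_k`, `mean_pass_at_k`, and `binomial_standard_error` are hypothetical and are not part of `math_verify` or the dataset.

```python
from math import comb, sqrt


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no correct one.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; counts holds one (n, c) pair per task."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


def binomial_standard_error(p: float, num_tasks: int) -> float:
    """Standard error of an accuracy p measured on num_tasks independent tasks."""
    return sqrt(p * (1.0 - p) / num_tasks)


if __name__ == "__main__":
    # Hypothetical per-task results: (samples drawn, samples judged correct).
    demo_counts = [(16, 16), (16, 4), (16, 0)]
    print("pass@1 :", mean_pass_at_k(demo_counts, k=1))
    print("pass@16:", mean_pass_at_k(demo_counts, k=16))

    # A task is "always solved" under the pass@16 filter only if all 16 samples pass.
    always_solved = [i for i, (n, c) in enumerate(demo_counts) if c == n]
    print("tasks a 100% filter would drop:", always_solved)

    # Same accuracy, different benchmark sizes.
    print("SE at n=331:", binomial_standard_error(0.70, 331))
    print("SE at n=30 :", binomial_standard_error(0.70, 30))
```

Under these assumptions, an accuracy near 0.70 carries a standard error of roughly 0.025 on 331 tasks versus roughly 0.084 on 30 tasks, which is one way to read the card's claim about statistical robustness.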


Provider:

maas

Created:

2025-07-19
