
T-math

ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/t-tech/T-math
Resource description:
# 🧮 T-Math

**T-Math** is a dataset of Russian math olympiad problems created to assess the reasoning capabilities of large language models (LLMs) in mathematics. It includes 331 problems from the [All-Russian School Olympiad](https://vos.olimpiada.ru/) and the [Moscow Olympiad](https://mos.olimpiada.ru) for high school students, covering the period from 1998 to 2025. The tasks and their ground-truth answers were extracted automatically and subsequently verified by human assessors.

Key features:

- Challenging problems that require multi-step reasoning (median completion length for Qwen3-32B is 16K tokens), sourced from top-tier Russian olympiads
- Easily verifiable: answers are numeric-only and checked using the `math_verify` library to compare mathematical expressions
- Not yet saturated, even by frontier reasoning models such as Gemini 2.5 Pro and DeepSeek R1
- Contains 331 samples, the largest Russian math olympiad-level benchmark, making it more statistically robust than smaller datasets like the 30-sample AIME benchmark

## 📊 Evaluation Results

| Model | pass@1 |
|--|--|
| o4-mini-high | **0.73** |
| DeepSeek-R1-0528 | <ins>0.71</ins> |
| Gemini-2.5-Pro | 0.70 |
| Claude Sonnet 4 | 0.56 |
| T-pro-it-2.0 | 0.54 |
| Qwen3-32B | 0.53 |

## 🗂️ Filtering procedure

The text was extracted from PDFs using [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct). Tasks, along with their ground-truth and verifiable (numeric) answers, were extracted via LLM calls.

We filtered out invalid questions using an LLM based on the following criteria:

- Tasks requiring multiple answers
- Tasks without a single correct answer
- Theorem-like tasks where the main goal is proving a statement, making automatic verification non-trivial
- Tasks with non-numeric answers, to simplify answer comparison
- Tasks that cannot be solved without access to an accompanying image

Next, we removed tasks of moderate difficulty where Qwen3-8B achieved a 100% pass@16 rate, as they offer limited value for benchmarking reasoning.

Finally, both the questions and the verifiable answers were manually reviewed by assessors to ensure consistency with the original sources.

## 🛠️ How to use

Add the following system prompt to guide the model to return the final answer in a `\boxed{}` tag, making it easier to parse:

```
Решите следующую математическую задачу эффективно и ясно. Последняя строка вашего ответа должна иметь следующий формат: 'Таким образом, окончательный ответ: $\boxed{ОТВЕТ}$.' (без кавычек), где ОТВЕТ - это просто окончательное число или выражение, решающее задачу. Думайте шаг за шагом перед ответом.
```

You can then use the following code snippet with the `math_verify` library to compare mathematical expressions:

```python
from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig


def accuracy_reward(completion: str, solution: str) -> float:
    """Reward function that checks if the completion matches the ground truth."""
    # parse the gold solution (assumed to always succeed)
    gold_parsed = parse(solution, extraction_mode="first_match")

    # parse the model's completion with the same LaTeX extraction settings
    answer_parsed = parse(
        completion,
        extraction_config=[
            LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                    nits=False,
                    malformed_operators=False,
                    basic_latex=True,
                    equations=True,
                    boxed="all",
                    units=True,
                )
            )
        ],
        extraction_mode="first_match",
    )

    # verify and return binary reward; on error, print and give 0.0
    try:
        return float(verify(gold_parsed, answer_parsed))
    except Exception as e:
        print(f"verify failed: {e}, answer: {answer_parsed}, gold: {gold_parsed}")
        return 0.0
```
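The reward function scores one completion at a time. As a rough sketch of how such binary rewards might be aggregated, the block below averages them into a pass@1 score and flags problems that every sampled completion solved. The helper names `pass_at_1` and `is_saturated` and the toy rewards are hypothetical. The card does not say how many completions per problem were sampled for the reported pass@1, and reading "100% pass@16" as "all 16 samples correct" is an assumption.

```python
from statistics import mean
from typing import List, Sequence


def pass_at_1(rewards_per_problem: Sequence[Sequence[float]]) -> float:
    """Mean over problems of the per-problem fraction of correct completions."""
    if not rewards_per_problem:
        raise ValueError("no problems to score")
    per_problem = [mean(rewards) for rewards in rewards_per_problem]
    return mean(per_problem)


def is_saturated(rewards: Sequence[float]) -> bool:
    """True when every sampled completion was correct (assumed reading of
    a 100% pass rate over the samples; not confirmed by the dataset card)."""
    return len(rewards) > 0 and all(r == 1.0 for r in rewards)


# Toy rewards (made up): 3 problems, 4 sampled completions each.
toy_rewards: List[List[float]] = [
    [1.0, 1.0, 1.0, 1.0],  # always solved
    [1.0, 0.0, 0.0, 1.0],  # solved half of the time
    [0.0, 0.0, 0.0, 0.0],  # never solved
]

score = pass_at_1(toy_rewards)
kept = [r for r in toy_rewards if not is_saturated(r)]
```

In practice each inner list would be filled by calling `accuracy_reward` on the sampled completions for one problem.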

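The key features claim that 331 samples make the benchmark more statistically robust than a 30-sample set such as AIME. One way to illustrate this is the binomial standard error of an accuracy estimate, which shrinks with the square root of the number of problems. The sketch below assumes problems are independent and equally weighted, and uses an illustrative accuracy of 0.7; the function name `binomial_standard_error` is hypothetical and this calculation does not come from the dataset authors.

```python
import math


def binomial_standard_error(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy estimated from n independent problems."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# Illustrative accuracy of 0.7 on a 331-problem and a 30-problem benchmark.
se_tmath = binomial_standard_error(0.7, 331)
se_aime = binomial_standard_error(0.7, 30)

# The ratio depends only on the sample sizes: sqrt(331 / 30).
ratio = se_aime / se_tmath
```

Under these assumptions the standard error is roughly 0.025 for 331 problems versus roughly 0.084 for 30, so the smaller benchmark is about 3.3 times noisier at the same accuracy.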
Provider:
maas
Created:
2025-07-19