tokyotech-llm/s1-test-time-scaling-synth-public

Name: tokyotech-llm/s1-test-time-scaling-synth-public
Creator: tokyotech-llm
Published: 2026-02-19 10:54:25
License: No description available

Hugging Face | Updated 2026-02-19 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/tokyotech-llm/s1-test-time-scaling-synth-public

Description:

---
license: other
license_name: mixed
license_link: >-
  https://huggingface.co/datasets/simplescaling/data_ablation_full59K/blob/main/README.md
task_categories:
- text-generation
language:
- ja
- en
pretty_name: s1-test-time-scaling-synth
size_categories:
- 10K<n<100K
source_datasets:
- simplescaling/data_ablation_full59K
viewer: true
dataset_info:
  features:
  - name: solution
    dtype: string
  - name: question
    dtype: string
  - name: cot_type
    dtype: string
  - name: source_type
    dtype: string
  - name: metadata
    dtype: string
  - name: cot
    dtype: 'null'
  - name: thinking_trajectories
    sequence: string
  - name: attempt
    dtype: string
  - name: huggingface_id
    dtype: string
  - name: huggingface_subset
    dtype: string
  - name: question_ja_by_gpt-oss
    sequence: string
  - name: question_ja_by_gpt-oss_gemba_score
    sequence: int64
  - name: translated_question
    dtype: string
  - name: translated_question_gemba_mqm_score
    dtype: int64
  - name: answer
    dtype: string
  - name: answerable
    dtype: bool
  - name: id
    dtype: string
configs:
- config_name: v202512
  data_files:
  - split: train
    path: annotated/data_ablation_full59K_synthesized_v202512.jsonl.gz
---

# s1-test-time-scaling-synth: Japanese and English Reinforcement Learning Dataset Derived from the s1 Simple Test-Time Scaling Dataset

This repository contains s1-test-time-scaling-synth, a reinforcement learning dataset in Japanese and English.

This dataset is built upon the supervised fine-tuning dataset [simplescaling/data_ablation_full59K](https://huggingface.co/datasets/simplescaling/data_ablation_full59K) (hereafter, the "original dataset"), originally developed in "s1: Simple test-time scaling" [[Muennighoff+, EMNLP25]](https://aclanthology.org/2025.emnlp-main.1025/). The original dataset is a compilation of existing datasets covering mathematics, science, and code generation tasks, and was reported to be effective in distilling reasoning traces generated using frontier reasoning models (Gemini).
Motivated by their report, we curated the original dataset so that it can be used for reinforcement learning with verifiable rewards (RLVR), where problem statements can be given in both Japanese and English to investigate the language specificity of RLVR training. Specifically, we performed the following annotations and modifications:

* Translation of original problem statements into Japanese
* Extraction of ground-truth solutions in an "RLVR-ready" format
* Annotation of answerability

We applied best-of-N translation using [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b). Specifically, we generated eight translation candidates in a zero-shot setting, followed by rejection sampling based on translation quality self-assessed with GEMBA-MQM [[Kocmi and Federmann, EAMT23](https://aclanthology.org/2023.eamt-1.19/)]. We primarily extracted short ground-truth answers from the `metadata` field in the original dataset. We then annotated answerability for each problem by considering the problem type, the availability of a ground-truth answer, translation quality, etc. We also conducted a light quantitative validation of answerability for reference.

Since the original dataset is sourced from a wide variety of datasets ([Muennighoff+, EMNLP25] Appendix D.2), users should carefully confirm the licenses, commercial use, and potential benchmark leakage when using this dataset for developing LLMs. To assist users' investigation, we have attached the source information of the original dataset we investigated at the end of this document.

## Usage of the s1-test-time-scaling-synth dataset

You can load the dataset using the `datasets` package.

```
from datasets import load_dataset
dataset = load_dataset("tokyotech-llm/s1-test-time-scaling-synth-public", "v202512", split="train")
```

For reinforcement learning, use `question` (English) or `translated_question` (Japanese) as the problem statement. To verify the solution, use `answer` as the ground truth.
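
As a rough illustration of how these fields could feed an RLVR loop, the sketch below picks the prompt language and scores a model output against `answer`. The helper names (`normalize_answer`, `build_prompt`, `exact_match_reward`) and the normalization rules are assumptions made for this example and are not part of the dataset or its tooling. Plain string matching is also only a stand-in: a practical verifier for math answers would likely need an equivalence check that understands expressions.

```python
import re


def normalize_answer(text: str) -> str:
    """Strip outer whitespace and `$`, collapse inner whitespace, lowercase.

    Hypothetical normalization chosen for illustration only.
    """
    text = text.strip().strip("$").strip()
    return re.sub(r"\s+", " ", text).lower()


def build_prompt(record: dict, lang: str = "ja") -> str:
    """Return the Japanese (`translated_question`) or English (`question`) statement."""
    return record["translated_question"] if lang == "ja" else record["question"]


def exact_match_reward(prediction: str, record: dict) -> float:
    """Return 1.0 when the normalized prediction equals the normalized `answer`.

    Records without a ground-truth answer always score 0.0.
    """
    answer = record.get("answer")
    if answer is None:
        return 0.0
    return float(normalize_answer(prediction) == normalize_answer(answer))
```

Each element of the loaded `dataset` is a plain dictionary, so a call such as `exact_match_reward(model_output, dataset[0])` should work, where `model_output` is whatever final answer string your policy produced.
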
Some problems may be unanswerable (e.g., missing ground truth or containing critical translation errors). You can exclude these potentially unanswerable problem statements by filtering the records using `answerable==True`.

```
subset = dataset.filter(lambda x: x["answerable"])
print(len(dataset), len(subset))
```

## Dataset format

The number of records is 58,986, which is identical to the original dataset.

As we extended the original dataset by adding new fields for annotations and modifications, the dataset fields are split into (i) fields inherited from the original dataset and (ii) fields added in our work.

### Fields inherited from the original dataset

The following fields are provided as-is from the original dataset.

* id
  * Unique ID of the problem.
* question
  * Problem statement. A single record may contain multiple questions.
* solution
  * The ground-truth answer and its explanation. The format varies depending on the source; mathematical problems typically include derivation steps.
* cot_type
  * The Chain-of-Thought prompt type used by the original authors to generate Gemini reasoning traces.
* source_type
  * Hugging Face Dataset ID and subset name of the source dataset (e.g., `qfq/openaimath/Intermediate Algebra`).
* metadata
  * Metadata obtained from the source dataset, serialized as a JSON string.
  * You can restore it as a dictionary with `json.loads(metadata)`.
* metadata.answer
  * The short answer, presumed to originate from the source dataset. It may be NULL depending on the source type. Refer to the appendix for details.
* cot
  * Always NULL.
* thinking_trajectories
  * Reasoning traces generated by Gemini.
* attempt
  * Answer generated by Gemini.

### Fields added in this work

* question_ja_by_gpt-oss
  * Candidate translations (up to 8 samples) generated using gpt-oss-120b.
* question_ja_by_gpt-oss_gemba_score
  * Translation quality scores for candidate translations, self-assessed using gpt-oss-120b with the GEMBA-MQM method. Range: [-25, 0].
* translated_question
  * The Japanese translation with the best translation quality score.
* translated_question_gemba_mqm_score
  * GEMBA-MQM score of the `translated_question`.
* huggingface_id
  * Hugging Face Dataset ID of the source dataset.
* huggingface_subset
  * Subset name of the source dataset.
* answer
  * Short ground-truth answer extracted from the original fields such as `metadata.answer`, `solution`, etc. Refer to the appendix for details.
* answerable
  * Boolean value indicating whether the ground-truth answer can be derived from both `translated_question` and `question`.

## License Information

Different licenses apply to different parts of the dataset.

* Problems, ground-truth answers, and fields inherited from the original dataset
  * These are subject to the license of the source dataset from which each problem is derived.
* Reasoning traces and answers generated by Gemini
  * These are subject to the Google APIs Terms of Service and the Gemini API Additional Terms of Service:
    * https://developers.google.com/terms
    * https://ai.google.dev/gemini-api/terms
* Japanese translations of the problem statements
  * These are subject to both:
    * the license of the source dataset, and
    * the Apache License 2.0 under which OpenAI's gpt-oss-120b (used for translation) is released.
      * https://huggingface.co/openai/gpt-oss-120b

## Acknowledgments

We gratefully acknowledge Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candes, and Tatsunori Hashimoto, the authors of the original s1 paper [[Muennighoff+, EMNLP25]](https://aclanthology.org/2025.emnlp-main.1025/).

This work is based on results obtained from AIST policy-based budget project "R&D on Generative AI Foundation Models for the Physical Domain". This work is based on results obtained from a project, JPNP18002, commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
We used ABCI 3.0 provided by AIST and AIST Solutions with support from "ABCI 3.0 Development Acceleration Use". This study was carried out using the TSUBAME4.0 supercomputer at Institute of Science Tokyo.

This work is based on OpenAI's gpt-oss. We acknowledge and thank OpenAI for their contributions and the release of these models.

# Appendix

## A.1 Details of the problem statements translation

We used gpt-oss-120b to perform best-of-N translation (generation of candidate translations and translation quality assessment). Specifically, we used the generation parameters `reasoning_effort=medium` and `temperature=1.0`. We optimized the translation instruction (prompt) using the GEPA algorithm ([Agrawal+, ICLR26](https://openreview.net/forum?id=RQm2KQTM5r)) within the [DSPy](https://dspy.ai/) framework.

## A.2 Annotation of short ground-truth answers

We primarily obtained short ground-truth answers from metadata fields such as `metadata.answer`, `metadata.label`, and `metadata.final_answer`. For source datasets that do not provide ground-truth answers in metadata, we heuristically extracted short answers from `solution` and/or `attempt`.

## A.3 Annotation of answerability

The `answerable` label was assigned according to the following criteria. We mark proof-based problems as unanswerable (`False`), since they are generally difficult to evaluate using short answers.

* `False` for math proof problems
* `False` if the translation quality score (GEMBA-MQM) is ≤ -10
* `False` if short ground-truth answer extraction/annotation fails
* `False` if the short ground-truth answer is `\\blacksquare`
* `True` otherwise

As a result, 43,351 samples were annotated with `answerable=True`.

## A.4 Verification of the answerability

To quantitatively validate answerability, that is, whether the annotated ground-truth answer can be derived from the (translated) problem statement, we evaluated the accuracy of a frontier LLM using problem statements and ground-truth answers.
Specifically, we randomly sampled 100 problems from each of three source datasets, and measured GPT-5 (`gpt-5-2025-08-07`) accuracy when prompted with the Japanese (`translated_question`) and English (`question`) versions of the problem statements.

The results are as follows (accuracy in %):

|source dataset|Ja|En|
|--|--:|--:|
|baber/agieval/logiqa|77|78|
|KbsdJames/Omni-MATH|81|88|
|AI-MO/NuminaMath-CoT/aops_forum|80|85|

Across all three source datasets, we confirmed that the accuracy is around or above 80%, and that the difference between Japanese and English prompts is small.

## A.5 Source information of the original dataset

The problems included in the original dataset are sourced from 16 existing datasets. To help users identify the source dataset for each problem, we extended Table 6 in [[Muennighoff+, EMNLP25]](https://aclanthology.org/2025.emnlp-main.1025/) and investigated:

1. The correspondence between each source dataset and `huggingface_id` field value.
2. A brief description of each dataset and the number of samples.
3. The license.
4. Whether proprietary LLMs (e.g., GPT-4) were used to generate or edit any part of the problems (problem statements, reasoning traces, ground-truths, etc.).

We have made reasonable efforts to investigate the source datasets. However, we do not guarantee the accuracy, completeness, or legal correctness of these determinations. This statement does not constitute legal advice and should not be relied upon as such. If there is any discrepancy between this document and the paper, the paper's description takes precedence.
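
One way to cross-check the `huggingface_id` correspondence and the per-source sample counts listed in this section is to tally records by `huggingface_id`. The sketch below is a minimal illustration that works on any iterable of record dictionaries. The helper names `count_by_source` and `find_count_mismatches` are hypothetical and not part of the dataset's tooling.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_source(records: Iterable[Mapping]) -> Counter:
    """Count how many records carry each `huggingface_id` value."""
    return Counter(record["huggingface_id"] for record in records)


def find_count_mismatches(
    observed: Mapping[str, int], expected: Mapping[str, int]
) -> dict:
    """Return {source: (observed, expected)} for sources whose counts differ.

    Sources missing from `observed` are treated as having zero records.
    """
    return {
        source: (observed.get(source, 0), count)
        for source, count in expected.items()
        if observed.get(source, 0) != count
    }
```

With the dataset loaded as shown in the usage section, `find_count_mismatches(count_by_source(dataset), {"TIGER-Lab/TheoremQA": 747})` should return an empty dictionary if the loaded data agrees with the count documented for that source.
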

|Source|Description|# Samples (Paper)|# Samples (dataset)|HF Dataset ID (`huggingface_id`)|License|Proprietary LLMs Usage|
|:--|:--|:--|:--|:--|:--|:--|
|NuminaMATH (LI et al., 2024)|Math problems from online websites|30,660|30,658|AI-MO/NuminaMath-CoT|Apache License 2.0|GPT-4, reasoning traces|
|MATH (Hendrycks et al., 2021)|Math problems from competitions|11,999|11,958|qfq/openaimath|MIT|No|
|OlympicArena (Huang et al., 2024a)|Astronomy, Biology, Chemistry, Computer Science, Geography, Math, and Physics olympiad questions|4,250|4,250|GAIR/OlympicArena|CC BY-NC-SA 4.0|GPT-4, difficulty and correctness check|
|OmniMath (Gao et al., 2024a)|Math problems from competitions|4,238|4,238|KbsdJames/Omni-MATH|Apache License 2.0|GPT-4o, difficulty annotation|
|AGIEval (Zhong et al., 2023; Ling et al., 2017; Hendrycks et al., 2021; Liu et al., 2020; Zhong et al., 2019; Wang et al., 2021)|English, Law, Logic and Math problems from the SAT, LSAT and other exams|2,385|2,385|baber/agieval|Mixed (MIT for code; original exams may be copyrighted)|No|
|xword|Crossword puzzles|999|999|0xharib/xword1|Proprietary (including New York Times, etc.)|No|
|OlympiadBench (He et al., 2024b)|Math and Physics olympiad questions|896|896|Hothan/OlympiadBench|Unknown|No|
|AIME (1983-2021)|American Invitational Mathematics Examination|890|890|qq8933/AIME_1983_2024|Proprietary (Mathematical Association of America, AoPS)|No|
|TheoremQA (Chen et al., 2023)|Computer Science, Finance, Math, and Physics university-level questions relating to theorems|747|747|TIGER-Lab/TheoremQA|MIT|No|
|USACO (Shi et al., 2024)|Code problems from the USA Computing Olympiad|519|519|codegenning/usacobench_formatted|Proprietary (USA Computing Olympiad)|GPT family, reference implementation|
|JEEBench (Arora et al., 2023)|Chemistry, Math, and Physics problems used in the university entrance examination of the Indian Institute of Technology|515|515|daman1209arora/jeebench|MIT|GPT-3.5/4, solution (not ground-truth answer)|
|GPQA (Rein et al., 2023)|PhD-Level Science Questions|348|348|Idavidrein/gpqa|MIT|No|
|SciEval (Sun et al., 2024)|Biology, Chemistry, and Physics problems from various sources|227|227|OpenDFM/SciEval|CC BY 4.0|No|
|s1-prob|Stanford statistics qualifying exams|182|182|qfq/stats_qual|Apache License 2.0|No|
|s1-teasers|Math brain-teasers crawled from the Internet|23|23|qfq/quant|Apache License 2.0|No|
|LiveCodeBench (Jain et al., 2024)|Code problems from coding websites (LeetCode, AtCoder, and CodeForces)|151|151|LiveCodeBench/release_v[1,2,3]|Proprietary (LeetCode, AtCoder, CodeForces, etc.)|No|

End of document
