tokyotech-llm/Swallow-Nemotron-Post-Training-Dataset-v1
收藏Hugging Face2026-02-21 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/tokyotech-llm/Swallow-Nemotron-Post-Training-Dataset-v1
下载链接
链接失效反馈官方服务:
资源简介:
---
configs:
- config_name: Nemotron-Post-Training-Dataset-v1
data_files:
- split: code
path: >-
code/nemotron-post-training-v1-code-deepseek-r1-model-identity-chat-gpt.jsonl
- split: math
path: >-
math/nemotron-post-training-v1-math-deepseek-r1-model-identity-chat-gpt.jsonl
- split: stem
path: >-
stem/nemotron-post-training-v1-science-deepseek-r1-model-identity-chat-gpt.jsonl
- config_name: GPT-OSS-Nemotron-Post-Training-Dataset-v1
data_files:
- split: code
path: code/nemotron-post-training-v1-code-gpt-oss-model-identity-chat-gpt.jsonl
- split: math
path: math/nemotron-post-training-v1-math-gpt-oss-model-identity-chat-gpt.jsonl
- split: stem
path: >-
stem/nemotron-post-training-v1-science-gpt-oss-model-identity-chat-gpt.jsonl
- config_name: GPT-OSS-Nemotron-Post-Training-Dataset-v1-Ja
data_files:
- split: code
path: >-
code/nemotron-post-training-v1-ja-code-gpt-oss-model-identity-chat-gpt.jsonl
- split: math
path: >-
math/nemotron-post-training-v1-ja-math-gpt-oss-model-identity-chat-gpt.jsonl
- split: stem
path: >-
stem/nemotron-post-training-v1-ja-science-gpt-oss-model-identity-chat-gpt.jsonl
- config_name: GPT-OSS-Nemotron-Post-Training-Dataset-v1-Ja-202601
data_files:
- split: code
path: >-
code/nemotron-post-training-v1-ja-code-gpt-oss-model-identity-chat-gpt_v202601.jsonl.gz
- split: math
path: >-
math/nemotron-post-training-v1-ja-math-gpt-oss-model-identity-chat-gpt_v202601.jsonl.gz
- split: stem
path: >-
stem/nemotron-post-training-v1-ja-stem-gpt-oss-model-identity-chat-gpt_v202601.jsonl.gz
- config_name: GPT-OSS-High-Nemotron-Post-Training-Dataset-v1
data_files:
- split: code
path: >-
code/nemotron-post-training-v1-code-gpt-oss-model-identity-chat-gpt-reasoning-effort-high.jsonl
- split: math
path: >-
math/nemotron-post-training-v1-math-gpt-oss-model-identity-chat-gpt-reasoning-effort-high.jsonl
- split: stem
path: >-
stem/nemotron-post-training-v1-science-gpt-oss-model-identity-chat-gpt-reasoning-effort-high.jsonl
- config_name: Nemotron-Post-Training-Dataset-v1-No-Thinking-Trajectory
data_files:
- split: code
path: >-
code/nemotron-post-training-v1-code-deepseek-r1-no-thinking-trajectory.jsonl
- split: math
path: >-
math/nemotron-post-training-v1-math-deepseek-r1-no-thinking-trajectory.jsonl
- split: stem
path: >-
stem/nemotron-post-training-v1-science-deepseek-r1-no-thinking-trajectory.jsonl
license: cc-by-4.0
task_categories:
- text-generation
language:
- en
- ja
tags:
- code
- math
- stem
---
# Swallow-Nemotron-Post-Training-Dataset-v1
The [Swallow LLM Project](https://swallow-llm.github.io/index.en.html) constructed the **Swallow-Nemotron-Post-Training-Dataset-v1** based on the math, code, and stem subsets of the [NVIDIA Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1), as illustrated in the figure below.
<img src="https://huggingface.co/datasets/tokyotech-llm/Swallow-Nemotron-Post-Training-Dataset-v1/resolve/main/Swallow-Nemotron-Post-Training-v1.png" width="500">
## Dataset Construction
The original Thinking Trajectories and Assistant Outputs in the Nemotron-Post-Training-Dataset-v1 were synthesized using DeepSeek-R1-0528.
However, we identified an issue with the Thinking Trajectories being overly verbose.
To resolve this, we constructed our own Reasoning SFT dataset by re-synthesizing the Thinking Trajectories and Assistant Outputs corresponding to the User Inputs using [GPT-OSS-120B](https://huggingface.co/openai/gpt-oss-120b) (with reasoning_effort set to medium).
During this generation process, we filtered out outputs that did not comply with the GPT-OSS [chat template rules](https://github.com/openai/harmony) (such as unclosed Thinking Trajectory tags).
We extracted only the valid examples to build the final Swallow-Nemotron-Post-Training-Dataset-v1.
## v1-Ja-202601 Subset
The subset `GPT-OSS-Nemotron-Post-Training-Dataset-v1-Ja-202601` is a variant of v1-Ja that includes response language annotations in the `response_language` field.
v1-Ja contains responses in English, which may cause an off-target issue (i.e., the model responding in English to Japanese instructions) when used for training as is. To address this concern, we annotated the response language so users can filter out non-Japanese responses by specifying `response_language == "ja"`. Users may also include "UNK" to retain short responses, which frequently appear in the stem split.
For language detection, we first preprocessed the responses by removing Markdown, LaTeX, and code snippets, and then applied [Compact Language Detector v3 (CLD3)](https://github.com/google/cld3).
## Handling of Japanese
To ensure the model returns Japanese Assistant Outputs in response to Japanese User Inputs, we translated the User Inputs from the original dataset into Japanese using GPT-OSS-120B.
We then used these translated inputs to generate the new Thinking Trajectories and Assistant Outputs.
Although we used prompts to guide GPT-OSS-120B to respond in Japanese, the resulting Thinking Trajectories ended up being a mix of English and Japanese.
While we successfully localized the Assistant Outputs, we were unable to fully translate the Thinking Trajectories into Japanese. We consider this an area for future improvement.
> [!NOTE]
> However, observing that commercial models like ChatGPT, Claude, and Gemini can flawlessly return Japanese responses even when their internal thinking trajectories are in English, we concluded that having strictly Japanese thinking trajectories is not an absolute necessity. Therefore, we chose not to force the Japanese localization of the Thinking Trajectories for this release.
## Release History
- **Feb 19, 2026**: Released Swallow Nemotron Post Training Dataset v1.
## Acknowledgements
We thank the OpenAI Team for releasing GPT-OSS under a generous open license.
<!-- AIST -->
This work is based on results obtained from AIST policy-based budget project "R&D on Generative AI Foundation Models for the Physical Domain".
<!-- NII -->
This work was supported by the “R&D Hub Aimed at Ensuring Transparency and Reliability of Generative AI Models” project of the Ministry of Education, Culture, Sports, Science and Technology.
<!-- ABCI -->
We used ABCI 3.0 provided by AIST and AIST Solutions with support from "ABCI 3.0 Development Acceleration Use".
<!-- TSUBAME -->
This study was carried out using the TSUBAME4.0 supercomputer at Institute of Science Tokyo.
## Intended use
The Swallow Nemotron Post-Training Dataset is intended to be used by the research community to continue to improve open models. The data may be freely used to train and evaluate.
## License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.
## Authors
[Swallow LLM](https://swallow-llm.github.io/index.en.html)
## How to cite
If you find our work helpful, please feel free to cite these papers.
**The Qwen3-Swallow and GPT-OSS-Swallow Technical Paper (Training Details) will be released in March.**
### References
[NVIDIA, 2025] NVIDIA. [Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1).
提供机构:
tokyotech-llm



