nomis92/Llama-Nemotron-German
收藏Hugging Face2025-11-25 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/nomis92/Llama-Nemotron-German
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- de
size_categories:
- 100K<n<1M
tags:
- reasoning
dataset_info:
features:
- name: input
dtype: string
- name: output
dtype: string
- name: reasoning_trace
dtype: string
- name: output_generated
dtype: string
splits:
- name: science
num_bytes: 7969548477
num_examples: 708920
download_size: 3041727063
dataset_size: 7969548477
configs:
- config_name: default
data_files:
- split: science
path: data/science-*
---
# Llama-Nemotron-German
A German language dataset derived from the [NVIDIA Llama-Nemotron Post-Training Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) with reasoning traces generated through a two-step translation and reasoning pipeline.
## Dataset Overview
- **Language**: German (de)
- **Size**: 708,920 samples
- **Subset**: Science (more to be added)
- **Sources**: NVIDIA Nemotron SFT dataset
- **Task**: Question answering with reasoning traces
## Dataset Fields
Each sample contains:
- `input`: German translation of the original prompt/question
- `output`: German translation of the original reference answer
- `reasoning_trace`: Internal reasoning/thinking process (German)
- `output_generated`: Generated final answer based on reasoning (German)
- `generator`: Model used for generation
## Creation Pipeline
### Stage 1: Translation (Qwen/Qwen3-32B)
Translates original English prompts from the Nemotron dataset to German.
**Prompt Template:**
```
System: "You are a professional translator specializing in technical and educational content.
Translate the following {field_name} text into German.
CRITICAL INSTRUCTIONS:
1. Output ONLY the translated text - no explanations or meta-commentary
2. Preserve ALL technical terms, code snippets, mathematical notation, and formatting
3. Maintain the same tone, style, and formality as the original
4. Use formal German (Sie) for professional/technical content
5. For code: Keep variable/function names in English unless user-facing
6. For math: Preserve LaTeX notation and symbols unchanged
7. Adapt examples and cultural references appropriately for German audiences
8. Maintain consistent terminology throughout"
User: "TEXT TO TRANSLATE: {text}"
```
**Hyperparameters:**
- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Max tokens: 32,768
- Batch size: 1,024
### Stage 2: Reasoning Generation (mistralai/Magistral-Small-2506)
Generates reasoning traces and final answers using the translated prompts as hints.
**Prompt Template:**
```
System: "A user will ask you to solve a task. You have been provided with a solution as a hint
to guide your reasoning. Draft your thinking process (inner monologue) working through the problem
step-by-step, using the provided solution as guidance. Afterwards, write a self-contained summary
of your thoughts.
Your thinking process must follow:
<think>
Your thoughts/draft - be casual and thorough. Use the solution hint to guide your reasoning.
</think>
Here, provide a concise summary reflecting your reasoning and the final answer."
User: "Problem: {input_text}
Solution Hint: {output_hint}"
```
**Hyperparameters:**
- Temperature: 0.7
- Top-p: 0.95
- Max tokens: 40,960
- Batch size: 1,024
## Load the Dataset
```python
from datasets import load_dataset
dataset = load_dataset("nomis92/Llama-Nemotron-German", "science")
```
## Dataset Statistics
### Sample Counts
- Total items: 708,920
- Valid reasoning traces: 680,995 (96.1%)
- Failed reasoning generations: 27,925 (3.9%)
### Token Counts
| Column | Mean | Median | Total |
|--------|------|--------|-------|
| Input | 225 | 190 | 159.8M |
| Reasoning | 2,212 | 1,417 | 1,506.1M |
| Output (Original) | 399 | 364 | 282.7M |
| Output (Generated) | 588 | 165 | 405.3M |
| **Combined** | **2,922** | **1,849** | **2,071.2M** |
### Data Quality
**Language Distribution (German):**
- Input: 99.88%
- Reasoning: 99.53%
- Output (Original): 99.99%
- Output (Generated): 99.51%
**Lexical Diversity (avg):**
- Input: 6.55
- Reasoning: 10.87
- Output (Original): 8.37
- Output (Generated): 6.96
**Unique Words Ratio:**
- Input: 75.9%
- Reasoning: 39.9%
- Output (Original): 72.4%
- Output (Generated): 79.7%
## Licensing
This dataset is derived from NVIDIA's Llama-Nemotron dataset. Please refer to the [original dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) for licensing information.
## Citation
```bibtex
@dataset{gurgurov2025llamanemotrongerman,
title={Llama-Nemotron-German: A German Reasoning Dataset with Multi-Step Pipeline},
author={Gurgurov, Daniil and Ostermann, Simon},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/datasets/nomis92/Llama-Nemotron-German}},
note={Derived from NVIDIA Llama-Nemotron Post-Training Dataset with two-stage translation and reasoning generation pipeline}
}
```
## Acknowledgments
- Dataset source: [NVIDIA Llama-Nemotron Post-Training Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)
- Translation model: [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)
- Reasoning model: [mistralai/Magistral-Small-2506](https://huggingface.co/mistralai/Magistral-Small-2506)
提供机构:
nomis92



