
nomis92/Llama-Nemotron-German

Hugging Face · Updated 2025-11-25 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/nomis92/Llama-Nemotron-German
Resource description:
---
language:
- de
size_categories:
- 100K<n<1M
tags:
- reasoning
dataset_info:
  features:
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: reasoning_trace
    dtype: string
  - name: output_generated
    dtype: string
  splits:
  - name: science
    num_bytes: 7969548477
    num_examples: 708920
  download_size: 3041727063
  dataset_size: 7969548477
configs:
- config_name: default
  data_files:
  - split: science
    path: data/science-*
---

# Llama-Nemotron-German

A German-language dataset derived from the [NVIDIA Llama-Nemotron Post-Training Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset), with reasoning traces generated through a two-step translation and reasoning pipeline.

## Dataset Overview

- **Language**: German (de)
- **Size**: 708,920 samples
- **Subset**: Science (more to be added)
- **Sources**: NVIDIA Nemotron SFT dataset
- **Task**: Question answering with reasoning traces

## Dataset Fields

Each sample contains:

- `input`: German translation of the original prompt/question
- `output`: German translation of the original reference answer
- `reasoning_trace`: Internal reasoning/thinking process (German)
- `output_generated`: Generated final answer based on reasoning (German)
- `generator`: Model used for generation

## Creation Pipeline

### Stage 1: Translation (Qwen/Qwen3-32B)

Translates the original English prompts from the Nemotron dataset into German.

**Prompt Template:**

```
System: "You are a professional translator specializing in technical and educational content. Translate the following {field_name} text into German.

CRITICAL INSTRUCTIONS:
1. Output ONLY the translated text - no explanations or meta-commentary
2. Preserve ALL technical terms, code snippets, mathematical notation, and formatting
3. Maintain the same tone, style, and formality as the original
4. Use formal German (Sie) for professional/technical content
5. For code: Keep variable/function names in English unless user-facing
6. For math: Preserve LaTeX notation and symbols unchanged
7. Adapt examples and cultural references appropriately for German audiences
8. Maintain consistent terminology throughout"

User: "TEXT TO TRANSLATE: {text}"
```

**Hyperparameters:**

- Temperature: 0.7
- Top-p: 0.8
- Top-k: 20
- Max tokens: 32,768
- Batch size: 1,024

### Stage 2: Reasoning Generation (mistralai/Magistral-Small-2506)

Generates reasoning traces and final answers for the translated prompts, using the provided solutions as hints.

**Prompt Template:**

```
System: "A user will ask you to solve a task. You have been provided with a solution as a hint to guide your reasoning. Draft your thinking process (inner monologue) working through the problem step-by-step, using the provided solution as guidance. Afterwards, write a self-contained summary of your thoughts.

Your thinking process must follow:
<think>
Your thoughts/draft - be casual and thorough. Use the solution hint to guide your reasoning.
</think>

Here, provide a concise summary reflecting your reasoning and the final answer."

User: "Problem: {input_text}
Solution Hint: {output_hint}"
```

**Hyperparameters:**

- Temperature: 0.7
- Top-p: 0.95
- Max tokens: 40,960
- Batch size: 1,024

## Load the Dataset

```python
from datasets import load_dataset

# Per the card metadata, "science" is a split of the "default" config,
# so it is passed as split= rather than as a config name.
dataset = load_dataset("nomis92/Llama-Nemotron-German", split="science")
```

## Dataset Statistics

### Sample Counts

- Total items: 708,920
- Valid reasoning traces: 680,995 (96.1%)
- Failed reasoning generations: 27,925 (3.9%)

### Token Counts

| Column | Mean | Median | Total |
|--------|------|--------|-------|
| Input | 225 | 190 | 159.8M |
| Reasoning | 2,212 | 1,417 | 1,506.1M |
| Output (Original) | 399 | 364 | 282.7M |
| Output (Generated) | 588 | 165 | 405.3M |
| **Combined** | **2,922** | **1,849** | **2,071.2M** |

### Data Quality

**Language Distribution (German):**

- Input: 99.88%
- Reasoning: 99.53%
- Output (Original): 99.99%
- Output (Generated): 99.51%

**Lexical Diversity (avg):**

- Input: 6.55
- Reasoning: 10.87
- Output (Original): 8.37
- Output (Generated): 6.96

**Unique Words Ratio:**

- Input: 75.9%
- Reasoning: 39.9%
- Output (Original): 72.4%
- Output (Generated): 79.7%

## Licensing

This dataset is derived from NVIDIA's Llama-Nemotron dataset. Please refer to the [original dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) for licensing information.
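
## Example: Splitting a Stage 2 Response

The Stage 2 prompt asks the model to put its thinking inside `<think>...</think>` and to follow it with a summary. The card does not say how responses were split into `reasoning_trace` and `output_generated`, or how failed generations are stored, so the following is only a minimal sketch of one plausible way to do it. The names `THINK_BLOCK`, `ParsedResponse`, and `split_response` are hypothetical, and treating a missing or empty `<think>` block as a failed generation is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Matches the first complete <think>...</think> block, across newlines.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


class ParsedResponse(NamedTuple):
    reasoning_trace: Optional[str]  # None when no usable trace was found
    output_generated: str           # text after the closing </think> tag


def split_response(response: str) -> ParsedResponse:
    """Split a raw model response into reasoning trace and final summary.

    Assumption: a response without a complete, non-empty <think> block
    is treated as a failed reasoning generation (trace is None).
    """
    match = THINK_BLOCK.search(response)
    if match is None:
        return ParsedResponse(None, response.strip())
    trace = match.group(1).strip()
    summary = response[match.end():].strip()
    return ParsedResponse(trace or None, summary)


if __name__ == "__main__":
    raw = "<think>\nErst die Einheiten umrechnen.\n</think>\nDie Antwort lautet 42."
    parsed = split_response(raw)
    print(parsed.reasoning_trace)
    print(parsed.output_generated)
```

Whether the 27,925 failed generations reported above correspond to this condition is not stated in the card; when filtering the published data, it is safer to inspect a few rows of `reasoning_trace` first and adapt the check accordingly.
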
## Citation

```bibtex
@dataset{gurgurov2025llamanemotrongerman,
  title={Llama-Nemotron-German: A German Reasoning Dataset with Multi-Step Pipeline},
  author={Gurgurov, Daniil and Ostermann, Simon},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/datasets/nomis92/Llama-Nemotron-German}},
  note={Derived from NVIDIA Llama-Nemotron Post-Training Dataset with two-stage translation and reasoning generation pipeline}
}
```

## Acknowledgments

- Dataset source: [NVIDIA Llama-Nemotron Post-Training Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)
- Translation model: [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B)
- Reasoning model: [mistralai/Magistral-Small-2506](https://huggingface.co/mistralai/Magistral-Small-2506)
Provider:
nomis92