toroe/Soofi-Think-SFT-10B-multilingual

Name: toroe/Soofi-Think-SFT-10B-multilingual
Creator: toroe
Published: 2026-03-27 11:43:24
License: 暂无描述

Hugging Face2026-03-27 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/toroe/Soofi-Think-SFT-10B-multilingual

下载链接

链接失效反馈

官方服务：

资源简介：

--- configs: - config_name: default data_files: - split: english path: data/english-* - split: italian path: data/italian-* - split: german path: data/german-* - split: french path: data/french-* - split: spanish path: data/spanish-* dataset_info: features: - name: messages list: - name: content dtype: string - name: role dtype: string - name: source dtype: string - name: dataset_name dtype: string - name: ds_uid dtype: int64 - name: language dtype: string - name: row_index dtype: int64 splits: - name: english num_bytes: 33593217838 num_examples: 2283204 - name: italian num_bytes: 25920841021 num_examples: 2283204 - name: german num_bytes: 27324918834 num_examples: 2283204 - name: french num_bytes: 29051025609 num_examples: 2283204 - name: spanish num_bytes: 28383397334 num_examples: 2283204 download_size: 66853417655 dataset_size: 144273400636 license: apache-2.0 language: - en - de - fr - es - it --- # Reason<sub>XL</sub>: A Multilingual Cross-Domain Reasoning Corpus **Reason**<sub>XL</sub> is a large-scale multilingual reasoning corpus spanning 5 languages and ~44B tokens in total. It is designed to support supervised fine-tuning of reasoning models with in-language chain-of-thought traces across diverse technical domains. --- ## Data Generation English source samples were drawn from 10 existing reasoning datasets, filtered and quality-annotated using [`ellamind/propella-1-4b`](https://huggingface.co/ellamind/propella-1-4b), and then translated into four European languages (German, French, Spanish, Italian) using `Qwen3-32B` served via vLLM. Each sample consists of three independently translated components: the **user input**, the **reasoning trace** (within `<think>` tags), and the **final output**. Translation used nucleus sampling at low temperature (T=0.1, top-p=1.0) with a dedicated system prompt instructing the model to preserve technical terminology, mathematical notation, and reasoning structure. English samples were annotated across 18 properties (safety, information density, educational value, audience, domain, etc.) and filtered through a multi-stage pipeline enforcing integrity constraints, domain-dependent quality thresholds, and class-aware downsampling for domain balance. Annotations transfer directly to all translations without re-annotation. --- ### Translation Prompt Each field (input, reasoning trace, output) was translated independently using the following prompt template: --- ``` SYSTEM: You are a professional translator specializing in technical and educational content. Translate the following {field} text into {language}. CRITICAL INSTRUCTIONS: 1. Output ONLY the translated text 2. Preserve ALL technical terms, code snippets, mathematical notation, and formatting exactly 3. Maintain the same tone, style, and formality 4. {language-specific formality guidance} 5. For code: Keep variable/function names in English 6. For math: Preserve LaTeX notation unchanged 7. Adapt examples and cultural references appropriately 8. Maintain terminology consistency throughout ``` --- ``` USER: TEXT TO TRANSLATE: {text} ``` --- Language-specific formality guidance: - **German**: Use formal German (*Sie*) for professional/technical content - **Spanish**: Use neutral Spanish suitable for international audiences - **French**: Use standard French with appropriate formality - **Italian**: Use standard Italian with professional tone --- ## Data Sources | Dataset | Config | Samples | |---|---|---| | Cascade-SFT-Stage-2 | general / math | 768,615 | | Dolci-Think-SFT-7B | science | 347,453 | | Cascade-SFT-Stage-1 | general / code / math / science | 711,812 | | Llama-Nemotron-PTD | science | 267,147 | | Nemotron-Science-v1 | — | 97,026 | | Nemotron-IF-Chat-v1 | — | 91,151 | | **Total** | | **2,282,204** | --- ## Statistics | Language | Tokens (B) | Avg. Total Length | Avg. Input | Avg. Output | |---|---|---|---|---| | English (`en`) | 9.2 | 4,023 | 424 | 3,599 | | German (`de`) | 8.8 | 3,866 | 504 | 3,363 | | French (`fr`) | 8.8 | 3,872 | 493 | 3,379 | | Spanish (`es`) | 8.7 | 3,796 | 478 | 3,318 | | Italian (`it`) | 8.5 | 3,742 | 495 | 3,247 | | **Total** | **44.07** | — | — | — | The corpus is designed as a **living resource** — the translation pipeline is ongoing, with the full release containing approximately twice as many tokens per language as the current version. --- ## Citation ```bibtex @misc{reasonxl2026, title = {Reason{XL}: A Multilingual Cross-Domain Reasoning Corpus}, author = {Daniil Gurgurov and Tom Röhr}, year = {2026}, publisher = {Hugging Face}, howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-10B-multilingual}} } ``` > Paper citation will be added upon publication.

提供机构：

toroe

5,000+

优质数据集

54 个

任务类型

进入经典数据集