toroe/Soofi-Think-SFT-10B-multilingual
Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/toroe/Soofi-Think-SFT-10B-multilingual
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: english
    path: data/english-*
  - split: italian
    path: data/italian-*
  - split: german
    path: data/german-*
  - split: french
    path: data/french-*
  - split: spanish
    path: data/spanish-*
dataset_info:
  features:
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: source
    dtype: string
  - name: dataset_name
    dtype: string
  - name: ds_uid
    dtype: int64
  - name: language
    dtype: string
  - name: row_index
    dtype: int64
  splits:
  - name: english
    num_bytes: 33593217838
    num_examples: 2283204
  - name: italian
    num_bytes: 25920841021
    num_examples: 2283204
  - name: german
    num_bytes: 27324918834
    num_examples: 2283204
  - name: french
    num_bytes: 29051025609
    num_examples: 2283204
  - name: spanish
    num_bytes: 28383397334
    num_examples: 2283204
  download_size: 66853417655
  dataset_size: 144273400636
license: apache-2.0
language:
- en
- de
- fr
- es
- it
---
# Reason<sub>XL</sub>: A Multilingual Cross-Domain Reasoning Corpus
**Reason**<sub>XL</sub> is a large-scale multilingual reasoning corpus spanning 5 languages and ~44B tokens in total. It is designed to support supervised fine-tuning of reasoning models with in-language chain-of-thought traces across diverse technical domains.
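
Because the card metadata exposes each language as its own split (`english`, `italian`, `german`, `french`, `spanish`), a single language can be read on its own. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed. The helper names `split_for` and `stream_language` and the code-to-split mapping are illustrative and not part of the dataset. Streaming is used so that the roughly 67 GB download listed in the metadata is not fetched up front.

```python
from itertools import islice

REPO_ID = "toroe/Soofi-Think-SFT-10B-multilingual"

# Illustrative mapping (an assumption of this sketch): ISO codes from the
# card's `language:` list to the split names from its `configs:` block.
SPLIT_BY_CODE = {
    "en": "english",
    "de": "german",
    "fr": "french",
    "es": "spanish",
    "it": "italian",
}


def split_for(code: str) -> str:
    """Return the split name for a two-letter language code."""
    try:
        return SPLIT_BY_CODE[code.lower()]
    except KeyError:
        raise ValueError(f"unsupported language code: {code!r}") from None


def stream_language(code: str, limit: int = 3):
    """Yield the first `limit` rows of one language split without a full download."""
    # Imported lazily so the mapping above can be used without `datasets`.
    from datasets import load_dataset

    rows = load_dataset(REPO_ID, split=split_for(code), streaming=True)
    yield from islice(rows, limit)


# Example usage (requires network access):
#   for row in stream_language("de", limit=1):
#       print(row["language"], row["dataset_name"], len(row["messages"]))
```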
---
## Data Generation
English source samples were drawn from 10 existing reasoning datasets, filtered and quality-annotated using [`ellamind/propella-1-4b`](https://huggingface.co/ellamind/propella-1-4b), and then translated into four European languages (German, French, Spanish, Italian) using `Qwen3-32B` served via vLLM.
Each sample consists of three independently translated components: the **user input**, the **reasoning trace** (within `<think>` tags), and the **final output**. Translation used low-temperature sampling (T=0.1, top-p=1.0) with a dedicated system prompt instructing the model to preserve technical terminology, mathematical notation, and reasoning structure.
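
The three components can be recovered from a row's `messages` list roughly as follows. This is a minimal sketch that assumes the role values are `user` and `assistant` and that the reasoning trace appears as a literal `<think>...</think>` span at the start of the assistant message. Neither detail is confirmed by the card beyond the mention of `<think>` tags, and the function names are illustrative.

```python
import re
from typing import Dict, List, Tuple

# Non-greedy match so only the first <think>...</think> span is captured;
# DOTALL lets the trace run across multiple lines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str) -> Tuple[str, str]:
    """Split assistant text into (reasoning trace, final output)."""
    match = _THINK_RE.search(content)
    if match is None:
        # No trace found: treat the whole text as the final output.
        return "", content.strip()
    return match.group(1).strip(), content[match.end():].strip()


def sample_components(messages: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the user input, reasoning trace, and final output of one sample."""
    # Assumed role names; adjust if the data uses different ones.
    user_input = next(m["content"] for m in messages if m["role"] == "user")
    reply = next(m["content"] for m in messages if m["role"] == "assistant")
    reasoning, output = split_reasoning(reply)
    return {"input": user_input, "reasoning": reasoning, "output": output}
```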
English samples were annotated across 18 properties (safety, information density, educational value, audience, domain, etc.) and filtered through a multi-stage pipeline enforcing integrity constraints, domain-dependent quality thresholds, and class-aware downsampling for domain balance. Annotations transfer directly to all translations without re-annotation.
---
### Translation Prompt
Each field (input, reasoning trace, output) was translated independently using the following prompt template:
---
```
SYSTEM: You are a professional translator specializing in technical and
educational content. Translate the following {field} text into {language}.
CRITICAL INSTRUCTIONS:
1. Output ONLY the translated text
2. Preserve ALL technical terms, code snippets, mathematical notation,
and formatting exactly
3. Maintain the same tone, style, and formality
4. {language-specific formality guidance}
5. For code: Keep variable/function names in English
6. For math: Preserve LaTeX notation unchanged
7. Adapt examples and cultural references appropriately
8. Maintain terminology consistency throughout
```
---
```
USER: TEXT TO TRANSLATE:
{text}
```
---
Language-specific formality guidance:
- **German**: Use formal German (*Sie*) for professional/technical content
- **Spanish**: Use neutral Spanish suitable for international audiences
- **French**: Use standard French with appropriate formality
- **Italian**: Use standard Italian with professional tone
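
The template and the formality guidance above can be combined into chat messages along these lines. This is a sketch, not the authors' pipeline code: the function name `build_translation_messages`, the `{formality}` placeholder, and the exact line breaks inside the system prompt are assumptions. The sampling values are the ones given in the Data Generation section.

```python
from typing import Dict, List

# Guidance strings copied from the list above (emphasis markers dropped).
FORMALITY = {
    "German": "Use formal German (Sie) for professional/technical content",
    "Spanish": "Use neutral Spanish suitable for international audiences",
    "French": "Use standard French with appropriate formality",
    "Italian": "Use standard Italian with professional tone",
}

SYSTEM_TEMPLATE = (
    "You are a professional translator specializing in technical and "
    "educational content. Translate the following {field} text into {language}.\n"
    "CRITICAL INSTRUCTIONS:\n"
    "1. Output ONLY the translated text\n"
    "2. Preserve ALL technical terms, code snippets, mathematical notation, "
    "and formatting exactly\n"
    "3. Maintain the same tone, style, and formality\n"
    "4. {formality}\n"
    "5. For code: Keep variable/function names in English\n"
    "6. For math: Preserve LaTeX notation unchanged\n"
    "7. Adapt examples and cultural references appropriately\n"
    "8. Maintain terminology consistency throughout"
)

# Sampling settings stated in the Data Generation section.
SAMPLING = {"temperature": 0.1, "top_p": 1.0}


def build_translation_messages(field: str, language: str, text: str) -> List[Dict[str, str]]:
    """Build the system and user messages for translating one field."""
    system = SYSTEM_TEMPLATE.format(
        field=field, language=language, formality=FORMALITY[language]
    )
    # The source text is appended directly, so braces in it are left untouched.
    user = "TEXT TO TRANSLATE:\n" + text
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```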
---
## Data Sources
| Dataset | Config | Samples |
|---|---|---|
| Cascade-SFT-Stage-2 | general / math | 768,615 |
| Dolci-Think-SFT-7B | science | 347,453 |
| Cascade-SFT-Stage-1 | general / code / math / science | 711,812 |
| Llama-Nemotron-PTD | science | 267,147 |
| Nemotron-Science-v1 | — | 97,026 |
| Nemotron-IF-Chat-v1 | — | 91,151 |
| **Total** | | **2,283,204** |
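
As a quick consistency check, the per-source counts in the table can be summed and compared with the `num_examples` value that the card metadata reports for every language split (2,283,204). The variable names are illustrative.

```python
# Sample counts per source dataset, copied from the table above.
SOURCE_COUNTS = {
    "Cascade-SFT-Stage-2": 768_615,
    "Dolci-Think-SFT-7B": 347_453,
    "Cascade-SFT-Stage-1": 711_812,
    "Llama-Nemotron-PTD": 267_147,
    "Nemotron-Science-v1": 97_026,
    "Nemotron-IF-Chat-v1": 91_151,
}

# `num_examples` per split, from the card metadata.
EXAMPLES_PER_SPLIT = 2_283_204

total = sum(SOURCE_COUNTS.values())
print(f"sum of sources: {total:,}")            # sum of sources: 2,283,204
print("matches metadata:", total == EXAMPLES_PER_SPLIT)  # matches metadata: True
```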
---
## Statistics
| Language | Tokens (B) | Avg. Total Length | Avg. Input | Avg. Output |
|---|---|---|---|---|
| English (`en`) | 9.2 | 4,023 | 424 | 3,599 |
| German (`de`) | 8.8 | 3,866 | 504 | 3,363 |
| French (`fr`) | 8.8 | 3,872 | 493 | 3,379 |
| Spanish (`es`) | 8.7 | 3,796 | 478 | 3,318 |
| Italian (`it`) | 8.5 | 3,742 | 495 | 3,247 |
| **Total** | **44.07** | — | — | — |
The corpus is designed as a **living resource**: the translation pipeline is ongoing, and the full release is expected to contain approximately twice as many tokens per language as the current version.
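
The token counts in the table can be roughly reproduced by multiplying the average total length by the number of examples per split. This sketch rests on two assumptions the table does not state: that the average lengths are measured in tokens per sample, and that every split contains the 2,283,204 examples listed in the card metadata. Under those assumptions each estimate lands within 0.05B of the reported per-language figure, and the estimated total comes to about 44.06B against the reported 44.07B.

```python
# `num_examples` per split, from the card metadata.
EXAMPLES_PER_SPLIT = 2_283_204

# language code -> (average total length, reported tokens in billions),
# both copied from the Statistics table.
STATS = {
    "en": (4_023, 9.2),
    "de": (3_866, 8.8),
    "fr": (3_872, 8.8),
    "es": (3_796, 8.7),
    "it": (3_742, 8.5),
}


def estimated_tokens_b(avg_len: int, n_examples: int = EXAMPLES_PER_SPLIT) -> float:
    """Estimate the token count in billions as average length times examples."""
    return avg_len * n_examples / 1e9


for code, (avg_len, reported) in STATS.items():
    estimate = estimated_tokens_b(avg_len)
    print(f"{code}: estimated {estimate:.2f}B, reported {reported}B")

total_b = sum(estimated_tokens_b(avg_len) for avg_len, _ in STATS.values())
print(f"total: estimated {total_b:.2f}B, reported 44.07B")
```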
---
## Citation
```bibtex
@misc{reasonxl2026,
title = {Reason{XL}: A Multilingual Cross-Domain Reasoning Corpus},
author = {Daniil Gurgurov and Tom Röhr},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-10B-multilingual}}
}
```
> Paper citation will be added upon publication.
Provider:
toroe



