toroe/Soofi-Think-SFT-V2-firsthalf-FR
Source: Hugging Face. Updated 2026-03-11; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/toroe/Soofi-Think-SFT-V2-firsthalf-FR
Resource description:
---
language:
- fr
license: other
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- reasoning
- thinking
- chain-of-thought
- sft
- translation
- french
- math
- science
- code
- chat
- tool-calling
pretty_name: Soofi-Think-SFT-V2-firsthalf-FR
size_categories:
- 1M<n<10M
---
# Soofi-Think-SFT-V2-firsthalf-FR
French-translated version of [toroe/Soofi-Think-SFT-V2-firsthalf](https://huggingface.co/datasets/toroe/Soofi-Think-SFT-V2-firsthalf) — a large-scale supervised fine-tuning dataset featuring chain-of-thought reasoning traces (`<think>...</think>`) across math, science, code, tool-calling, and general instruction-following tasks.
The translation was produced with **Qwen3-32B** served through [vLLM](https://github.com/vllm-project/vllm), using translation prompts that target standard French suitable for international francophone audiences.
---
## Dataset Summary
| Property | Value |
|---|---|
| Language | French (`fr`) |
| Source dataset | `toroe/Soofi-Think-SFT-V2-firsthalf` |
| Total rows | ~2.37M |
| Translation model | `Qwen/Qwen3-32B` (FP8 quantization) |
| Format | Chat-style JSONL (`messages` field) |
| Thinking traces | Preserved with `<think>…</think>` tags |
---
## Source Datasets
The rows in this dataset originate from a broad blend of high-quality English SFT corpora. The `dataset_name` and `source` fields identify the provenance of each row. Known source collections include:
- **Dolci-Think-SFT-7B** — OpenThoughts3 (math, science, code), WildJailbreak R1, WildChat R1, WildGuardMix R1, Aya-100k R1, Persona-precise-IF R1, SYNTHETIC-2-SFT, Nemotron-post-training subset, correct-python-sft
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (general)** — SlimOrca, HuggingFaceTB/smoltalk, mmlu_auxiliary_train, ShareGPT_Vicuna_unfiltered, GPTeacher-General-Instruct, flan_v2, synthetic, nvidia/Nemotron-Post-Training-Dataset-v1
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (math)** — NuminaMath-CoT, OpenMathReasoning
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (science)** — Nemotron-Post-Training-Dataset-v1-stem, synthetic
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (code)** — OpenCodeReasoning, leetcode
- **Nemotron-Cascade-SFT-Stage-2 (tool-calling)** — Nemotron-Post-Training-Dataset-v1-tool-calling
- **Nemotron-Science-v1** — MCQ, RQA
- **Llama-Nemotron-Post-Training-Dataset** — Science subset
- **Nemotron-Instruction-Following-Chat-v1** — nemotron_v3_chat
---
## Dataset Structure
Each row is a JSONL record with the following fields:
```json
{
"row_index": 0,
"dataset_name": "Dolci-Think-SFT-7B",
"source": "saumyamalik/OpenThoughts3-full-filtered-science-decontam-v2",
"ds_uid": 839609,
"language": "french",
"messages": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "<think>\n...\n</think>\n..."}
],
"thinking_chunked": null
}
```
### Fields
| Field | Type | Description |
|---|---|---|
| `row_index` | int | Original row index in the source dataset |
| `dataset_name` | string | High-level source collection name |
| `source` | string | Specific upstream Hugging Face dataset/split |
| `ds_uid` | int | Unique ID from the source dataset |
| `language` | string | Always `"french"` |
| `messages` | list | Chat-format turns: `system` / `user` / `assistant` |
| `thinking_chunked` | bool or null | `true` if the `<think>` block was too long to translate in one pass and was split into chunks |
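
As a rough illustration of the schema above, the following sketch parses one JSONL line and checks it against the documented field types. The sample line's message contents, the `validate_record` helper, and the `EXPECTED_FIELDS` mapping are hypothetical names introduced here for illustration; they are not part of the dataset or its tooling.

```python
import json
from typing import List

# Illustrative JSONL line shaped like the record above; message contents are placeholders.
sample_line = json.dumps({
    "row_index": 0,
    "dataset_name": "Dolci-Think-SFT-7B",
    "source": "saumyamalik/OpenThoughts3-full-filtered-science-decontam-v2",
    "ds_uid": 839609,
    "language": "french",
    "messages": [
        {"role": "user", "content": "Question ?"},
        {"role": "assistant", "content": "<think>\nRaisonnement\n</think>\nRéponse"},
    ],
    "thinking_chunked": None,
})

# Field types taken from the Fields table (hypothetical helper mapping).
EXPECTED_FIELDS = {
    "row_index": int,
    "dataset_name": str,
    "source": str,
    "ds_uid": int,
    "language": str,
    "messages": list,
}
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the row matches the documented schema."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(record.get(name), expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
    # thinking_chunked is documented as bool or null.
    flag = record.get("thinking_chunked")
    if flag is not None and not isinstance(flag, bool):
        problems.append("thinking_chunked: expected bool or null")
    # Every turn should be a dict with one of the documented roles.
    messages = record.get("messages")
    if isinstance(messages, list):
        for turn in messages:
            role = turn.get("role") if isinstance(turn, dict) else None
            if role not in ALLOWED_ROLES:
                problems.append(f"unexpected role: {role!r}")
    return problems


record = json.loads(sample_line)
print(validate_record(record))
```

A check like this can be run over a small sample of the files before training to catch rows that do not match the card.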
### Message Format
Assistant turns that include a reasoning trace are formatted as:
```
<think>
[translated reasoning trace]
</think>
[translated final answer]
```
Tool-calling rows may include a `system` turn with function signatures. These turns are intentionally left in English because they contain code-like structured content (function names, JSON schemas, identifiers).
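
For consumers who need the reasoning trace and the final answer separately, a minimal sketch of splitting an assistant turn is shown below. It assumes a single leading `<think>…</think>` block as in the format above; the `split_thinking` helper is a hypothetical name, not something shipped with the dataset.

```python
import re
from typing import Optional, Tuple

# Matches one leading <think>...</think> block; DOTALL lets "." span newlines.
THINK_RE = re.compile(r"\A\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_thinking(content: str) -> Tuple[Optional[str], str]:
    """Split an assistant turn into (reasoning, answer).

    Returns (None, content) unchanged when the turn has no leading trace.
    """
    match = THINK_RE.match(content)
    if match is None:
        return None, content
    return match.group(1).strip(), content[match.end():]


reasoning, answer = split_thinking("<think>\nÉtape 1\nÉtape 2\n</think>\nLa réponse est 4.")
print(reasoning)
print(answer)
```

Turns without a trace pass through untouched, so the same helper can be applied to every assistant message.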
---
## Translation Methodology
Translation was performed with a custom vLLM-based pipeline. Key design decisions:
- **Model:** `Qwen/Qwen3-32B` with FP8 weight quantization and prefix caching enabled
- **Decoding:** Near-greedy sampling (temperature `0.1`, top-p `1.0`) for translation stability
- **Context management:** The input token budget is computed as `usable / (1 + output_ratio)` with `output_ratio=1.1`, so an output up to 1.1 times the input length still fits within the usable token count
- **Long-text chunking:** Fields exceeding the token limit are split on paragraph boundaries (falling back to line, then word boundaries) and translated in chunks, then reassembled. Rows where this occurred are flagged with `thinking_chunked: true`
- **Batch efficiency:** All non-chunked fields across a batch are sent to vLLM in a single call; chunked fields are also batched together in large calls to maximize throughput
- **Register:** Standard French with appropriate formality, suitable for international francophone audiences
- **Preserved elements:** Code, variable names, LaTeX/mathematical notation, file paths, URLs, tool/function signatures, and quoted literals are left in English
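
The context-management and chunking bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's actual code: `usable_tokens` stands for whatever token count the pipeline treats as usable, `count_tokens` is a whitespace word count standing in for the model tokenizer, and the function names are hypothetical. Chunks keep their trailing separators so that concatenating them reproduces the input exactly.

```python
from typing import Callable, List

# Boundaries tried in order: paragraph, then line, then word.
SEPARATORS = ["\n\n", "\n", " "]


def input_token_budget(usable_tokens: int, output_ratio: float = 1.1) -> int:
    """Largest input size such that input + output_ratio * input fits in usable_tokens."""
    return int(usable_tokens / (1 + output_ratio))


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def chunk_text(text: str, max_tokens: int,
               count: Callable[[str], int] = count_tokens,
               level: int = 0) -> List[str]:
    """Greedily pack pieces split at the current boundary into chunks of at most max_tokens."""
    if count(text) <= max_tokens or level >= len(SEPARATORS):
        return [text]
    sep = SEPARATORS[level]
    parts = text.split(sep)
    # Re-attach the separator to every piece but the last so nothing is lost.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if count(current + piece) <= max_tokens:
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if count(piece) <= max_tokens:
            current = piece
        else:
            # A single piece is still too large: retry it at the next finer boundary.
            body = piece[: -len(sep)] if piece.endswith(sep) else piece
            sub = chunk_text(body, max_tokens, count, level + 1)
            sub[-1] += piece[len(body):]
            chunks.extend(sub)
    if current:
        chunks.append(current)
    return chunks


text = "un deux trois\n\nquatre cinq six sept huit\n\nneuf"
chunks = chunk_text(text, max_tokens=3)
print(chunks)
print("".join(chunks) == text)
```

In this sketch each chunk would be translated independently and the results concatenated in order; how the real pipeline reassembles translated chunks is not specified in this card.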
### Translation Prompt Guidelines (summary)
The system prompt instructed the model to:
1. Output **only** the translated text — no meta-commentary or explanations
2. Translate **all** natural-language prose; leave code, identifiers, and literals unchanged
3. Preserve formatting, tone, and formality level of the original
4. Adapt cultural references appropriately for French-speaking audiences
5. Maintain consistent terminology throughout each document
---
## Intended Uses
This dataset is intended for:
- **Multilingual SFT / instruction tuning** of language models targeting French-speaking users
- **Cross-lingual reasoning** research (chain-of-thought in French)
- **Distillation** of reasoning capabilities into smaller French-language models
- **Tool-use and function-calling** training in a French context
- Benchmarking **translation quality** of reasoning-heavy content
---
## Limitations
- Translations are machine-generated and may contain errors, particularly for highly domain-specific or ambiguous content
- Very long reasoning traces that required chunked translation (`thinking_chunked: true`) may have minor coherence issues at chunk boundaries
- Tool-calling `system` prompts are intentionally kept in English, as they contain structured technical content (JSON schemas, function signatures) that must remain machine-readable
- Technical terms and proper nouns are generally preserved in English, which reflects standard practice for French technical writing but may not suit all use cases
- The dataset inherits any biases, errors, or quality issues present in the original English source datasets
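
If coherence at chunk boundaries is a concern, rows flagged `thinking_chunked: true` can be set aside before training. The sketch below assumes rows are already loaded as dictionaries with the fields described above; `drop_chunked` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Iterator


def drop_chunked(rows: Iterable[Dict]) -> Iterator[Dict]:
    """Yield only rows whose reasoning trace was not split into chunks for translation."""
    for row in rows:
        # The flag is documented as bool or null; only an explicit true marks a chunked row.
        if row.get("thinking_chunked") is not True:
            yield row


rows = [
    {"ds_uid": 1, "thinking_chunked": None},
    {"ds_uid": 2, "thinking_chunked": True},
    {"ds_uid": 3, "thinking_chunked": False},
]
kept = [row["ds_uid"] for row in drop_chunked(rows)]
print(kept)
```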
---
## Citation
If you use this dataset, please also cite the original upstream datasets and the Qwen3 model used for translation.
```bibtex
@misc{soofi-think-sft-v2-fr,
title = {Soofi-Think-SFT-V2-firsthalf-FR},
author = {toroe},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-V2-firsthalf-FR}},
note = {French translation of Soofi-Think-SFT-V2-firsthalf using Qwen3-32B via vLLM}
}
```
Provider:
toroe



