
toroe/Soofi-Think-SFT-V2-firsthalf-FR

Hugging Face · Updated 2026-03-11 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/toroe/Soofi-Think-SFT-V2-firsthalf-FR
Resource summary:
---
language:
- fr
license: other
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- reasoning
- thinking
- chain-of-thought
- sft
- translation
- french
- math
- science
- code
- chat
- tool-calling
pretty_name: Soofi-Think-SFT-V2-firsthalf-FR
size_categories:
- 1M<n<10M
---

# Soofi-Think-SFT-V2-firsthalf-FR

French-translated version of [toroe/Soofi-Think-SFT-V2-firsthalf](https://huggingface.co/datasets/toroe/Soofi-Think-SFT-V2-firsthalf), a large-scale supervised fine-tuning dataset featuring chain-of-thought reasoning traces (`<think>...</think>`) across math, science, code, tool-calling, and general instruction-following tasks.

The translation was produced using **Qwen3-32B** via [vLLM](https://github.com/vllm-project/vllm), applying professional-grade translation prompts targeting standard French suitable for international francophone audiences.

---

## Dataset Summary

| Property | Value |
|---|---|
| Language | French (`fr`) |
| Source dataset | `toroe/Soofi-Think-SFT-V2-firsthalf` |
| Total rows | ~2.37M |
| Translation model | `Qwen/Qwen3-32B` (FP8 quantization) |
| Format | Chat-style JSONL (`messages` field) |
| Thinking traces | Preserved with `<think>…</think>` tags |

---

## Source Datasets

The rows in this dataset originate from a broad blend of high-quality English SFT corpora. The `dataset_name` and `source` fields identify the provenance of each row.
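As a minimal sketch of how these two fields might be used, the snippet below tallies rows per (`dataset_name`, `source`) pair from JSONL lines. The helper name `count_by_provenance` and the sample rows are illustrative assumptions, not part of any tooling shipped with the dataset.

```python
import json
from collections import Counter


def count_by_provenance(jsonl_lines):
    """Count rows per (dataset_name, source) pair from an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        counts[(row["dataset_name"], row["source"])] += 1
    return counts


# Hypothetical sample rows; only the field names come from the dataset card.
sample_lines = [
    json.dumps({"dataset_name": "Dolci-Think-SFT-7B", "source": "example/source-a"}),
    json.dumps({"dataset_name": "Dolci-Think-SFT-7B", "source": "example/source-a"}),
    json.dumps({"dataset_name": "Nemotron-Science-v1", "source": "example/source-b"}),
]

counts = count_by_provenance(sample_lines)
for (dataset_name, source), n in counts.most_common():
    print(f"{dataset_name}\t{source}\t{n}")
```

The same function should work on an open file handle, since it only iterates over lines.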
Known source collections include:

- **Dolci-Think-SFT-7B**: OpenThoughts3 (math, science, code), WildJailbreak R1, WildChat R1, WildGuardMix R1, Aya-100k R1, Persona-precise-IF R1, SYNTHETIC-2-SFT, Nemotron-post-training subset, correct-python-sft
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (general)**: SlimOrca, HuggingFaceTB/smoltalk, mmlu_auxiliary_train, ShareGPT_Vicuna_unfiltered, GPTeacher-General-Instruct, flan_v2, synthetic, nvidia/Nemotron-Post-Training-Dataset-v1
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (math)**: NuminaMath-CoT, OpenMathReasoning
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (science)**: Nemotron-Post-Training-Dataset-v1-stem, synthetic
- **Nemotron-Cascade-SFT-Stage-1 / Stage-2 (code)**: OpenCodeReasoning, leetcode
- **Nemotron-Cascade-SFT-Stage-2 (tool-calling)**: Nemotron-Post-Training-Dataset-v1-tool-calling
- **Nemotron-Science-v1**: MCQ, RQA
- **Llama-Nemotron-Post-Training-Dataset**: Science subset
- **Nemotron-Instruction-Following-Chat-v1**: nemotron_v3_chat

---

## Dataset Structure

Each row is a JSONL record with the following fields:

```json
{
  "row_index": 0,
  "dataset_name": "Dolci-Think-SFT-7B",
  "source": "saumyamalik/OpenThoughts3-full-filtered-science-decontam-v2",
  "ds_uid": 839609,
  "language": "french",
  "messages": [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "<think>\n...\n</think>\n..."}
  ],
  "thinking_chunked": null
}
```

### Fields

| Field | Type | Description |
|---|---|---|
| `row_index` | int | Original row index in the source dataset |
| `dataset_name` | string | High-level source collection name |
| `source` | string | Specific upstream HuggingFace dataset/split |
| `ds_uid` | int | Unique ID from the source dataset |
| `language` | string | Always `"french"` |
| `messages` | list | Chat-format turns: `system` / `user` / `assistant` |
| `thinking_chunked` | bool or null | `true` if the `<think>` block was too long to translate in one pass and was split into chunks |

### Message Format
Assistant turns that include a reasoning trace are formatted as:

```
<think>
[translated reasoning trace]
</think>
[translated final answer]
```

Tool-calling rows may include a `system` turn with function signatures, which are intentionally left in English as they contain code-like structured content (function names, JSON schemas, identifiers).

---

## Translation Methodology

Translation was performed with a custom vLLM-based pipeline. Key design decisions:

- **Model:** `Qwen/Qwen3-32B` with FP8 weight quantization and prefix caching enabled
- **Decoding:** Near-greedy sampling (temperature `0.1`, top-p `1.0`) for translation stability
- **Context management:** Input token budgets are computed as `usable / (1 + output_ratio)` where `output_ratio=1.1`, ensuring sufficient room for output generation
- **Long-text chunking:** Fields exceeding the token limit are split on paragraph boundaries (falling back to line, then word boundaries) and translated in chunks, then reassembled. Rows where this occurred are flagged with `thinking_chunked: true`
- **Batch efficiency:** All non-chunked fields across a batch are sent to vLLM in a single call; chunked fields are also batched together in large calls to maximize throughput
- **Register:** Standard French with appropriate formality, suitable for international francophone audiences
- **Preserved elements:** Code, variable names, LaTeX/mathematical notation, file paths, URLs, tool/function signatures, and quoted literals are left in English

### Translation Prompt Guidelines (summary)

The system prompt instructed the model to:

1. Output **only** the translated text, with no meta-commentary or explanations
2. Translate **all** natural-language prose; leave code, identifiers, and literals unchanged
3. Preserve formatting, tone, and formality level of the original
4. Adapt cultural references appropriately for French-speaking audiences
5. Maintain consistent terminology throughout each document

---

## Intended Uses

This dataset is intended for:

- **Multilingual SFT / instruction tuning** of language models targeting French-speaking users
- **Cross-lingual reasoning** research (chain-of-thought in French)
- **Distillation** of reasoning capabilities into smaller French-language models
- **Tool-use and function-calling** training in a French context
- Benchmarking **translation quality** of reasoning-heavy content

---

## Limitations

- Translations are machine-generated and may contain errors, particularly for highly domain-specific or ambiguous content
- Very long reasoning traces that required chunked translation (`thinking_chunked: true`) may have minor coherence issues at chunk boundaries
- Tool-calling `system` prompts are intentionally kept in English, as they contain structured technical content (JSON schemas, function signatures) that must remain machine-readable
- Technical terms and proper nouns are generally preserved in English, which reflects standard practice for French technical writing but may not suit all use cases
- The dataset inherits any biases, errors, or quality issues present in the original English source datasets

---

## Citation

If you use this dataset, please also cite the original upstream datasets and the Qwen3 model used for translation.

```bibtex
@misc{soofi-think-sft-v2-fr,
  title        = {Soofi-Think-SFT-V2-firsthalf-FR},
  author       = {toroe},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-V2-firsthalf-FR}},
  note         = {French translation of Soofi-Think-SFT-V2-firsthalf using Qwen3-32B via vLLM}
}
```
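Two details described above lend themselves to a short sketch: separating the `<think>` trace from the final answer in an assistant turn, and the input-budget rule `usable / (1 + output_ratio)`. The function names `split_think` and `input_token_budget`, the rounding-down step, and the example token count are assumptions made for illustration; this is not the pipeline's actual code.

```python
import re

# A leading <think>...</think> block followed by the final answer.
THINK_RE = re.compile(r"\A\s*<think>(.*?)</think>(.*)\Z", re.DOTALL)


def split_think(content):
    """Return (reasoning, answer); reasoning is None if no leading <think> block exists."""
    match = THINK_RE.match(content)
    if match is None:
        return None, content
    return match.group(1).strip(), match.group(2).strip()


def input_token_budget(usable_tokens, output_ratio=1.1):
    """Input budget = usable / (1 + output_ratio), rounded down (rounding is an assumption)."""
    return int(usable_tokens / (1 + output_ratio))


reasoning, answer = split_think("<think>\nÉtape 1 : ...\n</think>\nLa réponse est 4.")
print(reasoning)
print(answer)

# Hypothetical context size, chosen only to make the arithmetic easy to follow.
print(input_token_budget(21_000))
```

With a hypothetical 21,000 usable tokens, the rule gives an input budget of about 10,000 tokens, which leaves about 11,000 (1.1 times the input) for the generated translation.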
Provider:
toroe