T-Wix
收藏魔搭社区2025-12-05 更新2025-07-26 收录
下载链接:
https://modelscope.cn/datasets/t-tech/T-Wix
下载链接
链接失效反馈官方服务:
资源简介:
# T-Wix SFT Mixture
🚨 T-Wix is built entirely from publicly available data and intended for use in research and development.
The dataset may contain noise, biases, or artifacts that require careful inspection and preprocessing.
Users are fully responsible for any downstream use and must ensure compliance with ethical, legal, and safety standards.
<div style="text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64fb054ebb362cbf2fe53159/d8OYPhADqqEPKgi4eAqvQ.jpeg" width="400" height="400" style="display: block; margin: 0 auto;">
</div>
### 📝 Dataset Summary
**T‑Wix** is a Russian supervised fine‑tuning (SFT) dataset.
The dataset is divided into 2 sticks:
1. General (468614 samples) — Covering a broad range of topics, including Math, Science, Coding & Programming, General Knowledge, Instruction Following, Roleplay, and more.
2. Reasoning (30984 samples) — Focusing on advanced math and science problems with detailed reasoning traces.
It combines a variety of prompts drawn from open‑source resources and high‑quality Russian translations of English datasets.
The primary goal is to enhance the model’s core capabilities — ranging from solving algorithmic and mathematical problems to dialogue, logical thinking, and reasoning mode.
We also add long context, including summarization and long-form question–answer pairs, with contexts up to 32000 tokens, and corpus of samples in English in the General part.
The total size of the dataset is about **499598** samples in Russian.
### ⚙️ Data Preparation
To ensure high quality, variety and thematic diversity, we applied a multi-stage filtering pipeline for **General** and **Reasoning** data.
#### General Data
- Stage 0: Deduplication. We removed near‑duplicates using locality‑sensitive hashing (LSH) and embedding‑based similarity.
- Stage 1: Diversity Control with #InsTag. We used [Instruction Tagging #InsTag](https://arxiv.org/pdf/2308.07074) to balance themes and styles, preventing any single topic from dominating the data.
- Stage 2: Quality Filtering. A reward model (RM‑score) evaluated each sample, eliminating low‑quality examples.
- Stage 3: Difficulty Selection. We used [Instruction-Following Difficulty (IFD)](https://arxiv.org/pdf/2308.12032v5) to retain the most challenging examples.
- Stage 4: Rejection Sampling. Finally, we generated 8 completions for each prompt with more capable model and choose the best one by RM-score.
#### Reasoning Data
- Stage 0: Deduplication. Same as for General data, we removed near-duplicates using LSH and embedding similarity.
- Stage 1: Prompt-Level Filtering via RM-Score Distributions. For each prompt, 8 completions were generated using a teacher model and 8 using a base model. Each of them was scored using a reward model and normalized using *softmax*. We discarded prompts with overly low or high median scores.
- Stage 2: KL-Based Selection. We computed KL divergence between the teacher and student RM-score distributions and retained prompts with mid-range KL values. This allowed us to select examples that were balanced in complexity.
- Stage 3: Final Completion Selection. For verifiable prompts, correctness was checked. Final answers were selected as the shortest among the top-3 teacher completions ranked by RM-score - balancing quality and conciseness.
Responses were generated with more capable models — such as [DeepSeek‑V3‑0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) and [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B), to ensure high accuracy and relevance.
<div style="text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64fb054ebb362cbf2fe53159/Ntp6zyORpF2DYUtCiAna7.png" width="600" height="600" style="display: block; margin: 0 auto;">
<p style="font-style: italic; color: gray;">Token counts were computed using <a href="https://github.com/openai/tiktoken" target="_blank" style="color: gray; text-decoration: underline;">tiktoken</a> with <code>o200k_base</code> tokenizer.</p>
</div>
### 📌 Data fields
- `id`(str) — unique ID of sample.
- `messages`(list) — an array of messages, where each message includes a role (system, user or assistant) and content.
- `subset`(str) — `general`, `reasoning`, `long_context` or `english_corpus` subsets.
### 🔐 License
This dataset is licensed under ODC-BY-1.0.
It includes outputs from third-party models, which may be subject to separate terms.
See each subset link for specific licensing details.
### 📚 Sources
- [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct)
- [H-D-T/Buzz-V1.2](https://huggingface.co/datasets/H-D-T/Buzz-V1.2)
- [nyuuzyou/chatgpt-in-russia-qa](https://huggingface.co/datasets/nyuuzyou/chatgpt-in-russia-qa)
- [nyuuzyou/ruschatgpt-qa](https://huggingface.co/datasets/nyuuzyou/ruschatgpt-qa)
- [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
- [Replete-AI/code_bagel](https://huggingface.co/datasets/Replete-AI/code_bagel)
- [BAAI/TACO](https://huggingface.co/datasets/BAAI/TACO)
- [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
- [allenai/WildChat](https://huggingface.co/datasets/allenai/WildChat)
- [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)
- [arcee-ai/The-Tome](https://huggingface.co/datasets/arcee-ai/The-Tome)
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
- [RussianNLP/Mixed-Summarization-Dataset](https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset)
- [nvidia/Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater)
- [stanfordnlp/SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
- [Locutusque/hercules-v6.1](https://huggingface.co/datasets/Locutusque/hercules-v6.1)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)
- [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
- [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)
- [nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
- [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
- [argilla/Capybara-Preferences](https://huggingface.co/datasets/argilla/Capybara-Preferences)
- [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction)
- [Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Llama3](https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Llama3)
- [https://huggingface.co/datasets/IlyaGusev/gpt_roleplay_realm](https://huggingface.co/datasets/IlyaGusev/gpt_roleplay_realm)
- [TIGER-Lab/WebInstruct-verified](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified)
- [nvidia/AceReason-Math](https://huggingface.co/datasets/nvidia/AceReason-Math)
- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)
- [EricLu/SCP-116K](https://huggingface.co/datasets/EricLu/SCP-116K)
- [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)
- [open-r1/OpenThoughts-114k-math](https://huggingface.co/datasets/open-r1/OpenThoughts-114k-math)
- [open-r1/codeforces-cots](https://huggingface.co/datasets/open-r1/codeforces-cots)
- [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k)
- [KodCode/KodCode-V1-SFT-R1](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-R1)
⚠️ T-Wix 完全基于公开可用数据构建,仅用于研发场景。本数据集可能包含噪声、偏差或人工伪影,需经仔细检查与预处理后方可使用。用户需对下游使用负完全责任,并确保其符合伦理、法律及安全标准。
<div style="text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64fb054ebb362cbf2fe53159/d8OYPhADqqEPKgi4eAqvQ.jpeg" width="400" height="400" style="display: block; margin: 0 auto;">
</div>
### 📝 数据集概述
**T‑Wix** 是一款俄语监督微调(Supervised Fine-Tuning,SFT)数据集。
本数据集分为两个子集:
1. **通用(General)子集(468614条样本)** —— 涵盖数学、科学、编码与编程、通用知识、指令遵循、角色扮演等广泛主题。
2. **推理(Reasoning)子集(30984条样本)** —— 聚焦带有详细推理轨迹的高等数学与科学问题。
该数据集整合了开源资源中的各类提示词,以及英语数据集的高质量俄语译文,核心目标是增强模型的核心能力,涵盖算法与数学问题求解、对话、逻辑思考与推理模式等方向。
我们还新增了长上下文内容,包括摘要与长格式问答对,上下文长度可达32000个词元(Token),且通用子集中包含英语语料库样本。
本数据集的俄语样本总规模约为 **499598** 条。
### ⚙️ 数据制备流程
为确保数据集的高质量、多样性与主题覆盖广度,我们针对**通用**与**推理**两类数据分别采用了多阶段过滤流程。
#### 通用数据处理流程
- **阶段0:去重**:通过局部敏感哈希(Locality-Sensitive Hashing,LSH)与基于嵌入的相似度方法移除近似重复样本。
- **阶段1:基于#InsTag的多样性控制**:使用[指令标注#InsTag](https://arxiv.org/pdf/2308.07074)平衡数据集的主题与风格,避免单一主题主导数据分布。
- **阶段2:质量过滤**:通过奖励模型(Reward Model,RM)评分(RM-score)对每条样本进行评估,剔除低质量样本。
- **阶段3:难度筛选**:使用[指令遵循难度(Instruction-Following Difficulty,IFD)](https://arxiv.org/pdf/2308.12032v5)筛选并保留高难度样本。
- **阶段4:拒绝采样**:使用更强能力的模型为每条提示词生成8条回复,并通过RM-score选取最优回复。
#### 推理数据处理流程
- **阶段0:去重**:与通用数据处理流程一致,通过LSH与嵌入相似度方法移除近似重复样本。
- **阶段1:基于RM评分分布的提示词级过滤**:针对每条提示词,使用教师模型生成8条回复、基础模型生成8条回复,再通过奖励模型对所有回复进行评分,并使用*softmax*进行归一化。剔除评分中位数过高或过低的提示词。
- **阶段2:基于KL散度的筛选**:计算教师模型与基础模型的RM评分分布之间的KL散度,保留KL值处于中等区间的提示词,以此筛选出复杂度均衡的样本。
- **阶段3:最终回复选择**:对于可验证的提示词,检查其答案正确性。最终答案选取RM-score排名前三的教师模型回复中最短的一条,兼顾回复质量与简洁性。
回复由更强能力的模型生成,例如[DeepSeek‑V3‑0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324)与[Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B),以确保回复的高准确性与相关性。
<div style="text-align: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/64fb054ebb362cbf2fe53159/Ntp6zyORpF2DYUtCiAna7.png" width="600" height="600" style="display: block; margin: 0 auto;">
<p style="font-style: italic; color: gray;">*注:词元计数通过[tiktoken](https://github.com/openai/tiktoken)与`o200k_base`分词器计算得到。*</p>
</div>
### 📌 数据字段
- `id`(str) — 样本唯一标识符。
- `messages`(list) — 消息数组,每条消息包含角色(系统、用户或助手)与内容。
- `subset`(str) — 子集类型,包括`general`(通用)、`reasoning`(推理)、`long_context`(长上下文)或`english_corpus`(英语语料库)。
### 🔐 许可协议
本数据集采用ODC-BY-1.0许可协议进行授权。其包含第三方模型生成的输出,此类输出可能受单独的条款约束。有关各子集的具体许可细节,请参阅对应子集的链接。
### 📚 数据源
- [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct)
- [H-D-T/Buzz-V1.2](https://huggingface.co/datasets/H-D-T/Buzz-V1.2)
- [nyuuzyou/chatgpt-in-russia-qa](https://huggingface.co/datasets/nyuuzyou/chatgpt-in-russia-qa)
- [nyuuzyou/ruschatgpt-qa](https://huggingface.co/datasets/nyuuzyou/ruschatgpt-qa)
- [lmsys/lmsys-chat-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)
- [Replete-AI/code_bagel](https://huggingface.co/datasets/Replete-AI/code_bagel)
- [BAAI/TACO](https://huggingface.co/datasets/BAAI/TACO)
- [HuggingFaceTB/smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk)
- [allenai/WildChat](https://huggingface.co/datasets/allenai/WildChat)
- [allenai/tulu-3-sft-mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture)
- [arcee-ai/The-Tome](https://huggingface.co/datasets/arcee-ai/The-Tome)
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
- [RussianNLP/Mixed-Summarization-Dataset](https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset)
- [nvidia/Daring-Anteater](https://huggingface.co/datasets/nvidia/Daring-Anteater)
- [stanfordnlp/SHP](https://huggingface.co/datasets/stanfordnlp/SHP)
- [Locutusque/hercules-v6.1](https://huggingface.co/datasets/Locutusque/hercules-v6.1)
- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
- [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)
- [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5)
- [OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)
- [nvidia/HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2)
- [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback)
- [argilla/Capybara-Preferences](https://huggingface.co/datasets/argilla/Capybara-Preferences)
- [m-a-p/CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction)
- [Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Llama3](https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Llama3)
- [https://huggingface.co/datasets/IlyaGusev/gpt_roleplay_realm](https://huggingface.co/datasets/IlyaGusev/gpt_roleplay_realm)
- [TIGER-Lab/WebInstruct-verified](https://huggingface.co/datasets/TIGER-Lab/WebInstruct-verified)
- [nvidia/AceReason-Math](https://huggingface.co/datasets/nvidia/AceReason-Math)
- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset)
- [EricLu/SCP-116K](https://huggingface.co/datasets/EricLu/SCP-116K)
- [open-r1/OpenR1-Math-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)
- [open-r1/OpenThoughts-114k-math](https://huggingface.co/datasets/open-r1/OpenThoughts-114k-math)
- [open-r1/codeforces-cots](https://huggingface.co/datasets/open-r1/codeforces-cots)
- [LLM360/guru-RL-92k](https://huggingface.co/datasets/LLM360/guru-RL-92k)
- [KodCode/KodCode-V1-SFT-R1](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-R1)
提供机构:
maas
创建时间:
2025-07-19



