AMAImedia/NOESIS-1M-reasoning-router-code-math-psych-opus47-deepseek4-qwen36-gemini31-r1-gpt54

Source: Hugging Face. Updated: 2026-04-27. Indexed: 2026-05-03.
Download link:
https://hf-mirror.com/datasets/AMAImedia/NOESIS-1M-reasoning-router-code-math-psych-opus47-deepseek4-qwen36-gemini31-r1-gpt54
Overview:
The NOESIS DORA SFT Dataset is a multilingual supervised fine-tuning dataset built for the NOESIS QwQ+DeepSeek-R1 MoE pipeline. It is part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO, Deterministic Hybrid Control Framework for Frozen Neural Operators).

The dataset contains 1,000,000 records in the full set and a curated high-quality subset of 50,000 records. It covers multiple languages, including English, Russian, Chinese, Arabic, Hindi, Spanish, French, German, Japanese, Korean, Turkish, Vietnamese, Persian, Italian, Portuguese, Indonesian, Bengali, Thai, Ukrainian, Polish, Dutch, Tamil, Malay, Swahili, Hausa, Gujarati, Kazakh, Uzbek, Marathi, and Urdu. The format is JSONL: each record contains a user question and an assistant answer, and some records include reasoning traces from QwQ-32B/DeepSeek-R1, marked with `<think>...</think>`.

The data comes from several sources: the Aya dataset (~197k records, multilingual instructions), Claude Sonnet 4.6 SFT (~122k, high-quality English assistant turns), DeepSeek-R1-Distill-7B synthetic data (~41k, with reasoning traces), NOESIS translation pairs (~46k, 30-language parallel SFT), Claude Opus 4.7 thinking data (~25k, extended reasoning traces), and other code, math, and research sources.

The 50K curated subset is selected with a quality scoring strategy (for example, bonus points for reasoning traces, long responses, and code or math content) combined with language quotas that ensure representation across 30 languages.

The dataset is designed for DoRA SFT fine-tuning, router fine-tuning (gate.weight training for CMoE architectures), and multilingual instruction tuning with reasoning trace distillation, primarily targeting the NOESIS-QwQ-R1 pipeline. License: Apache 2.0.
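
The record format and the curated-subset selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the publisher's actual pipeline: the field names `user`, `assistant`, and `lang` are hypothetical (the description only says each record holds a user question and an assistant answer), and the score weights, length threshold, and math heuristic are invented for the example. Only the bonus categories (reasoning trace, long response, code, math) and the idea of per-language quotas come from the description.

```python
import json
import re
from collections import defaultdict

# Assumption: the field names "user", "assistant", and "lang" are hypothetical.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(answer: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer); the trace is empty if absent."""
    match = THINK_RE.search(answer)
    if match is None:
        return "", answer.strip()
    trace = match.group(1).strip()
    final = (answer[:match.start()] + answer[match.end():]).strip()
    return trace, final


def quality_score(record: dict) -> int:
    """Illustrative score; weights and thresholds are assumptions."""
    answer = record["assistant"]
    trace, final = split_reasoning(answer)
    score = 0
    if trace:
        score += 2  # reasoning trace present
    if len(final) > 500:
        score += 1  # long response
    if "```" in answer:
        score += 1  # fenced code content
    if re.search(r"\$[^$]+\$|\\frac|\\sum", answer):
        score += 1  # math content (crude heuristic)
    return score


def select_with_quotas(records, per_language_quota):
    """Keep the top-scoring records of each language, up to a fixed quota."""
    by_lang = defaultdict(list)
    for rec in records:
        by_lang[rec["lang"]].append(rec)
    selected = []
    for recs in by_lang.values():
        ranked = sorted(recs, key=quality_score, reverse=True)
        selected.extend(ranked[:per_language_quota])
    return selected


def load_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in lines if line.strip()]


# Tiny in-memory sample standing in for a JSONL file.
sample = [
    '{"lang": "en", "user": "2+2?", "assistant": "<think>Add the numbers.</think>4"}',
    '{"lang": "en", "user": "Hi", "assistant": "Hello!"}',
    '{"lang": "sw", "user": "Habari?", "assistant": "Nzuri."}',
]
records = load_jsonl(sample)
curated = select_with_quotas(records, per_language_quota=1)
```

With a quota of one record per language, the English slot goes to the record carrying a reasoning trace, while the single Swahili record is kept regardless of its score. That is the point of combining scoring with quotas: low-resource languages stay represented even when their records score lower.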
Provider:
AMAImedia