AMAImedia/NOESIS-50K-reasoning-router-code-math-psych-opus47-deepseek4-qwen36-gemini31-r1-gpt54

Hugging Face · Updated 2026-04-27 · Indexed 2026-05-03
Download link:
https://hf-mirror.com/datasets/AMAImedia/NOESIS-50K-reasoning-router-code-math-psych-opus47-deepseek4-qwen36-gemini31-r1-gpt54
Resource overview:
---
license: apache-2.0
language:
- en
- ru
- zh
- ar
- hi
- es
- fr
- de
- ja
- ko
- tr
- vi
- fa
- it
- pt
- id
- bn
- th
- uk
- pl
- nl
- ta
- ms
- sw
- ha
- gu
- kk
- uz
- mr
- ur
pretty_name: NOESIS Multilingual Reasoning Router SFT Dataset (1M + 50K curated)
size_categories:
- 1M<n<10M
task_categories:
- text-generation
tags:
- noesis
- sft
- instruction-tuning
- multilingual
- reasoning
- code
- math
- chain-of-thought
- moe
- dhcf-fno
---

# NOESIS DORA SFT Dataset

**Multilingual supervised fine-tuning dataset built for the NOESIS QwQ+DeepSeek-R1 MoE pipeline.**

Released as part of the **NOESIS Professional Multilingual Dubbing Automation Platform** (framework: DHCF-FNO, the Deterministic Hybrid Control Framework for Frozen Neural Operators).

- **Founder:** Ilia Bolotnikov
- **Organization:** [AMAImedia.com](https://www.amaimedia.com)
- **X (Twitter):** [@AMAImediacom](https://x.com/AMAImediacom)
- **LinkedIn:** [Ilia Bolotnikov](https://www.linkedin.com/in/ilia-bolotnikov)
- **Telegram:** [@djbionicl](https://t.me/djbionicl)
- **NOESIS version:** v14.8-NT89
- **Build date:** 2026-04

---

## Files

| File | Records | Size | Description |
| --- | --- | --- | --- |
| `NOESIS-1M-multilingual-reasoning-router-general-code-math-psych-aya-sft-claude-sonet46-opus47-deepseek4-qwen36-gemini31-r1-gpt54.jsonl` | 1,000,000 | ~1.4 GB | Full 1M SFT dataset (2026-04-18) |
| `NOESIS-50K-multilingual-reasoning-router-general-code-math-psych-aya-sft-claude-sonet46-opus47-deepseek4-qwen36-gemini31-r1-gpt54.jsonl` | 50,000 | ~66 MB | 50K curated high-quality subset, top-30 language quotas |

---

## Format

All files are JSONL, one JSON object per line:

```json
{"text": "User: <question>\nAssistant: <answer>"}
```

Records with `<think>...</think>` blocks contain reasoning traces from QwQ-32B / DeepSeek-R1 heritage.
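The record layout above can be unpacked with a few lines of Python. This is a minimal sketch, not the authors' tooling: the helper name `parse_record` is hypothetical, and it assumes each record holds a single `User:` / `Assistant:` turn with at most one `<think>...</think>` block inside the assistant part.

```python
import json
import re

# Matches one reasoning trace, including newlines inside the block.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_record(line: str) -> dict:
    """Split one JSONL line into question, reasoning trace, and answer.

    Assumption: single-turn records of the form
    "User: <question>\nAssistant: <answer>".
    """
    text = json.loads(line)["text"]
    user_part, sep, assistant_part = text.partition("\nAssistant: ")
    if not sep or not user_part.startswith("User: "):
        raise ValueError("record does not match the User/Assistant layout")
    question = user_part[len("User: "):]
    match = THINK_RE.search(assistant_part)
    reasoning = match.group(1).strip() if match else None
    # The answer is whatever remains once the reasoning block is removed.
    answer = THINK_RE.sub("", assistant_part).strip()
    return {"question": question, "reasoning": reasoning, "answer": answer}


sample = json.dumps(
    {"text": "User: What is 2 + 2?\nAssistant: <think>2 + 2 = 4</think>4"}
)
print(parse_record(sample))
```

Records without a `<think>` block come back with `reasoning` set to `None`, so the same function can be used to separate reasoning and non-reasoning records.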

---

## Dataset composition (1M)

| Source | Records | Notes |
| --- | --- | --- |
| Aya dataset (Cohere, 204k) | ~197,000 | Multilingual instruction, 101 languages |
| Claude Sonnet 4.6 SFT | ~122,000 | High-quality EN assistant turns |
| DeepSeek-R1-Distill-7B synthetic | ~41,000 | Reasoning traces with `<think>` |
| NOESIS translation pairs (50k) | ~46,000 | 30-language parallel SFT |
| Claude Opus 4.7 thinking | ~25,000 | Extended reasoning traces |
| Other SFT sources | ~36,000 | Code, math, research |
| Additional mixed sources | ~533,000 | Rebalanced multilingual SFT |

---

## 50k curated selection strategy

The `NOESIS-50K-multilingual-...-gpt54.jsonl` is sampled from the 1M with quality scoring:

**Quality score (higher = selected first):**

- +3 if `<think>...</think>` present (reasoning trace)
- +2 if assistant response > 2000 chars
- +1 if assistant response > 500 chars
- +1 if contains code (triple-backtick fence or `def`/`function`/`class`)
- +1 if contains math (LaTeX symbols, ∑ ∫ ≤ ≥)

**Language quotas (35,000 total across top-30 languages):**

| Lang | Quota | | Lang | Quota | | Lang | Quota |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EN | 8,000 | | ID | 1,000 | | UK | 500 |
| ZH | 4,000 | | DE | 1,000 | | PL | 500 |
| HI | 2,500 | | JA | 1,000 | | NL | 500 |
| ES | 2,500 | | KO | 800 | | TA | 500 |
| AR | 2,000 | | TR | 800 | | MS | 400 |
| FR | 2,000 | | VI | 800 | | SW | 400 |
| RU | 1,500 | | FA | 700 | | HA | 400 |
| PT | 1,500 | | IT | 700 | | GU | 400 |
| | | | BN | 600 | | KK | 400 |
| | | | TH | 600 | | UZ | 400 |
| | | | | | | MR | 400 |
| | | | | | | UR | 400 |

**English high-quality pool:** 15,000 records (reasoning/code/math priority)

---

## Contributing AI models

Synthetic SFT records in this dataset were generated by or distilled from outputs of:

| Model | Usage |
| --- | --- |
| Claude Sonnet 4.6 | High-quality EN instruction, coding, analysis |
| Claude Opus 4.7 (thinking) | Extended reasoning traces |
| DeepSeek-R1 / R1-Distill | `<think>` reasoning chain records |
| DeepSeek V4 | General instruction, coding, and reasoning |
| Qwen3.6 | Multilingual and reasoning SFT |
| Gemini 3.1 | General instruction and research |
| GPT-5.4 | Diverse instruction-following turns |

---

## Intended use

This dataset is designed for:

- DoRA SFT fine-tuning of Qwen3-based MoE models
- Router fine-tuning (gate.weight training) for CMoE architectures
- Multilingual instruction tuning with reasoning trace distillation

Primary target: NOESIS-QwQ-R1 pipeline (QwQ-32B + DeepSeek-R1-32B TIES merge → CMoE 16E).

---

## License

Apache License 2.0.

Dataset composition includes records derived from:

- [Aya dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset): Apache 2.0, Cohere
- Original NOESIS synthetic data: Apache 2.0, AMAImedia.com 2026

See `LICENSE` file for full terms.

---

## HuggingFace repos

| Dataset | HuggingFace repo |
| --- | --- |
| 1M full | `AMAImedia/NOESIS-1M-reasoning-router-code-math-psych-opus47-deepseek4-qwen36-gemini31-r1-gpt54` |
| 50K curated | `AMAImedia/NOESIS-50K-reasoning-router-code-math-psych-opus47-deepseek4-qwen36-gemini31-r1-gpt54` |

*Note: HuggingFace enforces a 96-character repo ID limit. The full dataset name is encoded in the filename.*

---

## Citation

```bibtex
@misc{noesis_dora_dataset_2026,
  title = {NOESIS DORA SFT Dataset — 1M multilingual instruction-tuning records},
  author = {Bolotnikov, Ilia},
  year = {2026},
  publisher = {AMAImedia},
  url = {https://amaimedia.com}
}
```
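
The quality score and per-language quotas from the 50k curated selection strategy can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' selection script. The names `quality_score` and `select_by_quota` are hypothetical. The two length bonuses are treated as mutually exclusive tiers, the length is measured over the whole assistant response including any `<think>` block, "LaTeX symbols" is approximated as a backslash command such as `\frac`, and each record is assumed to arrive with a language code already attached, since the card does not say how languages were detected.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
# Code heuristic from the card: a fence marker or the def/function/class keywords.
CODE_RE = re.compile(r"```|\b(?:def|function|class)\b")
# Assumption: "LaTeX symbols" means a backslash command; plus the four listed glyphs.
MATH_RE = re.compile(r"\\[A-Za-z]+|[∑∫≤≥]")


def quality_score(assistant_text: str) -> int:
    """Score one assistant response using the bonuses listed in the card."""
    score = 0
    if THINK_RE.search(assistant_text):
        score += 3
    # Assumption: the length bonuses are tiers, not cumulative.
    if len(assistant_text) > 2000:
        score += 2
    elif len(assistant_text) > 500:
        score += 1
    if CODE_RE.search(assistant_text):
        score += 1
    if MATH_RE.search(assistant_text):
        score += 1
    return score


def select_by_quota(records, quotas):
    """Pick the highest-scoring records per language until each quota is full.

    records: iterable of (lang, assistant_text) pairs (lang is assumed given).
    quotas:  dict mapping language code to the maximum number of records.
    """
    ranked = sorted(records, key=lambda r: quality_score(r[1]), reverse=True)
    taken = {lang: 0 for lang in quotas}
    selected = []
    for lang, text in ranked:
        if taken.get(lang, 0) < quotas.get(lang, 0):
            taken[lang] += 1
            selected.append((lang, text))
    return selected


demo = [("en", "hello"), ("en", "<think>a</think>b"), ("zh", "def f(): pass")]
print(select_by_quota(demo, {"en": 1, "zh": 1}))
```

Sorting once and filling quotas greedily keeps the "higher = selected first" ordering within every language; languages without a quota entry are skipped.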
Provider:
AMAImedia