five

Arko007/zenyx-v2-raw-datasets

收藏
Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Arko007/zenyx-v2-raw-datasets
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 configs: - config_name: cascade_conversational_agent data_files: - split: train path: cascade_conversational_agent/train-* - config_name: cascade_instruction_following data_files: - split: train path: cascade_instruction_following/train-* - config_name: cascade_math data_files: - split: train path: cascade_math/train-* - config_name: cascade_safety data_files: - split: train path: cascade_safety/train-* - config_name: cascade_science data_files: - split: train path: cascade_science/train-* - config_name: cascade_swe data_files: - split: train path: cascade_swe/train-* - config_name: cascade_terminal_agent data_files: - split: train path: cascade_terminal_agent/train-* - config_name: nemotron_rl data_files: - split: instruction_following path: nemotron_rl/instruction_following-* - config_name: nemotron_sft_chat data_files: - split: chat path: nemotron_sft_chat/chat-* - config_name: nemotron_sft_code data_files: - split: code path: nemotron_sft_code/code-* - config_name: nemotron_sft_math data_files: - split: math path: nemotron_sft_math/math-* - config_name: nemotron_sft_safety data_files: - split: safety path: nemotron_sft_safety/safety-* - config_name: nemotron_sft_science data_files: - split: science path: nemotron_sft_science/science-* dataset_info: - config_name: cascade_conversational_agent features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 16610848715 num_examples: 822213 download_size: 16572224011 dataset_size: 16610848715 - config_name: cascade_instruction_following features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 3291121703 num_examples: 820263 download_size: 3235443211 dataset_size: 3291121703 - config_name: cascade_math features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 234238494601 num_examples: 5226364 download_size: 233946106949 dataset_size: 234238494601 - config_name: cascade_safety features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 14072047 num_examples: 3570 download_size: 13940007 dataset_size: 14072047 - config_name: cascade_science features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 44633958496 num_examples: 2717163 download_size: 44500162446 dataset_size: 44633958496 - config_name: cascade_swe features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 35010223613 num_examples: 439610 download_size: 34984642651 dataset_size: 35010223613 - config_name: cascade_terminal_agent features: - name: domain dtype: string - name: source dtype: string - name: messages list: - name: role dtype: string - name: content dtype: string - name: generator dtype: string splits: - name: train num_bytes: 29404143854 num_examples: 485667 download_size: 29376838676 dataset_size: 29404143854 - config_name: nemotron_rl features: - name: input list: - name: role dtype: string - name: content dtype: string - name: args struct: - name: instruction_id_list list: string - name: instruction_kwargs list: json - name: task dtype: string - name: num_requirements dtype: int64 - name: category dtype: string - name: license dtype: string - name: reasoning dtype: string - name: used_in_training dtype: string - name: version dtype: string - name: system_prompt dtype: string splits: - name: instruction_following num_bytes: 164592594 num_examples: 56339 download_size: 158974946 dataset_size: 164592594 - config_name: nemotron_sft_chat features: - name: input list: - name: role dtype: string - name: content dtype: string - name: output dtype: string - name: category dtype: string - name: license dtype: string - name: reasoning dtype: string - name: generator dtype: string - name: used_in_training dtype: string - name: version dtype: string - name: system_prompt dtype: string splits: - name: chat num_bytes: 245046303 num_examples: 39792 download_size: 169828290 dataset_size: 245046303 - config_name: nemotron_sft_code features: - name: input list: - name: role dtype: string - name: content dtype: string - name: output dtype: string - name: category dtype: string - name: license dtype: string - name: reasoning dtype: string - name: generator dtype: string - name: used_in_training dtype: string - name: version dtype: string - name: system_prompt dtype: string splits: - name: code num_bytes: 45865777355 num_examples: 10108883 download_size: 23565003450 dataset_size: 45865777355 - config_name: nemotron_sft_math features: - name: input list: - name: role dtype: string - name: content dtype: string - name: output dtype: string - name: category dtype: string - name: license dtype: string - name: reasoning dtype: string - name: generator dtype: string - name: used_in_training dtype: string - name: version dtype: string - name: system_prompt dtype: string splits: - name: math num_bytes: 70454610238 num_examples: 22066397 download_size: 33049334526 dataset_size: 70454610238 - config_name: nemotron_sft_safety features: - name: input list: - name: role dtype: string - name: content dtype: string - name: output dtype: string - name: category dtype: string - name: license dtype: string - name: reasoning dtype: string - name: generator dtype: string - name: used_in_training dtype: string - name: version dtype: string - name: system_prompt dtype: string splits: - name: safety num_bytes: 53022448 num_examples: 31426 download_size: 26165302 dataset_size: 53022448 - config_name: nemotron_sft_science features: - name: input list: - name: role dtype: string - name: content dtype: string - name: output dtype: string - name: category dtype: string - name: license dtype: string - name: reasoning dtype: string - name: generator dtype: string - name: used_in_training dtype: string - name: version dtype: string - name: system_prompt dtype: string splits: - name: science num_bytes: 5858893209 num_examples: 708920 download_size: 2936806260 dataset_size: 5858893209 task_categories: - text-generation language: - en tags: - zenyx - sft - instruction-following - math - code - science - reasoning - agent pretty_name: Zenyx V2 SFT Raw Dataset Collection size_categories: - 100M<n<1B --- # Zenyx V2 — Raw SFT Dataset Collection This is the unified raw dataset collection used for training **Zenyx V2**, a custom large language model built from scratch with a novel architecture. ## Dataset Sources | Dataset | Rows | Category | |---------|------|----------| | nemotron_sft_code | 10,108,883 | Code | | nemotron_sft_math | 22,066,397 | Math | | nemotron_sft_science | 708,920 | Science | | nemotron_sft_chat | 39,792 | Chat | | nemotron_sft_safety | 31,426 | Safety | | nemotron_rl | 56,339 | Instruction Following (RL) | | cascade_math | 5,226,364 | Math (Cascade) | | cascade_science | 2,717,163 | Science (Cascade) | | cascade_instruction_following | 820,263 | Instruction Following | | cascade_safety | 3,570 | Safety | | cascade_conversational_agent | 822,213 | Conversational Agent | | cascade_swe | 439,610 | Software Engineering | | cascade_terminal_agent | 485,667 | Terminal Agent | | redmod_math | 143,055,882 | Math (Thinking + Non-Thinking) | ## Column Schemas **Nemotron SFT** — `input`, `output`, `category`, `license`, `reasoning`, `generator`, `used_in_training`, `version`, `system_prompt` **Nemotron Cascade** — `domain`, `source`, `messages`, `generator` **RedMod Math** — `text` ## About Zenyx V2 Zenyx is a custom LLM built with a novel architecture featuring: - Custom tokenizer - Modified attention mechanism - Trained entirely on curated open-source data > This dataset is for research purposes. All source datasets retain > their original licenses. ## Missing (Next Session) - `cascade_chat` (~200GB, download interrupted) - `openO1` (corrupt JSON issue, fix pending) - `stepfun_sft` (OOM issue, fix pending)
提供机构:
Arko007
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作