Arko007/zenyx-v2-raw-datasets
收藏Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Arko007/zenyx-v2-raw-datasets
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
configs:
- config_name: cascade_conversational_agent
data_files:
- split: train
path: cascade_conversational_agent/train-*
- config_name: cascade_instruction_following
data_files:
- split: train
path: cascade_instruction_following/train-*
- config_name: cascade_math
data_files:
- split: train
path: cascade_math/train-*
- config_name: cascade_safety
data_files:
- split: train
path: cascade_safety/train-*
- config_name: cascade_science
data_files:
- split: train
path: cascade_science/train-*
- config_name: cascade_swe
data_files:
- split: train
path: cascade_swe/train-*
- config_name: cascade_terminal_agent
data_files:
- split: train
path: cascade_terminal_agent/train-*
- config_name: nemotron_rl
data_files:
- split: instruction_following
path: nemotron_rl/instruction_following-*
- config_name: nemotron_sft_chat
data_files:
- split: chat
path: nemotron_sft_chat/chat-*
- config_name: nemotron_sft_code
data_files:
- split: code
path: nemotron_sft_code/code-*
- config_name: nemotron_sft_math
data_files:
- split: math
path: nemotron_sft_math/math-*
- config_name: nemotron_sft_safety
data_files:
- split: safety
path: nemotron_sft_safety/safety-*
- config_name: nemotron_sft_science
data_files:
- split: science
path: nemotron_sft_science/science-*
dataset_info:
- config_name: cascade_conversational_agent
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 16610848715
num_examples: 822213
download_size: 16572224011
dataset_size: 16610848715
- config_name: cascade_instruction_following
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 3291121703
num_examples: 820263
download_size: 3235443211
dataset_size: 3291121703
- config_name: cascade_math
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 234238494601
num_examples: 5226364
download_size: 233946106949
dataset_size: 234238494601
- config_name: cascade_safety
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 14072047
num_examples: 3570
download_size: 13940007
dataset_size: 14072047
- config_name: cascade_science
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 44633958496
num_examples: 2717163
download_size: 44500162446
dataset_size: 44633958496
- config_name: cascade_swe
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 35010223613
num_examples: 439610
download_size: 34984642651
dataset_size: 35010223613
- config_name: cascade_terminal_agent
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: messages
list:
- name: role
dtype: string
- name: content
dtype: string
- name: generator
dtype: string
splits:
- name: train
num_bytes: 29404143854
num_examples: 485667
download_size: 29376838676
dataset_size: 29404143854
- config_name: nemotron_rl
features:
- name: input
list:
- name: role
dtype: string
- name: content
dtype: string
- name: args
struct:
- name: instruction_id_list
list: string
- name: instruction_kwargs
list: json
- name: task
dtype: string
- name: num_requirements
dtype: int64
- name: category
dtype: string
- name: license
dtype: string
- name: reasoning
dtype: string
- name: used_in_training
dtype: string
- name: version
dtype: string
- name: system_prompt
dtype: string
splits:
- name: instruction_following
num_bytes: 164592594
num_examples: 56339
download_size: 158974946
dataset_size: 164592594
- config_name: nemotron_sft_chat
features:
- name: input
list:
- name: role
dtype: string
- name: content
dtype: string
- name: output
dtype: string
- name: category
dtype: string
- name: license
dtype: string
- name: reasoning
dtype: string
- name: generator
dtype: string
- name: used_in_training
dtype: string
- name: version
dtype: string
- name: system_prompt
dtype: string
splits:
- name: chat
num_bytes: 245046303
num_examples: 39792
download_size: 169828290
dataset_size: 245046303
- config_name: nemotron_sft_code
features:
- name: input
list:
- name: role
dtype: string
- name: content
dtype: string
- name: output
dtype: string
- name: category
dtype: string
- name: license
dtype: string
- name: reasoning
dtype: string
- name: generator
dtype: string
- name: used_in_training
dtype: string
- name: version
dtype: string
- name: system_prompt
dtype: string
splits:
- name: code
num_bytes: 45865777355
num_examples: 10108883
download_size: 23565003450
dataset_size: 45865777355
- config_name: nemotron_sft_math
features:
- name: input
list:
- name: role
dtype: string
- name: content
dtype: string
- name: output
dtype: string
- name: category
dtype: string
- name: license
dtype: string
- name: reasoning
dtype: string
- name: generator
dtype: string
- name: used_in_training
dtype: string
- name: version
dtype: string
- name: system_prompt
dtype: string
splits:
- name: math
num_bytes: 70454610238
num_examples: 22066397
download_size: 33049334526
dataset_size: 70454610238
- config_name: nemotron_sft_safety
features:
- name: input
list:
- name: role
dtype: string
- name: content
dtype: string
- name: output
dtype: string
- name: category
dtype: string
- name: license
dtype: string
- name: reasoning
dtype: string
- name: generator
dtype: string
- name: used_in_training
dtype: string
- name: version
dtype: string
- name: system_prompt
dtype: string
splits:
- name: safety
num_bytes: 53022448
num_examples: 31426
download_size: 26165302
dataset_size: 53022448
- config_name: nemotron_sft_science
features:
- name: input
list:
- name: role
dtype: string
- name: content
dtype: string
- name: output
dtype: string
- name: category
dtype: string
- name: license
dtype: string
- name: reasoning
dtype: string
- name: generator
dtype: string
- name: used_in_training
dtype: string
- name: version
dtype: string
- name: system_prompt
dtype: string
splits:
- name: science
num_bytes: 5858893209
num_examples: 708920
download_size: 2936806260
dataset_size: 5858893209
task_categories:
- text-generation
language:
- en
tags:
- zenyx
- sft
- instruction-following
- math
- code
- science
- reasoning
- agent
pretty_name: Zenyx V2 SFT Raw Dataset Collection
size_categories:
- 100M<n<1B
---
# Zenyx V2 — Raw SFT Dataset Collection
This is the unified raw dataset collection used for training **Zenyx V2**,
a custom large language model built from scratch with a novel architecture.
## Dataset Sources
| Dataset | Rows | Category |
|---------|------|----------|
| nemotron_sft_code | 10,108,883 | Code |
| nemotron_sft_math | 22,066,397 | Math |
| nemotron_sft_science | 708,920 | Science |
| nemotron_sft_chat | 39,792 | Chat |
| nemotron_sft_safety | 31,426 | Safety |
| nemotron_rl | 56,339 | Instruction Following (RL) |
| cascade_math | 5,226,364 | Math (Cascade) |
| cascade_science | 2,717,163 | Science (Cascade) |
| cascade_instruction_following | 820,263 | Instruction Following |
| cascade_safety | 3,570 | Safety |
| cascade_conversational_agent | 822,213 | Conversational Agent |
| cascade_swe | 439,610 | Software Engineering |
| cascade_terminal_agent | 485,667 | Terminal Agent |
| redmod_math | 143,055,882 | Math (Thinking + Non-Thinking) |
## Column Schemas
**Nemotron SFT** — `input`, `output`, `category`, `license`,
`reasoning`, `generator`, `used_in_training`, `version`, `system_prompt`
**Nemotron Cascade** — `domain`, `source`, `messages`, `generator`
**RedMod Math** — `text`
## About Zenyx V2
Zenyx is a custom LLM built with a novel architecture featuring:
- Custom tokenizer
- Modified attention mechanism
- Trained entirely on curated open-source data
> This dataset is for research purposes. All source datasets retain
> their original licenses.
## Missing (Next Session)
- `cascade_chat` (~200GB, download interrupted)
- `openO1` (corrupt JSON issue, fix pending)
- `stepfun_sft` (OOM issue, fix pending)
提供机构:
Arko007



