MaziyarPanahi/Nemotron-Cascade-2-SFT-Data-Small
收藏Hugging Face2026-03-22 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/MaziyarPanahi/Nemotron-Cascade-2-SFT-Data-Small
下载链接
链接失效反馈官方服务:
资源简介:
---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
tags:
- sft
- instruction-tuning
- math
- science
- chat
- safety
- code
- agent
dataset_info:
features:
- name: domain
dtype: string
- name: source
dtype: string
- name: prompt
list:
- name: content
dtype: string
- name: role
dtype: string
- name: completion
list:
- name: content
dtype: string
- name: role
dtype: string
splits:
- name: train
num_bytes: 111140015782
num_examples: 4898804
download_size: 54155279870
dataset_size: 111140015782
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
# Nemotron-Cascade-2-SFT-Data-Small
A **20% random sample** of [nvidia/Nemotron-Cascade-2-SFT-Data](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-2-SFT-Data), merged into a single `train` split with **4,898,804 rows**.
## Subsets included (all merged)
| Original subset | Files sampled | ~Rows sampled |
|---|---|---|
| math | math_notool, math_proof, math_tool | ~1,045,266 |
| science | science | ~544,383 |
| chat | chat_part_1 – chat_part_4 | ~2,794,866 |
| instruction_following | instruction_following | ~163,869 |
| safety | safety | ~693 |
| conversational_agent | conversational_agent | ~164,264 |
| swe | swe_agentic, swe_agentless | ~88,174 |
| terminal_agent | terminal_agent | ~97,289 |
## Schema
```python
{
"domain": str, # e.g. "math_notool", "chat", "swe_agentic"
"source": str, # upstream data source
"messages": list[{"role": str, "content": str}],
"generator": str, # model that generated the response
}
```
## Usage
```python
from datasets import load_dataset
ds = load_dataset("MaziyarPanahi/Nemotron-Cascade-2-SFT-Data-Small", split="train")
```
## Sampling details
- Sample rate: 20% Bernoulli per source file
- Random seed: 42
- Output format: Parquet (zstd compressed, 500K rows/shard, 10 shards, ~35 GB)
提供机构:
MaziyarPanahi



