kshitijthakkar/nemotron-sft-balanced-2b-v1
Source: Hugging Face. Updated: 2026-02-24. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/kshitijthakkar/nemotron-sft-balanced-2b-v1
Description:
# Nemotron SFT Dataset
## Overview
This dataset is a curated supervised fine-tuning (SFT) dataset built from NVIDIA's Nemotron-Cascade-SFT-Stage-1 and Stage-2 datasets.
## Statistics
- **Total Samples**: 200,000
- **Total Tokens**: 1,252,287,904
- **Average Tokens per Sample**: 6261.4
- **Tokenizer**: Qwen/Qwen3-0.6B
- **Random Seed**: 42
- **Strategy**: balanced
## Subset Distribution
| Subset | Samples | Tokens | Target | Completion | Avg Tokens/Sample |
|--------|---------|--------|--------|------------|-------------------|
| Stage-1/math | 20,000 | 151,546,125 | 20,000 | 100.0% | 7577.3 |
| Stage-1/code | 20,000 | 149,875,360 | 20,000 | 100.0% | 7493.8 |
| Stage-1/science | 20,000 | 88,176,211 | 20,000 | 100.0% | 4408.8 |
| Stage-1/general | 20,000 | 30,166,102 | 20,000 | 100.0% | 1508.3 |
| Stage-2/math | 16,000 | 245,808,608 | 16,000 | 100.0% | 15363.0 |
| Stage-2/code | 16,000 | 196,814,995 | 16,000 | 100.0% | 12300.9 |
| Stage-2/science | 16,000 | 137,249,291 | 16,000 | 100.0% | 8578.1 |
| Stage-2/general | 16,000 | 24,844,951 | 16,000 | 100.0% | 1552.8 |
| Stage-2/tool_calling | 20,000 | 99,972,206 | 20,000 | 100.0% | 4998.6 |
| Stage-2/instruction-following | 20,000 | 20,282,455 | 20,000 | 100.0% | 1014.1 |
| Stage-2/swe_repair | 6,000 | 56,577,614 | 6,000 | 100.0% | 9429.6 |
| Stage-2/swe_localization | 6,000 | 31,229,042 | 6,000 | 100.0% | 5204.8 |
| Stage-2/swe_testgen | 4,000 | 19,744,944 | 4,000 | 100.0% | 4936.2 |
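The table can be cross-checked against the headline statistics. The sketch below copies the per-subset sample and token counts from the table into a plain dictionary and recomputes the totals, the overall average, and the Stage-1 share of samples. The variable names are illustrative and are not part of any dataset tooling.

```python
# (samples, tokens) per subset, copied from the table above.
SUBSETS = {
    "Stage-1/math": (20_000, 151_546_125),
    "Stage-1/code": (20_000, 149_875_360),
    "Stage-1/science": (20_000, 88_176_211),
    "Stage-1/general": (20_000, 30_166_102),
    "Stage-2/math": (16_000, 245_808_608),
    "Stage-2/code": (16_000, 196_814_995),
    "Stage-2/science": (16_000, 137_249_291),
    "Stage-2/general": (16_000, 24_844_951),
    "Stage-2/tool_calling": (20_000, 99_972_206),
    "Stage-2/instruction-following": (20_000, 20_282_455),
    "Stage-2/swe_repair": (6_000, 56_577_614),
    "Stage-2/swe_localization": (6_000, 31_229_042),
    "Stage-2/swe_testgen": (4_000, 19_744_944),
}

# Recompute the headline numbers from the per-subset rows.
total_samples = sum(samples for samples, _ in SUBSETS.values())
total_tokens = sum(tokens for _, tokens in SUBSETS.values())
avg_tokens = total_tokens / total_samples

# Share of samples that come from Stage-1 subsets.
stage1_samples = sum(
    samples for name, (samples, _) in SUBSETS.items() if name.startswith("Stage-1/")
)
stage1_share = stage1_samples / total_samples

print(total_samples, total_tokens, round(avg_tokens, 1), stage1_share)
```

The 40/60 split is by sample count. By tokens, Stage-1 accounts for roughly a third of the total (419,763,798 of 1,252,287,904), because Stage-2 samples are longer on average.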
## Available Strategies
- **balanced**: Balanced mix across all subsets (40% Stage-1, 60% Stage-2)
- **math-focused**: Emphasize math from both stages
- **code-focused**: Emphasize code and SWE tasks
- **general-focused**: Emphasize general instruction following
- **stage1-only**: Only use Stage-1 subsets
- **stage2-only**: Only use Stage-2 subsets
- **advanced**: Stage-2 heavy with tool calling and SWE tasks
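This card does not document how a strategy is turned into per-subset counts. As an illustration only, a strategy could be expressed as a mapping from subset name to target sample count, combined with seeded sampling so the selection is reproducible. In the sketch below, the targets are an excerpt of the Target column in the Subset Distribution table. The helper `sample_indices`, the pool sizes, and the use of Python's `random` module are assumptions made for this example, not the tooling that produced the dataset.

```python
import random

# Excerpt of per-subset targets for the "balanced" strategy (Target column).
BALANCED_TARGETS = {
    "Stage-1/math": 20_000,
    "Stage-2/math": 16_000,
    "Stage-2/swe_testgen": 4_000,
}


def sample_indices(pool_size: int, target: int, seed: int = 42) -> list[int]:
    """Pick up to `target` distinct row indices from a pool of `pool_size` rows.

    Hypothetical helper: if the pool is smaller than the target, every row
    is returned, which would show up as a completion below 100%.
    """
    rng = random.Random(seed)
    k = min(target, pool_size)
    return sorted(rng.sample(range(pool_size), k))


# Made-up pool sizes, purely for illustration.
pool_sizes = {
    "Stage-1/math": 50_000,
    "Stage-2/math": 16_000,
    "Stage-2/swe_testgen": 3_000,
}

for name, target in BALANCED_TARGETS.items():
    picked = sample_indices(pool_sizes[name], target)
    completion = 100.0 * len(picked) / target
    print(f"{name}: {len(picked)} of {target} ({completion:.1f}%)")
```

Fixing the seed means repeated runs select the same rows, which is the usual reason for recording a value such as the random seed of 42 listed in the statistics.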
## Usage
### Loading the Dataset
```python
from datasets import load_from_disk
# Load the dataset
dataset = load_from_disk("./nemotron_sft_dataset/hf_dataset")
# Or load from parquet
import pandas as pd
df = pd.read_parquet("./nemotron_sft_dataset/dataset.parquet")
```
### Training Example
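```python
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load dataset (assumes a single `Dataset`, not a `DatasetDict`)
dataset = load_from_disk("./nemotron_sft_dataset/hf_dataset")

# Load model and tokenizer
model_name = "your-base-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token

# Trainer expects `input_ids`, so tokenize `text` with the model's own tokenizer.
# `encoded_text` holds Qwen/Qwen3-0.6B token IDs and is only reusable if the
# base model shares that vocabulary.
max_length = 4096  # assumed context budget; adjust for your model


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=max_length)


tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
)

# Train; the collator pads each batch and copies `input_ids` into `labels`
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```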
## Dataset Fields
- **text**: The raw formatted conversation text
- **formatted_text**: The formatted text (same as text for SFT data)
- **encoded_text**: Tokenized version of the text (list of token IDs)
- **source**: Source subset name (e.g., "Stage-1/math", "Stage-2/tool_calling")
- **dataset_id**: Original Hugging Face dataset ID
- **config_name**: Configuration name within the dataset
- **stage**: Stage number (1 or 2)
- **token_count**: Number of tokens in the sample
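To make the schema concrete, here is a small sketch that selects records by `source` and `token_count`. The two records are made up for illustration and only mimic the fields listed above. Their `config_name` values, their text, and the integer type of `stage` are assumptions. The predicate name `keep` and the 4096-token budget are also hypothetical.

```python
# Toy records that mimic the documented fields (values are invented).
records = [
    {
        "text": "user: 2+2? assistant: 4",
        "formatted_text": "user: 2+2? assistant: 4",
        "encoded_text": [11, 22, 33],
        "source": "Stage-1/math",
        "dataset_id": "nvidia/Nemotron-Cascade-SFT-Stage-1",
        "config_name": "math",
        "stage": 1,
        "token_count": 3,
    },
    {
        "text": "user: call the weather tool assistant: ok",
        "formatted_text": "user: call the weather tool assistant: ok",
        "encoded_text": [44, 55, 66, 77],
        "source": "Stage-2/tool_calling",
        "dataset_id": "nvidia/Nemotron-Cascade-SFT-Stage-2",
        "config_name": "tool_calling",
        "stage": 2,
        "token_count": 4,
    },
]


def keep(example: dict, max_tokens: int = 4096) -> bool:
    """Keep Stage-2 samples whose length fits within a token budget."""
    return (
        example["source"].startswith("Stage-2/")
        and example["token_count"] <= max_tokens
    )


selected = [r for r in records if keep(r)]
print([r["source"] for r in selected])
```

With the real dataset loaded through the `datasets` library, the same predicate could be passed to `Dataset.filter`, for example `dataset.filter(keep)`.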
## Strategy Used
**balanced**: Balanced mix across all subsets (40% Stage-1, 60% Stage-2) for general-purpose fine-tuning
## License
This dataset follows the licensing of the NVIDIA Nemotron-Cascade-SFT datasets.
Please refer to the original dataset pages for detailed licensing information:
- [Nemotron-Cascade-SFT-Stage-1](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-SFT-Stage-1)
- [Nemotron-Cascade-SFT-Stage-2](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-SFT-Stage-2)
## Citation
```bibtex
@misc{nemotron-sft-dataset,
title={Nemotron SFT Dataset},
author={Created from NVIDIA Nemotron-Cascade-SFT-Stage-1 and Stage-2},
year={2024},
note={Tokenized with Qwen/Qwen3-0.6B}
}
```
Provider:
kshitijthakkar



