
kshitijthakkar/nemotron-sft-balanced-2b-v1

Hugging Face · Updated 2026-02-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/kshitijthakkar/nemotron-sft-balanced-2b-v1
Resource description:
# Nemotron SFT Dataset

## Overview

This dataset is a curated supervised fine-tuning (SFT) dataset built from NVIDIA's Nemotron-Cascade-SFT-Stage-1 and Stage-2 datasets.

## Statistics

- **Total Samples**: 200,000
- **Total Tokens**: 1,252,287,904
- **Average Tokens per Sample**: 6261.4
- **Tokenizer**: Qwen/Qwen3-0.6B
- **Random Seed**: 42
- **Strategy**: balanced

## Subset Distribution

| Subset | Samples | Tokens | Target | Completion | Avg Tokens/Sample |
|--------|---------|--------|--------|------------|-------------------|
| Stage-1/math | 20,000 | 151,546,125 | 20,000 | 100.0% | 7577.3 |
| Stage-1/code | 20,000 | 149,875,360 | 20,000 | 100.0% | 7493.8 |
| Stage-1/science | 20,000 | 88,176,211 | 20,000 | 100.0% | 4408.8 |
| Stage-1/general | 20,000 | 30,166,102 | 20,000 | 100.0% | 1508.3 |
| Stage-2/math | 16,000 | 245,808,608 | 16,000 | 100.0% | 15363.0 |
| Stage-2/code | 16,000 | 196,814,995 | 16,000 | 100.0% | 12300.9 |
| Stage-2/science | 16,000 | 137,249,291 | 16,000 | 100.0% | 8578.1 |
| Stage-2/general | 16,000 | 24,844,951 | 16,000 | 100.0% | 1552.8 |
| Stage-2/tool_calling | 20,000 | 99,972,206 | 20,000 | 100.0% | 4998.6 |
| Stage-2/instruction-following | 20,000 | 20,282,455 | 20,000 | 100.0% | 1014.1 |
| Stage-2/swe_repair | 6,000 | 56,577,614 | 6,000 | 100.0% | 9429.6 |
| Stage-2/swe_localization | 6,000 | 31,229,042 | 6,000 | 100.0% | 5204.8 |
| Stage-2/swe_testgen | 4,000 | 19,744,944 | 4,000 | 100.0% | 4936.2 |

## Available Strategies

- **balanced**: Balanced mix across all subsets (40% Stage-1, 60% Stage-2)
- **math-focused**: Emphasize math from both stages
- **code-focused**: Emphasize code and SWE tasks
- **general-focused**: Emphasize general instruction following
- **stage1-only**: Only use Stage-1 subsets
- **stage2-only**: Only use Stage-2 subsets
- **advanced**: Stage-2 heavy with tool calling and SWE tasks

## Usage

### Loading the Dataset

```python
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk("./nemotron_sft_dataset/hf_dataset")

# Or load from parquet
import pandas as pd

df = pd.read_parquet("./nemotron_sft_dataset/dataset.parquet")
```

### Training Example

```python
from datasets import load_from_disk
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load dataset
dataset = load_from_disk("./nemotron_sft_dataset/hf_dataset")

# Load model and tokenizer
model_name = "your-base-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Some tokenizers ship without a pad token; reuse EOS so batches can be padded
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Trainer needs input_ids, so tokenize the "text" field with the model's own tokenizer.
# max_length=4096 is an assumed cap; adjust it to your model and memory budget.
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

train_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Causal LM collator: pads each batch and copies input_ids into labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()
```

## Dataset Fields

- **text**: The raw formatted conversation text
- **formatted_text**: The formatted text (same as text for SFT data)
- **encoded_text**: Tokenized version of the text (list of token IDs)
- **source**: Source subset name (e.g., "Stage-1/math", "Stage-2/tool_calling")
- **dataset_id**: Original Hugging Face dataset ID
- **config_name**: Configuration name within the dataset
- **stage**: Stage number (1 or 2)
- **token_count**: Number of tokens in the sample

## Strategy Used

**balanced**: Balanced mix across all subsets (40% Stage-1, 60% Stage-2) for general-purpose fine-tuning

## License

Follows the licensing of NVIDIA Nemotron-Cascade-SFT datasets. Please refer to the original dataset pages for detailed licensing information:

- [Nemotron-Cascade-SFT-Stage-1](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-SFT-Stage-1)
- [Nemotron-Cascade-SFT-Stage-2](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-SFT-Stage-2)

## Citation

```bibtex
@misc{nemotron-sft-dataset,
  title={Nemotron SFT Dataset},
  author={Created from NVIDIA Nemotron-Cascade-SFT-Stage-1 and Stage-2},
  year={2024},
  note={Tokenized with Qwen/Qwen3-0.6B}
}
```
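
As a sanity check on the Statistics and Subset Distribution figures in this card, the short sketch below recomputes the totals, the average tokens per sample, and the Stage-1 share of the balanced strategy from the per-subset numbers. The `SUBSETS` dictionary and the `summarize` helper are illustrative names introduced here, not part of the dataset or its tooling; the sample and token counts are copied from the table.

```python
# (samples, tokens) per subset, copied from the Subset Distribution table.
# SUBSETS and summarize() are hypothetical names used only for this check.
SUBSETS = {
    "Stage-1/math": (20_000, 151_546_125),
    "Stage-1/code": (20_000, 149_875_360),
    "Stage-1/science": (20_000, 88_176_211),
    "Stage-1/general": (20_000, 30_166_102),
    "Stage-2/math": (16_000, 245_808_608),
    "Stage-2/code": (16_000, 196_814_995),
    "Stage-2/science": (16_000, 137_249_291),
    "Stage-2/general": (16_000, 24_844_951),
    "Stage-2/tool_calling": (20_000, 99_972_206),
    "Stage-2/instruction-following": (20_000, 20_282_455),
    "Stage-2/swe_repair": (6_000, 56_577_614),
    "Stage-2/swe_localization": (6_000, 31_229_042),
    "Stage-2/swe_testgen": (4_000, 19_744_944),
}


def summarize(subsets):
    """Aggregate per-subset (samples, tokens) pairs into dataset-level statistics."""
    total_samples = sum(samples for samples, _ in subsets.values())
    total_tokens = sum(tokens for _, tokens in subsets.values())
    stage1_samples = sum(
        samples for name, (samples, _) in subsets.items() if name.startswith("Stage-1/")
    )
    return {
        "total_samples": total_samples,
        "total_tokens": total_tokens,
        "avg_tokens_per_sample": round(total_tokens / total_samples, 1),
        "stage1_share": stage1_samples / total_samples,
    }


summary = summarize(SUBSETS)
print(summary)
```

The recomputed totals agree with the card: 200,000 samples, 1,252,287,904 tokens, 6261.4 tokens per sample on average, and a 40% Stage-1 share by sample count. Note that the 40/60 split is by samples; by tokens, Stage-2 accounts for a larger share because its math, code, and science samples are longer.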
Provider:
kshitijthakkar