
MaziyarPanahi/Nemotron-Cascade-2-SFT-Data-Small

Source: Hugging Face. Updated 2026-03-22. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/MaziyarPanahi/Nemotron-Cascade-2-SFT-Data-Small
Description:
---
license: other
license_name: nvidia-open-model-license
license_link: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
language:
- en
tags:
- sft
- instruction-tuning
- math
- science
- chat
- safety
- code
- agent
dataset_info:
  features:
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: prompt
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  - name: completion
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 111140015782
    num_examples: 4898804
  download_size: 54155279870
  dataset_size: 111140015782
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Nemotron-Cascade-2-SFT-Data-Small

A **20% random sample** of [nvidia/Nemotron-Cascade-2-SFT-Data](https://huggingface.co/datasets/nvidia/Nemotron-Cascade-2-SFT-Data), merged into a single `train` split with **4,898,804 rows**.

## Subsets included (all merged)

| Original subset | Files sampled | ~Rows sampled |
|---|---|---|
| math | math_notool, math_proof, math_tool | ~1,045,266 |
| science | science | ~544,383 |
| chat | chat_part_1 – chat_part_4 | ~2,794,866 |
| instruction_following | instruction_following | ~163,869 |
| safety | safety | ~693 |
| conversational_agent | conversational_agent | ~164,264 |
| swe | swe_agentic, swe_agentless | ~88,174 |
| terminal_agent | terminal_agent | ~97,289 |

## Schema

```python
{
    "domain": str,      # e.g. "math_notool", "chat", "swe_agentic"
    "source": str,      # upstream data source
    "messages": list[{"role": str, "content": str}],
    "generator": str,   # model that generated the response
}
```

## Usage

```python
from datasets import load_dataset

ds = load_dataset("MaziyarPanahi/Nemotron-Cascade-2-SFT-Data-Small", split="train")
```

## Sampling details

- Sample rate: 20% Bernoulli per source file
- Random seed: 42
- Output format: Parquet (zstd compressed, 500K rows/shard, 10 shards, ~35 GB)
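
The sampling details above describe a 20% Bernoulli sample with seed 42. The script that produced this dataset is not shown, so the following is only a minimal sketch of what Bernoulli sampling means: each row is kept independently with probability 0.2, so the sample size is close to, but not exactly, 20% of the input. The helper name `bernoulli_sample`, the toy `rows` list, and the choice of one seeded generator per file are assumptions made for illustration.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def bernoulli_sample(rows: Iterable[T], rate: float = 0.2, seed: int = 42) -> Iterator[T]:
    """Yield each row independently with probability `rate` (hypothetical helper)."""
    rng = random.Random(seed)  # assumption: a fresh seeded generator per source file
    for row in rows:
        # random() is uniform on [0, 1), so this keeps a row with probability `rate`.
        if rng.random() < rate:
            yield row


# Toy stand-in for one source file; the real files hold chat-style records.
rows = [{"domain": "math_notool", "id": i} for i in range(10_000)]
sample = list(bernoulli_sample(rows))

print(f"kept {len(sample)} of {len(rows)} rows")
```

Because the generator is seeded, rerunning the sketch on the same input in the same order returns the same rows. It streams over the input, which suits files too large to hold in memory.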
Provider:
MaziyarPanahi