kshitijthakkar/nemotron-balanced-1b-v2

Name: kshitijthakkar/nemotron-balanced-1b-v2
Creator: kshitijthakkar
Published: 2026-03-20 13:07:40
License: No description available

Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/kshitijthakkar/nemotron-balanced-1b-v2

Resource description:

# Nemotron Balanced 1B Token Dataset

## Overview

This dataset is a balanced 1 billion token subset of NVIDIA's Nemotron-Pretraining-Specialized-v1 and v1.1 datasets.

## Statistics

- **Total Samples**: 1,299,680
- **Total Tokens**: 999,999,950
- **Tokenizer**: Qwen/Qwen3-0.6B
- **Random Seed**: 42

## Subset Distribution

| Subset | Samples | Tokens | Target | Completion |
|--------|---------|--------|--------|------------|
| Nemotron-Pretraining-Math-Textbooks | 74,593 | 150,000,000 | 150,000,000 | 100.0% |
| Nemotron-Pretraining-STEM-SFT | 23,755 | 120,000,000 | 120,000,000 | 100.0% |
| Nemotron-Pretraining-Scientific-Coding | 117,174 | 149,999,966 | 150,000,000 | 100.0% |
| Nemotron-Pretraining-Wiki-Rewrite | 102,778 | 120,000,000 | 120,000,000 | 100.0% |
| Nemotron-Pretraining-RQA | 13,406 | 100,000,000 | 100,000,000 | 100.0% |
| Nemotron-Pretraining-InfiniByte-Reasoning | 6,316 | 110,000,000 | 110,000,000 | 100.0% |
| Nemotron-Pretraining-Formal-Logic | 411,082 | 60,000,000 | 60,000,000 | 100.0% |
| Nemotron-Pretraining-Economics | 295,321 | 39,999,984 | 40,000,000 | 100.0% |
| Nemotron-Pretraining-Multiple-Choice | 113,592 | 50,000,000 | 50,000,000 | 100.0% |
| Nemotron-Pretraining-Unconditional-Algorithmic | 33,069 | 50,000,000 | 50,000,000 | 100.0% |
| Nemotron-Pretraining-Code-Concepts | 108,594 | 50,000,000 | 50,000,000 | 100.0% |

## Usage

```python
from datasets import load_from_disk
import pandas as pd

# Load the dataset
dataset = load_from_disk("./data/nemotron_balanced_1b_v2/hf_dataset")

# Or load from parquet
df = pd.read_parquet("./data/nemotron_balanced_1b_v2/dataset.parquet")
```

## Dataset Fields

- **text**: The raw text content
- **formatted_text**: The formatted text (same as text for pretraining data)
- **encoded_text**: Tokenized version of the text (list of token IDs)
- **source**: Source subset name
- **token_count**: Number of tokens in the sample

## License

Follows the licensing of NVIDIA Nemotron-Pretraining-Specialized-v1 and v1.1:

- Most subsets: CC BY 4.0
- Wiki-Rewrite and Scientific-Coding (v1): CC BY-SA 4.0 and GFDL

## Citation

```bibtex
@misc{nemotron-balanced-1b,
  title={Nemotron Balanced 1B Token Dataset},
  author={Created from NVIDIA Nemotron-Pretraining-Specialized-v1},
  year={2024},
  note={Tokenized with Qwen/Qwen3-0.6B}
}
```

Provided by:

kshitijthakkar
