
kshitijthakkar/nemotron-balanced-1b-v2

Hugging Face (updated 2026-03-20, indexed 2026-03-29)
Download link:
https://hf-mirror.com/datasets/kshitijthakkar/nemotron-balanced-1b-v2
Resource description:
# Nemotron Balanced 1B Token Dataset

## Overview

This dataset is a balanced 1 billion token subset of NVIDIA's Nemotron-Pretraining-Specialized-v1 and v1.1 datasets.

## Statistics

- **Total Samples**: 1,299,680
- **Total Tokens**: 999,999,950
- **Tokenizer**: Qwen/Qwen3-0.6B
- **Random Seed**: 42

## Subset Distribution

| Subset | Samples | Tokens | Target | Completion |
|--------|---------|--------|--------|------------|
| Nemotron-Pretraining-Math-Textbooks | 74,593 | 150,000,000 | 150,000,000 | 100.0% |
| Nemotron-Pretraining-STEM-SFT | 23,755 | 120,000,000 | 120,000,000 | 100.0% |
| Nemotron-Pretraining-Scientific-Coding | 117,174 | 149,999,966 | 150,000,000 | 100.0% |
| Nemotron-Pretraining-Wiki-Rewrite | 102,778 | 120,000,000 | 120,000,000 | 100.0% |
| Nemotron-Pretraining-RQA | 13,406 | 100,000,000 | 100,000,000 | 100.0% |
| Nemotron-Pretraining-InfiniByte-Reasoning | 6,316 | 110,000,000 | 110,000,000 | 100.0% |
| Nemotron-Pretraining-Formal-Logic | 411,082 | 60,000,000 | 60,000,000 | 100.0% |
| Nemotron-Pretraining-Economics | 295,321 | 39,999,984 | 40,000,000 | 100.0% |
| Nemotron-Pretraining-Multiple-Choice | 113,592 | 50,000,000 | 50,000,000 | 100.0% |
| Nemotron-Pretraining-Unconditional-Algorithmic | 33,069 | 50,000,000 | 50,000,000 | 100.0% |
| Nemotron-Pretraining-Code-Concepts | 108,594 | 50,000,000 | 50,000,000 | 100.0% |

## Usage

```python
from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk("./data/nemotron_balanced_1b_v2/hf_dataset")

# Or load from parquet
import pandas as pd
df = pd.read_parquet("./data/nemotron_balanced_1b_v2/dataset.parquet")
```

## Dataset Fields

- **text**: The raw text content
- **formatted_text**: The formatted text (same as text for pretraining data)
- **encoded_text**: Tokenized version of the text (list of token IDs)
- **source**: Source subset name
- **token_count**: Number of tokens in the sample

## License

Follows the licensing of NVIDIA Nemotron-Pretraining-Specialized-v1 and v1.1:

- Most subsets: CC BY 4.0
- Wiki-Rewrite and Scientific-Coding (v1): CC BY-SA 4.0 and GFDL

## Citation

```bibtex
@misc{nemotron-balanced-1b,
  title={Nemotron Balanced 1B Token Dataset},
  author={Created from NVIDIA Nemotron-Pretraining-Specialized-v1},
  year={2024},
  note={Tokenized with Qwen/Qwen3-0.6B}
}
```
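
## Cross-Checking the Statistics

The totals under Statistics can be reproduced from the Subset Distribution table. The following is a minimal sketch using only the figures printed in that table. The names `SUBSETS` and `summarize` are illustrative assumptions and are not part of the dataset or its tooling. It sums samples and tokens, and derives each subset's share of the token budget and its average tokens per sample.

```python
# Figures copied from the Subset Distribution table: (samples, tokens, target).
# SUBSETS and summarize() are hypothetical helper names for this sketch.
SUBSETS = {
    "Nemotron-Pretraining-Math-Textbooks": (74_593, 150_000_000, 150_000_000),
    "Nemotron-Pretraining-STEM-SFT": (23_755, 120_000_000, 120_000_000),
    "Nemotron-Pretraining-Scientific-Coding": (117_174, 149_999_966, 150_000_000),
    "Nemotron-Pretraining-Wiki-Rewrite": (102_778, 120_000_000, 120_000_000),
    "Nemotron-Pretraining-RQA": (13_406, 100_000_000, 100_000_000),
    "Nemotron-Pretraining-InfiniByte-Reasoning": (6_316, 110_000_000, 110_000_000),
    "Nemotron-Pretraining-Formal-Logic": (411_082, 60_000_000, 60_000_000),
    "Nemotron-Pretraining-Economics": (295_321, 39_999_984, 40_000_000),
    "Nemotron-Pretraining-Multiple-Choice": (113_592, 50_000_000, 50_000_000),
    "Nemotron-Pretraining-Unconditional-Algorithmic": (33_069, 50_000_000, 50_000_000),
    "Nemotron-Pretraining-Code-Concepts": (108_594, 50_000_000, 50_000_000),
}


def summarize(subsets):
    """Return (total_samples, total_tokens, per-subset rows) for the table."""
    total_samples = sum(samples for samples, _, _ in subsets.values())
    total_tokens = sum(tokens for _, tokens, _ in subsets.values())
    rows = []
    for name, (samples, tokens, target) in subsets.items():
        rows.append(
            {
                "subset": name,
                "share": tokens / total_tokens,            # fraction of all tokens
                "completion": tokens / target,             # fraction of target reached
                "avg_tokens_per_sample": tokens / samples,
            }
        )
    return total_samples, total_tokens, rows


total_samples, total_tokens, rows = summarize(SUBSETS)
print(f"samples={total_samples:,} tokens={total_tokens:,}")
# prints: samples=1,299,680 tokens=999,999,950

for row in rows:
    print(
        f"{row['subset']}: share={row['share']:.1%}, "
        f"avg tokens/sample={row['avg_tokens_per_sample']:.0f}"
    )
```

The 50-token gap between the 1,000,000,000-token target and the reported total comes from the two subsets that stop slightly short of their targets (Scientific-Coding by 34 tokens and Economics by 16), which is why both still round to 100.0% completion.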
Provider:
kshitijthakkar