kshitijthakkar/nemotron-balanced-1b-v2
Source: Hugging Face. Updated 2026-03-20; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/kshitijthakkar/nemotron-balanced-1b-v2
Resource description:
# Nemotron Balanced 1B Token Dataset
## Overview
This dataset is a balanced subset of roughly 1 billion tokens (999,999,950, as counted with the Qwen/Qwen3-0.6B tokenizer) drawn from NVIDIA's Nemotron-Pretraining-Specialized-v1 and v1.1 datasets.
## Statistics
- **Total Samples**: 1,299,680
- **Total Tokens**: 999,999,950
- **Tokenizer**: Qwen/Qwen3-0.6B
- **Random Seed**: 42
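The card does not describe how samples were drawn, so the following is only a minimal sketch of one way a seeded, token-budgeted selection could work. The function name `sample_to_token_target`, the `(text, token_count)` pair layout, and the "skip samples that would overshoot" rule are all assumptions made for illustration, not the author's pipeline.

```python
import random


def sample_to_token_target(samples, target_tokens, seed=42):
    """Shuffle with a fixed seed, then keep whole samples up to a token budget.

    `samples` is a list of (text, token_count) pairs. This is an assumed
    procedure for illustration, not the dataset author's actual method.
    """
    rng = random.Random(seed)  # 42 mirrors the "Random Seed" listed on the card
    order = list(range(len(samples)))
    rng.shuffle(order)

    selected, total = [], 0
    for i in order:
        _, n_tokens = samples[i]
        if total + n_tokens > target_tokens:
            continue  # skip any sample that would push past the budget
        selected.append(samples[i])
        total += n_tokens
        if total == target_tokens:
            break  # budget met exactly, nothing more can fit
    return selected, total


# Hypothetical toy pool: 100 samples of 10 to 59 tokens each.
toy_pool = [(f"doc-{i}", 10 + (i * 7) % 50) for i in range(100)]
picked, picked_tokens = sample_to_token_target(toy_pool, target_tokens=500)
print(len(picked), picked_tokens)
```

Because the shuffle uses its own `random.Random(seed)` instance, repeated calls with the same pool and seed return the same selection, which is the property a fixed seed is meant to provide.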
## Subset Distribution
| Subset | Samples | Tokens | Target | Completion |
|--------|---------|--------|--------|------------|
| Nemotron-Pretraining-Math-Textbooks | 74,593 | 150,000,000 | 150,000,000 | 100.0% |
| Nemotron-Pretraining-STEM-SFT | 23,755 | 120,000,000 | 120,000,000 | 100.0% |
| Nemotron-Pretraining-Scientific-Coding | 117,174 | 149,999,966 | 150,000,000 | 100.0% |
| Nemotron-Pretraining-Wiki-Rewrite | 102,778 | 120,000,000 | 120,000,000 | 100.0% |
| Nemotron-Pretraining-RQA | 13,406 | 100,000,000 | 100,000,000 | 100.0% |
| Nemotron-Pretraining-InfiniByte-Reasoning | 6,316 | 110,000,000 | 110,000,000 | 100.0% |
| Nemotron-Pretraining-Formal-Logic | 411,082 | 60,000,000 | 60,000,000 | 100.0% |
| Nemotron-Pretraining-Economics | 295,321 | 39,999,984 | 40,000,000 | 100.0% |
| Nemotron-Pretraining-Multiple-Choice | 113,592 | 50,000,000 | 50,000,000 | 100.0% |
| Nemotron-Pretraining-Unconditional-Algorithmic | 33,069 | 50,000,000 | 50,000,000 | 100.0% |
| Nemotron-Pretraining-Code-Concepts | 108,594 | 50,000,000 | 50,000,000 | 100.0% |
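As a quick consistency check, the per-subset rows in the table above can be summed and compared with the totals in the Statistics section. The sketch below only re-enters the numbers from the table; the `SUBSETS` name and the derived columns (token share, mean tokens per sample) are additions for illustration.

```python
# (samples, tokens, target) per subset, copied from the table above.
SUBSETS = {
    "Nemotron-Pretraining-Math-Textbooks": (74_593, 150_000_000, 150_000_000),
    "Nemotron-Pretraining-STEM-SFT": (23_755, 120_000_000, 120_000_000),
    "Nemotron-Pretraining-Scientific-Coding": (117_174, 149_999_966, 150_000_000),
    "Nemotron-Pretraining-Wiki-Rewrite": (102_778, 120_000_000, 120_000_000),
    "Nemotron-Pretraining-RQA": (13_406, 100_000_000, 100_000_000),
    "Nemotron-Pretraining-InfiniByte-Reasoning": (6_316, 110_000_000, 110_000_000),
    "Nemotron-Pretraining-Formal-Logic": (411_082, 60_000_000, 60_000_000),
    "Nemotron-Pretraining-Economics": (295_321, 39_999_984, 40_000_000),
    "Nemotron-Pretraining-Multiple-Choice": (113_592, 50_000_000, 50_000_000),
    "Nemotron-Pretraining-Unconditional-Algorithmic": (33_069, 50_000_000, 50_000_000),
    "Nemotron-Pretraining-Code-Concepts": (108_594, 50_000_000, 50_000_000),
}

total_samples = sum(s for s, _, _ in SUBSETS.values())
total_tokens = sum(t for _, t, _ in SUBSETS.values())
total_target = sum(g for _, _, g in SUBSETS.values())
print(f"samples: {total_samples:,}  tokens: {total_tokens:,}  target: {total_target:,}")

# Derived per-subset figures: share of all tokens and mean sample length.
for name, (samples, tokens, target) in SUBSETS.items():
    share = 100 * tokens / total_tokens
    mean_len = tokens / samples
    print(f"{name:<48} {share:5.1f}% of tokens, {mean_len:9.1f} tokens/sample")
```

The sums reproduce the stated totals of 1,299,680 samples and 999,999,950 tokens against a combined target of 1,000,000,000. The mean sample length varies widely between subsets, from roughly 146 tokens for Formal-Logic to roughly 17,400 for InfiniByte-Reasoning, so sample counts alone are a poor guide to how much text each subset contributes.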
## Usage
```python
import pandas as pd
from datasets import load_from_disk

# Load the saved Hugging Face dataset from a local directory
dataset = load_from_disk("./data/nemotron_balanced_1b_v2/hf_dataset")

# Or load the Parquet export into a pandas DataFrame
df = pd.read_parquet("./data/nemotron_balanced_1b_v2/dataset.parquet")
```
## Dataset Fields
- **text**: The raw text content
- **formatted_text**: The formatted text (identical to `text` for pretraining data)
- **encoded_text**: Tokenized version of the text (list of token IDs)
- **source**: Source subset name
- **token_count**: Number of tokens in the sample
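The fields listed above lend themselves to a simple per-record sanity check and a per-source token tally. The sketch below works on plain dictionaries; the helper names `check_record` and `tokens_by_source` and the toy records are hypothetical, and it assumes that `token_count` equals the length of `encoded_text`, which the card implies but does not state outright.

```python
from collections import Counter

EXPECTED_FIELDS = {"text", "formatted_text", "encoded_text", "source", "token_count"}


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    missing = EXPECTED_FIELDS - set(record)
    if missing:
        return [f"missing fields: {sorted(missing)}"]
    problems = []
    # Assumption: token_count is the number of IDs in encoded_text.
    if record["token_count"] != len(record["encoded_text"]):
        problems.append("token_count != len(encoded_text)")
    return problems


def tokens_by_source(records):
    """Sum token_count per source subset."""
    totals = Counter()
    for record in records:
        totals[record["source"]] += record["token_count"]
    return totals


# Hypothetical toy records with made-up token IDs, for illustration only.
toy_records = [
    {"text": "a b c", "formatted_text": "a b c", "encoded_text": [11, 12, 13],
     "source": "Nemotron-Pretraining-RQA", "token_count": 3},
    {"text": "d e", "formatted_text": "d e", "encoded_text": [14, 15],
     "source": "Nemotron-Pretraining-RQA", "token_count": 2},
    {"text": "f", "formatted_text": "f", "encoded_text": [16],
     "source": "Nemotron-Pretraining-Economics", "token_count": 1},
]

for record in toy_records:
    print(record["source"], check_record(record))
print(dict(tokens_by_source(toy_records)))
```

With the real data, rows of a loaded `datasets` object iterate as dictionaries, and a slice of the pandas DataFrame can be converted with `df.head(1000).to_dict("records")`, so either can be passed to these helpers.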
## License
This dataset follows the licensing of NVIDIA's Nemotron-Pretraining-Specialized-v1 and v1.1 datasets:
- Most subsets: CC BY 4.0
- Wiki-Rewrite and Scientific-Coding (v1): CC BY-SA 4.0 and GFDL
## Citation
```bibtex
@misc{nemotron-balanced-1b,
title={Nemotron Balanced 1B Token Dataset},
author={Created from NVIDIA Nemotron-Pretraining-Specialized-v1},
year={2024},
note={Tokenized with Qwen/Qwen3-0.6B}
}
```
Provided by:
kshitijthakkar



