albertlungu/final-nous-corpus
Hugging Face · Updated 2026-03-26 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/albertlungu/final-nous-corpus
Resource description:
# Nous Training Corpus
## Dataset Description
This dataset contains the pre-training corpus for the Nous multimodal model.
- **Total tokens:** 170,756,572,334 (~170.8B)
- **Uncompressed size:** 651.4 GB
- **Tokenizer:** TikToken cl100k_base
- **Format:** Plain text, double-newline separated documents
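Since documents are separated by blank lines, a reader that sees the corpus one line at a time has to regroup lines into documents. The following is a minimal sketch of that regrouping, using only the standard library; `iter_documents` is a hypothetical helper name, not something shipped with the dataset, and whether a given loader yields lines or whole documents depends on how it is configured.

```python
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Group a stream of text lines into documents separated by blank lines."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "":
            # A blank line closes the current document (if any).
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        else:
            buffer.append(line)
    # Emit the last document if the stream does not end with a blank line.
    if buffer:
        yield "\n".join(buffer)


sample = ["First doc, line 1\n", "First doc, line 2\n", "\n", "Second doc\n"]
documents = list(iter_documents(sample))
```

Each yielded string can then be passed to the tokenizer as one document.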
## Data Sources
The corpus is a mix of high-quality text sources, with approximate token counts:
- FineWeb-Edu (80B tokens)
- OpenWebMath (30B tokens)
- DCLM-Baseline (30B tokens)
- OpenMathReasoning (20B tokens)
- OpenR1-Math (20B tokens)
- ML-ArXiv (20B tokens)
- Wikipedia (15B tokens)
- PG-19 Books (15B tokens)
- GSM8K Enhanced (15B tokens)
- The Stack (10B tokens)
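The per-source figures listed above add up to 255B tokens, which is more than the ~170.8B total given in the dataset description, so they may be better read as nominal budgets than as exact counts in the final corpus. The sketch below takes the listed numbers at face value and computes each source's share of the mix; `SOURCES_B` and `mixture_shares` are illustrative names.

```python
# Token counts in billions, copied from the list above.
SOURCES_B = {
    "FineWeb-Edu": 80,
    "OpenWebMath": 30,
    "DCLM-Baseline": 30,
    "OpenMathReasoning": 20,
    "OpenR1-Math": 20,
    "ML-ArXiv": 20,
    "Wikipedia": 15,
    "PG-19 Books": 15,
    "GSM8K Enhanced": 15,
    "The Stack": 10,
}


def mixture_shares(sources):
    """Return each source's fraction of the summed token counts."""
    total = sum(sources.values())
    return {name: count / total for name, count in sources.items()}


total_b = sum(SOURCES_B.values())
shares = mixture_shares(SOURCES_B)

# Print sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<20} {share:6.1%}")
print(f"Listed total: {total_b}B tokens")
```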
## Usage
### Streaming (recommended when local storage is limited, e.g. 32 GB)
```python
from datasets import load_dataset

# Stream examples on demand instead of downloading the full 651.4 GB corpus.
dataset = load_dataset("albertlungu/final-nous-corpus", split="train", streaming=True)

for example in dataset:
    text = example["text"]
    # Tokenize and train...
```
### With Streaming Dataloader
```python
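# create_dataloader comes from the project-local module src.data.streaming_dataset,
# not from the `datasets` library.
from src.data.streaming_dataset import create_dataloader

dataloader = create_dataloader(
    repo_id="albertlungu/final-nous-corpus",
    batch_size=8,
    seq_length=4096,
    rank=0,        # this process's rank (presumably used for distributed sharding)
    world_size=4,  # total number of processes
)

for batch in dataloader:
    input_ids = batch["input_ids"]  # [8, 4096]
    labels = batch["labels"]
    # Train...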
```
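The internals of `create_dataloader` are not shown here. As a rough illustration of what a loader with `rank`, `world_size`, and `seq_length` parameters typically has to do, the sketch below shards a stream across processes and packs tokenized documents into fixed-length samples. Both `shard_for_rank` and `pack_sequences` are hypothetical helpers, and the choice to emit `labels` as `input_ids` shifted by one token is an assumption, not a statement about the project's code.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def shard_for_rank(examples: Iterable, rank: int, world_size: int) -> Iterator:
    """Yield every world_size-th example, starting at index `rank`."""
    return islice(examples, rank, None, world_size)


def pack_sequences(token_streams: Iterable[List[int]], seq_length: int) -> Iterator[dict]:
    """Concatenate tokenized documents and cut them into fixed-length samples."""
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        # seq_length + 1 tokens are needed so labels can be input_ids shifted by one.
        while len(buffer) >= seq_length + 1:
            chunk = buffer[: seq_length + 1]
            yield {"input_ids": chunk[:-1], "labels": chunk[1:]}
            # Keep the last token so it becomes the first input of the next sample.
            buffer = buffer[seq_length:]


# Toy example: 10 "documents" of token ids, 4 processes, this one is rank 1.
docs = [[i, i, i] for i in range(10)]
my_docs = list(shard_for_rank(docs, rank=1, world_size=4))
samples = list(pack_sequences(my_docs, seq_length=4))
```

With a real tokenizer, each document's token list would replace the toy lists above. Leftover tokens shorter than one full sample are dropped at the end of the stream in this sketch.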
## License
This dataset is a compilation of publicly available sources. Each component retains its original license.
## Citation
If you use this dataset, please cite the original sources.
Provider:
albertlungu



