
albertlungu/final-nous-corpus

Source: Hugging Face. Updated 2026-03-26. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/albertlungu/final-nous-corpus
Description:
# Nous Training Corpus

## Dataset Description

This dataset contains the pre-training corpus for the Nous multimodal model.

- **Total tokens:** 170,756,572,334 (~170.8B)
- **Uncompressed size:** 651.4 GB
- **Tokenizer:** TikToken cl100k_base
- **Format:** Plain text, double-newline separated documents

## Data Sources

Mix of high-quality text data:

- FineWeb-Edu (80B tokens)
- OpenWebMath (30B tokens)
- DCLM-Baseline (30B tokens)
- OpenMathReasoning (20B tokens)
- OpenR1-Math (20B tokens)
- ML-ArXiv (20B tokens)
- Wikipedia (15B tokens)
- PG-19 Books (15B tokens)
- GSM8K Enhanced (15B tokens)
- The Stack (10B tokens)

## Usage

### Streaming (Recommended for 32GB storage)

```python
from datasets import load_dataset

dataset = load_dataset("albertlungu/final-nous-corpus", split="train", streaming=True)

for example in dataset:
    text = example["text"]
    # Tokenize and train...
```

### With Streaming Dataloader

```python
from src.data.streaming_dataset import create_dataloader

dataloader = create_dataloader(
    repo_id="albertlungu/final-nous-corpus",
    batch_size=8,
    seq_length=4096,
    rank=0,
    world_size=4,
)

for batch in dataloader:
    input_ids = batch["input_ids"]  # [8, 4096]
    labels = batch["labels"]
    # Train...
```

## License

This dataset is a compilation of publicly available sources. Each component retains its original license.

## Citation

If you use this dataset, please cite the original sources.
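The `src.data.streaming_dataset` module used in the dataloader example is not included in this listing, so what `create_dataloader` does internally is unknown. As a rough illustration of what the "Tokenize and train..." step might involve, the sketch below splits text on double newlines, shards documents across workers by `rank` and `world_size`, and packs token ids into fixed-length sequences. The helper names (`split_documents`, `shard_for_rank`, `pack_sequences`), the round-robin sharding, the end-of-document token, and the one-position label shift are all assumptions made for illustration, not the dataset author's implementation.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def split_documents(raw: str) -> List[str]:
    """Split plain text into documents on double newlines, dropping empties."""
    return [doc.strip() for doc in raw.split("\n\n") if doc.strip()]


def shard_for_rank(items: Iterable[str], rank: int, world_size: int) -> Iterator[str]:
    """Hypothetical round-robin sharding: worker `rank` keeps every
    `world_size`-th item, so workers see disjoint subsets."""
    for index, item in enumerate(items):
        if index % world_size == rank:
            yield item


def pack_sequences(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    seq_length: int,
    eos_id: int,
) -> Iterator[Dict[str, List[int]]]:
    """Concatenate token ids across documents and emit fixed-length samples.

    Assumption: labels are the inputs shifted by one position (next-token
    prediction), so each sample is cut from a window of seq_length + 1 tokens.
    A trailing partial window is left in the buffer and never emitted.
    """
    buffer: List[int] = []
    for text in texts:
        buffer.extend(encode(text))
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_length + 1:
            window = buffer[: seq_length + 1]
            # Keep the last token of the window as the first input of the next one.
            buffer = buffer[seq_length:]
            yield {"input_ids": window[:-1], "labels": window[1:]}
```

To match the tokenizer named in the dataset description, `encode` could be `tiktoken.get_encoding("cl100k_base").encode_ordinary` and `eos_id` the encoding's `eot_token`. Passing the tokenizer in as a callable keeps the packing logic easy to check with a toy tokenizer and without downloading anything.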
Provided by:
albertlungu