nvidia/Nemotron-Pretraining-Specialized-v1

Name: nvidia/Nemotron-Pretraining-Specialized-v1
Creator: nvidia
Published: 2025-12-22 17:17:17
License: No description available

Source: Hugging Face. Updated 2025-12-22; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/nvidia/Nemotron-Pretraining-Specialized-v1

Description:

The Nemotron-Pre-Training-Dataset-v2.1 extends the previously released Nemotron pretraining datasets with refreshed, higher-quality, and more diverse data across math, code, English Common Crawl, and large-scale synthetic corpora. Designed for the NVIDIA Nemotron 3 family of LLMs, the dataset introduces new Common Crawl code extraction, 2.5T new English web tokens, updated GitHub-sourced source-code corpora, and specialized STEM reasoning datasets. These additions are intended to be used together with the existing Nemotron pretraining datasets, not as replacements for them, and provide an expanded, modern foundation for training leading LLMs. The dataset comes in four main categories: Nemotron-CC-Code-v1 (a high-quality code pretraining dataset), Nemotron-CC-v2.1 (general English web data), Nemotron-Pretraining-Code-v2 (an update to and expansion of the source-code corpus), and Nemotron-Pretraining-Specialized-v1 (a collection of synthetic datasets for specialized areas such as STEM reasoning and scientific coding). This dataset is ready for commercial use.
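For readers who want to inspect the data rather than download it in full, the following is a minimal sketch using the Hugging Face `datasets` library in streaming mode. Several things here are assumptions and not stated on this page: the helper names `dataset_url` and `peek` are hypothetical, the subset (config) names are discovered at run time rather than known in advance, and the repository may require accepting terms or logging in before it can be read. Routing through the mirror relies on the `HF_ENDPOINT` environment variable, which has to be set before the library is imported.

```python
import os

REPO_ID = "nvidia/Nemotron-Pretraining-Specialized-v1"
MIRROR_ENDPOINT = "https://hf-mirror.com"  # mirror host taken from the download link


def dataset_url(repo_id: str, endpoint: str = MIRROR_ENDPOINT) -> str:
    """Build the dataset page URL for a given hub endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def peek(repo_id: str = REPO_ID, n: int = 3, use_mirror: bool = False):
    """Stream the first n records of the first listed subset.

    Needs network access and the `datasets` package. Assumes the repository
    is readable with the current credentials.
    """
    if use_mirror:
        # Only takes effect if set before huggingface_hub / datasets are imported.
        os.environ["HF_ENDPOINT"] = MIRROR_ENDPOINT

    from datasets import get_dataset_config_names, load_dataset

    # Subset names are not listed on this page, so query them instead of guessing.
    configs = get_dataset_config_names(repo_id)

    # Without a split argument, streaming mode returns a dict of iterable splits.
    splits = load_dataset(repo_id, configs[0], streaming=True)
    first_split = next(iter(splits))
    rows = list(splits[first_split].take(n))
    return configs, first_split, rows
```

Calling `peek(n=2)` returns the available subset names, the split that was read, and two sample records, without materializing the corpus on disk. Streaming is chosen here because pretraining corpora of this kind are typically far too large for a casual full download.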

Provided by:

nvidia
