semran1/Nemotron-Pretraining-Specialized-v1

Name: semran1/Nemotron-Pretraining-Specialized-v1
Creator: semran1
Published: 2025-12-16 00:17:45
License: No description available

Hugging Face | Updated 2025-12-16 | Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/semran1/Nemotron-Pretraining-Specialized-v1
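As an illustration of how the link above could be used programmatically, the following is a minimal sketch, not an official download procedure. It assumes that the mirror host speaks the same API as the Hugging Face Hub, that the `huggingface_hub` package is installed, and that the repository is publicly readable; none of these is stated on this page. The helper names `split_dataset_url` and `download` are hypothetical.

```python
import os
from urllib.parse import urlparse

# The download link listed on this page.
DATASET_URL = "https://hf-mirror.com/datasets/semran1/Nemotron-Pretraining-Specialized-v1"


def split_dataset_url(url: str) -> tuple[str, str]:
    """Split a Hub-style dataset URL into (endpoint, repo_id).

    Hypothetical helper: expects a path of the form /datasets/<owner>/<name>.
    """
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"not a dataset URL: {url}")
    endpoint = f"{parsed.scheme}://{parsed.netloc}"
    repo_id = "/".join(parts[1:])
    return endpoint, repo_id


def download(url: str = DATASET_URL) -> str:
    """Download a snapshot of the dataset and return the local directory.

    Assumption: the endpoint is Hub-compatible. HF_ENDPOINT is read when
    huggingface_hub is first imported, so it is set before the import here;
    it has no effect if the package was already imported elsewhere.
    """
    endpoint, repo_id = split_dataset_url(url)
    os.environ["HF_ENDPOINT"] = endpoint
    from huggingface_hub import snapshot_download  # third-party dependency

    return snapshot_download(repo_id=repo_id, repo_type="dataset")
```

Calling `download()` fetches every file in the repository, which may be large for a pretraining corpus. The `allow_patterns` argument of `snapshot_download` can restrict the download to selected files once the repository layout is known.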

Description:

The Nemotron-Pre-Training-Dataset-v2.1 extends the previously released Nemotron pretraining datasets with refreshed, higher-quality, and more diverse data across math, code, English Common Crawl, and large-scale synthetic corpora. Designed for the NVIDIA Nemotron 3 family of LLMs, the dataset introduces new Common Crawl code extraction, 2.5T new English web tokens, updated GitHub-sourced source-code corpora, and specialized STEM reasoning datasets. These additions are intended to be used together with, not as replacements for, the existing Nemotron pretraining datasets, providing an expanded, modern foundation for training leading LLMs. The dataset comes in four main categories: Nemotron-CC-Code-v1, Nemotron-CC-v2.1, Nemotron-Pretraining-Code-v2, and Nemotron-Pretraining-Specialized-v1, each with specific use cases and data characteristics. The dataset is ready for commercial use, with specific licensing terms for the different subsets.

Provider:

semran1
