allegrolab/dclm-baseline-500b_toks

Name: allegrolab/dclm-baseline-500b_toks
Creator: allegrolab
Published: 2025-10-23 17:18:00
License: No description available

Source: Hugging Face. Updated 2025-10-23. Indexed 2025-10-25.

Download link:

https://hf-mirror.com/datasets/allegrolab/dclm-baseline-500b_toks

Resource description:

DCLM Baseline 500B Tokens (Decontaminated) is a decontaminated subset of the DCLM-Baseline corpus, prepared specifically for the Hubble memorization research project. The data was processed to remove overlap with the memorization evaluation data and then subsampled to roughly 500 billion tokens of English text. The corpus serves as the foundational training data for all Hubble models. It provides a clean baseline for studying memorization in large language models while attempting to remove the confounding effects of contamination.
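The description names two processing steps, removing overlap with evaluation data and subsampling to a token budget, but does not say how either was done. The sketch below is only an illustration of the general idea, not the Hubble pipeline: it assumes exact word n-gram matching for decontamination and whitespace token counts for the budget. The names `ngrams`, `decontaminate`, `subsample_to_budget`, and `NGRAM_SIZE` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Set, Tuple

# Hypothetical n-gram length chosen for illustration only;
# the source does not state what overlap criterion was used.
NGRAM_SIZE = 3


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(
    docs: Iterable[str], eval_texts: Iterable[str], n: int = NGRAM_SIZE
) -> Iterator[str]:
    """Yield only documents that share no n-gram with any evaluation text."""
    blocked: Set[Tuple[str, ...]] = set()
    for text in eval_texts:
        blocked |= ngrams(text, n)
    for doc in docs:
        if ngrams(doc, n).isdisjoint(blocked):
            yield doc


def subsample_to_budget(
    docs: Iterable[str], token_budget: int, seed: int = 0
) -> List[str]:
    """Shuffle documents, then keep each one that still fits in the budget."""
    pool = list(docs)
    random.Random(seed).shuffle(pool)
    kept: List[str] = []
    used = 0
    for doc in pool:
        size = len(doc.split())  # whitespace tokens stand in for real tokens
        if used + size > token_budget:
            continue  # skip documents that would exceed the budget
        kept.append(doc)
        used += size
    return kept
```

A real pipeline at this scale would count tokens with the model tokenizer and stream the data rather than hold it in memory; the sketch keeps everything in lists for readability.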

Provided by:

allegrolab
