sbordt/OLMo-2-179M-Exp-NoiseVectors

Name: sbordt/OLMo-2-179M-Exp-NoiseVectors
Creator: sbordt
Published: 2026-04-27 12:47:41
License: Not specified

Source: Hugging Face. Updated: 2026-04-27. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/sbordt/OLMo-2-179M-Exp-NoiseVectors

Description:

The OLMo-2-179M-Exp Noise Vectors dataset contains Gaussian noise vectors that were added to the input embeddings during pretraining of the `sbordt/OLMo-2-179M-Exp` model (a 179M-parameter OLMo-2-style model with d_model=576). The release is a subsample of the noise from 51,200 poisoned pretraining batches: 1% of the batches in each 1,000-batch chunk, chosen uniformly at random, for a total of 480 rows. For each poisoned batch, Gaussian noise of shape (4096, 576) was drawn and added to the input-embedding activations of the first sequence in the batch, before the first transformer layer. The noise seed is derived deterministically from the sequence itself. The dataset has four columns: batch_idx (training batch index), sequence_seed (seed passed to torch.Generator), first_sequence (token ids of the poisoned sequence), and gaussian_noise (the noise tensor).
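
The seeded-noise step described above can be sketched in PyTorch as follows. This is an illustrative sketch, not the authors' training code: it assumes standard-normal noise drawn on the CPU in float32, and the helper names `draw_noise` and `add_noise_to_first_sequence` are hypothetical. The description does not state the noise scale, dtype, device, or PyTorch version, so noise regenerated this way may not match the stored `gaussian_noise` column exactly.

```python
import torch

SEQ_LEN, D_MODEL = 4096, 576  # noise shape stated in the dataset description


def draw_noise(sequence_seed: int, shape=(SEQ_LEN, D_MODEL)) -> torch.Tensor:
    """Draw Gaussian noise from a seeded generator.

    Assumption: standard normal, CPU generator, default float32 dtype.
    """
    generator = torch.Generator(device="cpu")
    generator.manual_seed(sequence_seed)
    return torch.randn(shape, generator=generator)


def add_noise_to_first_sequence(embeddings: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Return a copy of `embeddings` (batch, seq, d_model) with `noise` added to sequence 0 only."""
    poisoned = embeddings.clone()
    poisoned[0] += noise
    return poisoned


if __name__ == "__main__":
    noise = draw_noise(sequence_seed=1234)  # 1234 is a placeholder, not a seed from the dataset
    embeddings = torch.zeros(2, SEQ_LEN, D_MODEL)  # toy stand-in for input-embedding activations
    poisoned = add_noise_to_first_sequence(embeddings, noise)
    # Only the first sequence changes; the second is left untouched.
    print(noise.shape, torch.equal(poisoned[1], embeddings[1]))
```

Because the generator is seeded per sequence, calling `draw_noise` twice with the same `sequence_seed` returns identical tensors, which is what makes a stored seed sufficient to reproduce a noise vector under the assumptions above.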

Provided by:

sbordt
