sbordt/OLMo-2-179M-Exp-NoiseVectors

Source: Hugging Face. Updated 2026-04-27; indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/sbordt/OLMo-2-179M-Exp-NoiseVectors
Resource description:
The OLMo-2-179M-Exp Noise Vectors dataset contains the Gaussian noise vectors that were added to the input embeddings during pretraining of `sbordt/OLMo-2-179M-Exp`, a 179M-parameter OLMo-2-style model with d_model=576. Noise was injected in 51,200 poisoned pretraining batches; the release is a subsample of those batches, drawn uniformly at random at a rate of 1% within each chunk of 1,000 batches, for a total of 480 rows. For each poisoned batch, Gaussian noise of shape (4096, 576) was drawn and added to the input-embedding activations of the first sequence in the batch, before the first transformer layer. The seed for the noise is derived deterministically from the sequence itself. The dataset has four columns: batch_idx (training batch index), sequence_seed (seed used by torch.Generator), first_sequence (token ids of the poisoned sequence), and gaussian_noise (the noise tensor).
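
The injection step described above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' training code: the function names `make_noise` and `poison_first_sequence` are hypothetical, the noise is assumed to be standard normal (the description does not state a scale), and the seed is taken as given from the `sequence_seed` column because the description does not say how it is derived from the sequence.

```python
import torch


def make_noise(seed: int, seq_len: int = 4096, d_model: int = 576) -> torch.Tensor:
    """Draw a (seq_len, d_model) Gaussian tensor from a seeded torch.Generator.

    Assumption: unit-variance noise on CPU; the real run may differ in
    scale, dtype, or device.
    """
    gen = torch.Generator().manual_seed(seed)
    return torch.randn(seq_len, d_model, generator=gen)


def poison_first_sequence(embeddings: torch.Tensor, seed: int) -> torch.Tensor:
    """Add seeded noise to the first sequence of a batch of input embeddings.

    `embeddings` has shape (batch, seq_len, d_model) and represents the
    activations before the first transformer layer. A new tensor is
    returned; the input is left unchanged.
    """
    _, seq_len, d_model = embeddings.shape
    noise = make_noise(seed, seq_len, d_model)
    poisoned = embeddings.clone()
    poisoned[0] += noise  # only the first sequence in the batch is touched
    return poisoned
```

Because the generator is seeded, calling `make_noise` twice with the same seed yields the same tensor. Whether this reproduces the released `gaussian_noise` values exactly depends on details the description does not give, such as device, dtype, and the sampling call used during training.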
Provider:
sbordt