latent-lab/got-activations-llama3.1-405b-base
Source: Hugging Face | Updated 2026-03-27 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/latent-lab/got-activations-llama3.1-405b-base
Description:
---
tags:
- lmprobe
- activations
- interpretability
- meta-llama-llama-3.1-405b
task_categories:
- feature-extraction
language:
- en
license: cc-by-4.0
---
# meta-llama/Llama-3.1-405B — Activation Dataset
Cached activations extracted from [`meta-llama/Llama-3.1-405B`](https://huggingface.co/meta-llama/Llama-3.1-405B) (revision `unknown`).
## Contents
| Tensor | Layers | Dim | Pooling | Shards | Row Bytes |
|--------|--------|-----|---------|--------|-----------|
| hidden_layers | 0-125 | 16384 | - | 12 | - |
- **Prompts:** 7660
- **Format version:** 1.1
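As a rough guide to storage cost, the table above gives 126 layers (0-125) of 16384-dimensional hidden states. The sketch below estimates the bytes needed per token. The element width is an assumption (2 bytes, as for bfloat16 or float16), since this card does not state the stored dtype, and `bytes_per_token` is a hypothetical helper, not part of lmprobe.

```python
def bytes_per_token(num_layers: int = 126, hidden_dim: int = 16384,
                    bytes_per_value: int = 2) -> int:
    """Bytes needed to store one token's hidden state across all layers.

    bytes_per_value=2 is an assumption (bfloat16/float16); use 4 for float32.
    """
    return num_layers * hidden_dim * bytes_per_value


per_token = bytes_per_token()
print(per_token)           # 4128768 bytes
print(per_token / 2**20)   # 3.9375 MiB per token under the 2-byte assumption
```

Multiplying by the total token count (the sum of `num_tokens` over the index) gives an estimate for the full dataset under the same assumption.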
## Load with lmprobe
```python
from lmprobe import pull_dataset, load_activation_dataset
# Option 1: Pull into local cache (enables probe training without re-extraction)
pull_dataset("latent-lab/got-activations-llama3.1-405b-base")
# Option 2: Load tensors directly
tensors, info = load_activation_dataset("latent-lab/got-activations-llama3.1-405b-base")
# tensors["hidden.layer_16"].shape => (N, hidden_dim)
```
## Load without lmprobe (standalone)
```python
import json
import pyarrow.parquet as pq
from safetensors import safe_open

# 1. Read the Parquet index (one row per prompt)
index = pq.read_table("index/train-00000-of-00001.parquet").to_pandas()
print(index.columns)  # text, label, num_tokens, shard_index, row_offset

# 2. Read tensor metadata
with open("lmprobe_info.json") as f:
    info = json.load(f)
print(list(info["tensors"].keys()))  # e.g. ["hidden_layers", "logits_topk"]

# 3. Load a shard. Per-layer files: hidden_layer{L:03d}_shard{S:03d}.safetensors
with safe_open("tensors/hidden_layer000_shard000.safetensors", framework="pt") as f:
    print(list(f.keys()))  # e.g. ["hidden.layer_0"]
    layer_0 = f.get_tensor("hidden.layer_0")

# 4. Map prompt index -> shard row
row = index.iloc[42]
# The offset column appears as both row_offset and token_offset in this card;
# use whichever one the index actually contains.
offset_col = "token_offset" if "token_offset" in index.columns else "row_offset"
tok_off, num_tok = int(row[offset_col]), int(row["num_tokens"])

# Slice full-sequence activations for this prompt. This is only valid if the
# prompt is stored in the shard loaded above (row["shard_index"] == 0 here).
prompt_acts = layer_0[tok_off : tok_off + num_tok]  # (num_tokens, hidden_dim)
```
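The standalone steps above can be wrapped in two small helpers. This is a minimal sketch, not lmprobe's own API: `shard_path` and `last_token_activations` are hypothetical names, the path follows the `hidden_layer{L:03d}_shard{S:03d}.safetensors` pattern quoted above, and the pooling step assumes each shard stores unpooled activations as a flat `(total_tokens, hidden_dim)` array with per-prompt offsets and lengths taken from the index. It uses a toy NumPy array in place of a real shard.

```python
import numpy as np


def shard_path(layer: int, shard: int, root: str = "tensors") -> str:
    """Build a per-layer shard path from the documented naming pattern."""
    return f"{root}/hidden_layer{layer:03d}_shard{shard:03d}.safetensors"


def last_token_activations(shard_acts: np.ndarray, offsets, lengths) -> np.ndarray:
    """Return the final token's activation vector for each prompt in one shard.

    shard_acts: (total_tokens, hidden_dim) array for a single layer and shard.
    offsets, lengths: per-prompt start row and token count from the index.
    """
    offsets = np.asarray(offsets, dtype=np.int64)
    lengths = np.asarray(lengths, dtype=np.int64)
    last_rows = offsets + lengths - 1  # row of each prompt's last token
    return shard_acts[last_rows]


# Toy stand-in for a shard: 6 tokens with hidden_dim 2, holding two prompts
# of 2 and 4 tokens. Real shards would come from safe_open(...).get_tensor(...).
toy = np.arange(12, dtype=np.float32).reshape(6, 2)
pooled = last_token_activations(toy, offsets=[0, 2], lengths=[2, 4])

print(shard_path(16, 3))  # tensors/hidden_layer016_shard003.safetensors
print(pooled.shape)       # (2, 2): one vector per prompt
```

Last-token pooling is only one common choice for probe inputs; mean pooling over `shard_acts[off : off + n]` would follow the same indexing.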
## Load with HF Datasets
```python
from datasets import load_dataset
# Shows prompt text + labels in Dataset Viewer
ds = load_dataset("latent-lab/got-activations-llama3.1-405b-base")
print(ds["train"][0]) # {"text": "...", "label": ..., ...}
```
## Provenance
- **lmprobe version:** 0.8.0
- **Extraction backend:** ndif
- **Created:** 2025-03-19T00:00:00+00:00
- **PyTorch:** unknown
- **Transformers:** unknown
Provider: latent-lab



