
latent-lab/got-activations-llama3.1-405b-base

Source: Hugging Face | Updated: 2026-03-27 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/latent-lab/got-activations-llama3.1-405b-base
Resource description:
---
tags:
- lmprobe
- activations
- interpretability
- meta-llama-llama-3.1-405b
task_categories:
- feature-extraction
language:
- en
license: cc-by-4.0
---

# meta-llama/Llama-3.1-405B: Activation Dataset

Cached activations extracted from [`meta-llama/Llama-3.1-405B`](https://huggingface.co/meta-llama/Llama-3.1-405B) (revision `unknown`).

## Contents

| Tensor | Layers | Dim | Pooling | Shards | Row Bytes |
|--------|--------|-----|---------|--------|-----------|
| hidden_layers | 0-125 | 16384 | - | 12 | - |

- **Prompts:** 7660
- **Format version:** 1.1

## Load with lmprobe

```python
from lmprobe import pull_dataset, load_activation_dataset

# Option 1: Pull into local cache (enables probe training without re-extraction)
pull_dataset("latent-lab/got-activations-llama3.1-405b-base")

# Option 2: Load tensors directly
tensors, info = load_activation_dataset("latent-lab/got-activations-llama3.1-405b-base")
# tensors["hidden.layer_16"].shape => (N, hidden_dim)
```

## Load without lmprobe (standalone)

```python
import json

import pyarrow.parquet as pq
from safetensors import safe_open

# 1. Read the Parquet index
index = pq.read_table("index/train-00000-of-00001.parquet").to_pandas()
print(index.columns)  # text, label, num_tokens, shard_index, row_offset

# 2. Read tensor metadata
with open("lmprobe_info.json") as f:
    info = json.load(f)
print(list(info["tensors"].keys()))  # e.g. ["hidden_layers", "logits_topk"]

# 3. Load a shard (per-layer files: hidden_layer{L:03d}_shard{S:03d}.safetensors)
with safe_open("tensors/hidden_layer000_shard000.safetensors", framework="pt") as f:
    print(f.keys())  # e.g. ["hidden.layer_0"]
    layer_0 = f.get_tensor("hidden.layer_0")

# 4. Map prompt index -> shard row
row = index.iloc[42]
# NOTE: the column list in step 1 names `row_offset`, while this line reads
# `token_offset`; check index.columns and use whichever column is present.
# layer_0 comes from shard 000, so this slice presumably only applies to
# prompts whose shard_index is 0.
tok_off, num_tok = row["token_offset"], row["num_tokens"]

# Slice full-sequence activations for this prompt
prompt_acts = layer_0[tok_off : tok_off + num_tok]  # (num_tokens, hidden_dim)
```

## Load with HF Datasets

```python
from datasets import load_dataset

# Shows prompt text + labels in Dataset Viewer
ds = load_dataset("latent-lab/got-activations-llama3.1-405b-base")
print(ds["train"][0])  # {"text": "...", "label": ..., ...}
```

## Provenance

- **lmprobe version:** 0.8.0
- **Extraction backend:** ndif
- **Created:** 2025-03-19T00:00:00+00:00
- **PyTorch:** unknown
- **Transformers:** unknown
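
## Sketch: shard paths and prompt slicing

The standalone loading steps hard-code layer 0 and shard 0. The snippet below is a minimal sketch of how the same steps might be generalised, not part of lmprobe. The helper names `shard_filename` and `slice_prompt` are hypothetical, and it assumes that the offset stored in the index counts token rows from the start of the prompt's own shard, which the card does not state. A small NumPy array stands in for a real shard so the example runs without downloading anything.

```python
import numpy as np


def shard_filename(layer: int, shard: int) -> str:
    """Build a per-layer shard path from the naming pattern given in the card."""
    return f"tensors/hidden_layer{layer:03d}_shard{shard:03d}.safetensors"


def slice_prompt(shard_acts: np.ndarray, offset: int, num_tokens: int) -> np.ndarray:
    """Return the (num_tokens, hidden_dim) rows of one prompt, with bounds checks.

    Assumes `offset` is relative to the start of this shard (an assumption,
    not something the card confirms).
    """
    if offset < 0 or num_tokens < 0:
        raise ValueError("offset and num_tokens must be non-negative")
    end = offset + num_tokens
    if end > shard_acts.shape[0]:
        raise IndexError("prompt extends past the end of this shard")
    return shard_acts[offset:end]


# Synthetic stand-in for one shard: 10 token rows, hidden_dim 4.
# The real tensors have hidden_dim 16384.
fake_shard = np.arange(40, dtype=np.float32).reshape(10, 4)

acts = slice_prompt(fake_shard, offset=3, num_tokens=2)
pooled = acts.mean(axis=0)  # mean over tokens -> one vector per prompt

print(shard_filename(16, 2))
print(acts.shape, pooled.shape)
```

With real data, the array passed to `slice_prompt` would be the tensor read from the file that `shard_filename(layer, row["shard_index"])` points to. Mean pooling is only one way to reduce a prompt to a single vector; the card lists no pooling for `hidden_layers`, so the choice is left to the user.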

Provider:
latent-lab