xycoord/deception-probes-activations
Hugging Face. Updated 2026-04-09; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/xycoord/deception-probes-activations
Resource description:
---
license: other
license_name: mixed
license_link: LICENSE
task_categories:
- text-classification
tags:
- deception
- mechanistic-interpretability
- activations
- probing
- safety
- alignment
language:
- en
size_categories:
- 10K<n<100K
configs:
- config_name: apollo_probe_pairs
  data_files:
  - split: train
    path: "train/apollo_probe_pairs/**/metadata.jsonl"
- config_name: controlled_taxonomy
  data_files:
  - split: train
    path: "train/controlled_taxonomy/**/metadata.jsonl"
  - split: validation
    path: "val/controlled_taxonomy/**/metadata.jsonl"
- config_name: liars_bench_convincing
  data_files:
  - split: test
    path: "eval/liars_bench_convincing/**/metadata.jsonl"
- config_name: liars_bench_instructed
  data_files:
  - split: test
    path: "eval/liars_bench_instructed/**/metadata.jsonl"
- config_name: liars_bench_insider_trading
  data_files:
  - split: test
    path: "eval/liars_bench_insider_trading/**/metadata.jsonl"
- config_name: liars_bench_alpaca
  data_files:
  - split: test
    path: "eval/liars_bench_alpaca/**/metadata.jsonl"
- config_name: liars_bench_harm_pressure_choice
  data_files:
  - split: test
    path: "eval/liars_bench_harm_pressure_choice/**/metadata.jsonl"
- config_name: liars_bench_harm_pressure_knowledge
  data_files:
  - split: test
    path: "eval/liars_bench_harm_pressure_knowledge/**/metadata.jsonl"
---
# Deception Probes Activations
Pre-extracted residual-stream activations for training and evaluating deception
detection probes on LLMs. Each example contains per-token hidden states from a
specific transformer layer, saved in bfloat16 safetensors format.
## License
This dataset contains activations derived from multiple sources with different licenses.
See the [LICENSE](LICENSE) file for full details.
| Component | Source | License |
|-----------|--------|---------|
| Apollo Probe Pairs (statements) | [Azaria & Mitchell (2023)](https://arxiv.org/abs/2304.13734) | CC BY-NC-ND 4.0 |
| Controlled Taxonomy | Custom prompts + Azaria & Mitchell facts | CC BY-NC-ND 4.0 |
| Liar's Bench — Convincing Game | [Cadenza Labs](https://huggingface.co/datasets/Cadenza-Labs/liars-bench) | CC BY 4.0 |
| Liar's Bench — Instructed Deception | Cadenza Labs | Academic fair use (see LICENSE) |
| Liar's Bench — Insider Trading | Cadenza Labs | CC BY 4.0 |
| Liar's Bench — Alpaca | Cadenza Labs (from [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)) | MIT |
| Liar's Bench — Harm-Pressure Choice | Cadenza Labs | CC BY 4.0 |
| Liar's Bench — Harm-Pressure Knowledge | Cadenza Labs | CC BY 4.0 |
**Due to the CC BY-NC-ND 4.0 license on the Azaria & Mitchell data (used in Apollo
Probe Pairs and Controlled Taxonomy), this dataset as a whole should be treated as
non-commercial use only.**
## Models & Layers
| Model | HF ID | Layers Available | Hidden Dim | Data |
|-------|-------|-----------------|------------|------|
| Gemma 3 27B IT | `google/gemma-3-27b-it` | 31 | 5376 | train, eval |
| Llama 3.3 70B Instruct | `meta-llama/Llama-3.3-70B-Instruct` | 20 | 8192 | train, val, eval |
## Datasets
### Training Data (`train/`)
#### Apollo Probe Pairs
Contrastive activations from the [Apollo Research](https://arxiv.org/abs/2407.15285)
methodology. 12 prompt pairs with honest/dishonest system instructions applied to 306
factual statements from the [Azaria & Mitchell](https://arxiv.org/abs/2304.13734)
dataset. **Statement tokens only** (the factual claim, not the system prompt).
| Model | Layer | Examples | Truthful | Deceptive |
|-------|-------|----------|----------|-----------|
| Gemma 3 27B | 31 | 7,344 | 3,672 | 3,672 |
| Llama 3.3 70B | 20 | 7,344 | 3,672 | 3,672 |
#### Controlled Taxonomy
Confound-controlled training data with 16 deception types × 2 conditions
(honest/dishonest). Facts are distributed round-robin across prompt variants
so each fact appears exactly once per deception type, eliminating content as
a confound. **Statement tokens only**, placed in the pre-filled assistant turn.
| Model | Layer | Split | Examples |
|-------|-------|-------|----------|
| Llama 3.3 70B | 20 | train | ~9,760 |
| Llama 3.3 70B | 20 | val | ~4,896 |
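
The round-robin assignment described above might look like the following sketch. The number of prompt variants per deception type, the per-type rotation offset, and every name in the code are assumptions for illustration; the script that actually generated this data is not shown in this card.

```python
from collections import Counter


def round_robin_assign(facts, deception_types, n_variants):
    """Return (deception_type, variant_index, fact) rows.

    Every fact is used exactly once per deception type, and the facts are
    spread across the prompt variants in rotation.
    """
    rows = []
    for t_idx, dtype in enumerate(deception_types):
        for f_idx, fact in enumerate(facts):
            # Hypothetical choice: rotate the starting variant per type so a
            # given fact is not always paired with the same variant index.
            variant = (f_idx + t_idx) % n_variants
            rows.append((dtype, variant, fact))
    return rows


# Toy inputs (made up): 6 facts, 2 deception types, 3 prompt variants each.
facts = [f"fact_{i}" for i in range(6)]
deception_types = ["type_a", "type_b"]
rows = round_robin_assign(facts, deception_types, n_variants=3)

# How often each (type, fact) and each (type, variant) combination occurs.
per_type_fact = Counter((dtype, fact) for dtype, _, fact in rows)
per_type_variant = Counter((dtype, variant) for dtype, variant, _ in rows)
```

Because each fact contributes one row to every deception type, a probe cannot separate types by memorising which facts appear where.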
### Evaluation Data (`eval/`)
Activations from [Liar's Bench](https://huggingface.co/datasets/Cadenza-Labs/liars-bench)
(Cadenza Labs). Each subset uses **on-policy completions only**: the benchmark is
filtered by model, so each model's activations come from its own completions.
Activations cover **response tokens only** (the model's reply, not the prompt).
| Subset | Description | Llama Examples | Gemma Examples |
|--------|-------------|---------------|----------------|
| Convincing Game | Social deception: convince an interrogator | 888 | 621 |
| Instructed Deception | Explicitly told to lie or tell the truth | 5,494 | 5,196 |
| Insider Trading | Strategic deception in a financial scenario | 1,080 | 3,557 |
| Alpaca | Non-deceptive calibration data (all neutral) | 2,000 | 2,000 |
| Harm-Pressure Choice | Deception under pressure | 2,134 | — |
| Harm-Pressure Knowledge | Deception under pressure | 2,139 | — |
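
One way to use a neutral calibration set such as the Alpaca subset is to pick a decision threshold that gives a fixed false-positive rate on neutral examples. The sketch below assumes one probe score per example, with higher meaning more deceptive. The 1% target rate, the function name `threshold_at_fpr`, and the scores are illustrative assumptions, not values from this dataset.

```python
import math


def threshold_at_fpr(neutral_scores, target_fpr=0.01):
    """Return a threshold t such that at most `target_fpr` of the neutral
    scores are strictly greater than t."""
    if not neutral_scores:
        raise ValueError("neutral_scores must not be empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(neutral_scores)
    n = len(ordered)
    # Number of neutral examples we allow above the threshold. The small
    # epsilon guards against float round-off in target_fpr * n.
    allowed = min(math.floor(target_fpr * n + 1e-9), n - 1)
    # Exactly `allowed` scores sit after this index in sorted order.
    return ordered[n - allowed - 1]


# Made-up neutral scores: 0.00, 0.01, ..., 0.99.
neutral_scores = [i / 100 for i in range(100)]
threshold = threshold_at_fpr(neutral_scores, target_fpr=0.01)
false_positives = sum(score > threshold for score in neutral_scores)
```

The same threshold can then be applied to scores from the deceptive and truthful subsets to report recall at that false-positive rate.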
### Deprecated Data (`deprecated/`)
Collections with a known system prompt bug. Preserved for reproducibility.
See [`deprecated/README.md`](deprecated/README.md) for details.
## Directory Structure
```
├── train/                                   # Probe training data
│   ├── apollo_probe_pairs/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   └── controlled_taxonomy/
│       └── llama-3.3-70b-instruct/layer_20/
│
├── val/                                     # Validation data
│   └── controlled_taxonomy/
│       └── llama-3.3-70b-instruct/layer_20/
│
├── eval/                                    # Evaluation / test data
│   ├── liars_bench_convincing/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_instructed/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_insider_trading/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_alpaca/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_harm_pressure_choice/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   └── liars_bench_harm_pressure_knowledge/
│       └── llama-3.3-70b-instruct/layer_20/
│
└── deprecated/                              # Buggy collections (preserved)
    ├── v0_gemma_l31_liars_bench/
    ├── v0_llama_l20_apollo/
    ├── v0_llama_l20_liars_bench/
    ├── v0_llama_l22_apollo/
    └── v0_llama_l22_liars_bench/
```
### Path Pattern
```
{split}/{dataset_name}/{model}/{layer_N}/activations/*.safetensors
{split}/{dataset_name}/{model}/{layer_N}/metadata.jsonl
```
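
The pattern above can be wrapped in a small helper that builds repo-relative paths. This is a sketch: the function names are hypothetical, and any shard name passed in must match a file that actually exists in the repository. Note that the first path component is the top-level directory (`train`, `val`, or `eval`), which appears to differ from the `split` value stored in the metadata for evaluation data.

```python
from pathlib import PurePosixPath


def layer_dir(top_dir: str, dataset_name: str, model: str, layer: int) -> PurePosixPath:
    """Directory holding one (dataset, model, layer) collection."""
    return PurePosixPath(top_dir) / dataset_name / model / f"layer_{layer}"


def metadata_path(top_dir: str, dataset_name: str, model: str, layer: int) -> str:
    """Repo-relative path to the collection's metadata.jsonl."""
    return str(layer_dir(top_dir, dataset_name, model, layer) / "metadata.jsonl")


def activation_path(top_dir: str, dataset_name: str, model: str, layer: int,
                    shard_name: str) -> str:
    """Repo-relative path to one safetensors shard (name given without extension)."""
    shard = f"{shard_name}.safetensors"
    return str(layer_dir(top_dir, dataset_name, model, layer) / "activations" / shard)


meta = metadata_path("train", "apollo_probe_pairs", "llama-3.3-70b-instruct", 20)
shard = activation_path("train", "apollo_probe_pairs", "llama-3.3-70b-instruct", 20,
                        "baseline_apollo_0_honest")
```

The resulting strings can be passed as the filename argument to `hf_hub_download`.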
## File Format
### Safetensors
Each safetensors file contains multiple examples, keyed by `example_id`.
Each tensor has shape `(n_tokens, hidden_dim)` in bfloat16.
### Metadata (JSONL)
One JSON object per example with fields:
| Field | Description |
|-------|-------------|
| `dataset` | Dataset name (e.g. `"apollo_probe_pairs"`, `"liars_bench_instructed"`) |
| `model` | Model short name (`"gemma-3-27b-it"` or `"llama-3.3-70b-instruct"`) |
| `layer` | Layer index |
| `split` | `"train"`, `"val"`, or `"test"` |
| `example_id` | Unique ID, also the tensor key in the safetensors file |
| `label` | `"truthful"`, `"deceptive"`, or `"neutral"` |
| `text` | The input text (statement or model response) |
| `token_info` | `{"type": "statement_tokens" or "response_tokens", "n_tokens": int, "hidden_dim": int}` |
| `activation_file` | Relative path to the safetensors file containing this example |
Apollo examples also include `pair_key`, `side`, and `system_prompt`.
Controlled taxonomy examples also include `deception_type` and `condition`.
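
Since each safetensors file holds several examples, it can help to group metadata records by `activation_file` so that every shard is opened only once. The sketch below parses a few made-up JSONL lines; only the field names come from the table above, while the IDs, file names, token counts, and helper names are invented for illustration.

```python
import json
from collections import Counter, defaultdict

# Made-up metadata lines that mimic the documented fields.
SAMPLE_JSONL = """\
{"example_id": "ex_0", "label": "truthful", "activation_file": "activations/shard_a.safetensors", "token_info": {"type": "statement_tokens", "n_tokens": 7, "hidden_dim": 8192}}
{"example_id": "ex_1", "label": "deceptive", "activation_file": "activations/shard_a.safetensors", "token_info": {"type": "statement_tokens", "n_tokens": 9, "hidden_dim": 8192}}
{"example_id": "ex_2", "label": "truthful", "activation_file": "activations/shard_b.safetensors", "token_info": {"type": "statement_tokens", "n_tokens": 5, "hidden_dim": 8192}}
"""


def parse_metadata(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def group_by_shard(records):
    """Map each activation file to the example IDs (tensor keys) it contains."""
    shards = defaultdict(list)
    for record in records:
        shards[record["activation_file"]].append(record["example_id"])
    return dict(shards)


records = parse_metadata(SAMPLE_JSONL)
label_counts = Counter(record["label"] for record in records)
shards = group_by_shard(records)
total_tokens = sum(record["token_info"]["n_tokens"] for record in records)
```

Iterating over `shards` then gives one file to load per step, together with the tensor keys to read from it.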
## Quick Start
```python
import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo_id = "xycoord/deception-probes-activations"

# Download metadata for Llama Apollo training data
meta_path = hf_hub_download(
    repo_id,
    "train/apollo_probe_pairs/llama-3.3-70b-instruct/layer_20/metadata.jsonl",
    repo_type="dataset",
)
# One JSON object per line, one line per example
with open(meta_path) as f:
    examples = [json.loads(line) for line in f]

# Download and load activations
act_path = hf_hub_download(
    repo_id,
    "train/apollo_probe_pairs/llama-3.3-70b-instruct/layer_20/activations/baseline_apollo_0_honest.safetensors",
    repo_type="dataset",
)
# Dict mapping example_id -> bfloat16 tensor
tensors = load_file(act_path)
# tensors["baseline_apollo_0_honest_0"].shape == (n_tokens, 8192)
```
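
Once tensors are loaded, a typical next step is to reduce each `(n_tokens, hidden_dim)` tensor to one vector per example and fit a linear probe. The sketch below substitutes random stand-in tensors for the output of `load_file` so that it runs offline. Mean pooling, the tiny hidden size, the optimizer settings, and the helper names are assumptions for illustration, not the procedure behind any published result.

```python
import torch


def mean_pool(tensors, example_ids):
    """Average each example over the token axis, casting bfloat16 to float32."""
    return torch.stack([tensors[eid].float().mean(dim=0) for eid in example_ids])


def fit_linear_probe(features, labels, steps=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch Adam updates."""
    probe = torch.nn.Linear(features.shape[1], 1)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(probe(features).squeeze(-1), labels)
        loss.backward()
        optimizer.step()
    return probe


torch.manual_seed(0)
hidden_dim = 16  # stand-in size; the real tensors use 5376 or 8192

# Synthetic stand-ins for `tensors` and `examples` from the Quick Start.
tensors, examples = {}, []
for i in range(40):
    label = "deceptive" if i % 2 else "truthful"
    shift = 1.0 if label == "deceptive" else -1.0
    n_tokens = 3 + i % 5  # variable sequence length, as in the real data
    acts = torch.randn(n_tokens, hidden_dim) + shift
    tensors[f"ex_{i}"] = acts.to(torch.bfloat16)
    examples.append({"example_id": f"ex_{i}", "label": label})

ids = [ex["example_id"] for ex in examples]
features = mean_pool(tensors, ids)
labels = torch.tensor([1.0 if ex["label"] == "deceptive" else 0.0 for ex in examples])

probe = fit_linear_probe(features, labels)
with torch.no_grad():
    predictions = (probe(features).squeeze(-1) > 0).float()
    accuracy = (predictions == labels).float().mean().item()
```

With the real data, `tensors` and `examples` would come from the Quick Start snippet, and accuracy should be measured on a held-out split rather than on the training examples as done here.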
Provider:
xycoord



