xycoord/deception-probes-activations

Name: xycoord/deception-probes-activations
Creator: xycoord
Published: 2026-04-09 20:28:11
License: no description available

Source: Hugging Face. Updated: 2026-04-09. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/xycoord/deception-probes-activations

Description:

---
license: other
license_name: mixed
license_link: LICENSE
task_categories:
- text-classification
tags:
- deception
- mechanistic-interpretability
- activations
- probing
- safety
- alignment
language:
- en
size_categories:
- 10K<n<100K
configs:
- config_name: apollo_probe_pairs
  data_files:
  - split: train
    path: "train/apollo_probe_pairs/**/metadata.jsonl"
- config_name: controlled_taxonomy
  data_files:
  - split: train
    path: "train/controlled_taxonomy/**/metadata.jsonl"
  - split: validation
    path: "val/controlled_taxonomy/**/metadata.jsonl"
- config_name: liars_bench_convincing
  data_files:
  - split: test
    path: "eval/liars_bench_convincing/**/metadata.jsonl"
- config_name: liars_bench_instructed
  data_files:
  - split: test
    path: "eval/liars_bench_instructed/**/metadata.jsonl"
- config_name: liars_bench_insider_trading
  data_files:
  - split: test
    path: "eval/liars_bench_insider_trading/**/metadata.jsonl"
- config_name: liars_bench_alpaca
  data_files:
  - split: test
    path: "eval/liars_bench_alpaca/**/metadata.jsonl"
- config_name: liars_bench_harm_pressure_choice
  data_files:
  - split: test
    path: "eval/liars_bench_harm_pressure_choice/**/metadata.jsonl"
- config_name: liars_bench_harm_pressure_knowledge
  data_files:
  - split: test
    path: "eval/liars_bench_harm_pressure_knowledge/**/metadata.jsonl"
---

# Deception Probes Activations

Pre-extracted residual-stream activations for training and evaluating deception detection probes on LLMs. Each example contains per-token hidden states from a specific transformer layer, saved in bfloat16 safetensors format.

## License

This dataset contains activations derived from multiple sources with different licenses. See the [LICENSE](LICENSE) file for full details.

| Component | Source | License |
|-----------|--------|---------|
| Apollo Probe Pairs (statements) | [Azaria & Mitchell (2023)](https://arxiv.org/abs/2304.13734) | CC BY-NC-ND 4.0 |
| Controlled Taxonomy | Custom prompts + Azaria & Mitchell facts | CC BY-NC-ND 4.0 |
| Liar's Bench: Convincing Game | [Cadenza Labs](https://huggingface.co/datasets/Cadenza-Labs/liars-bench) | CC BY 4.0 |
| Liar's Bench: Instructed Deception | Cadenza Labs | Academic fair use (see LICENSE) |
| Liar's Bench: Insider Trading | Cadenza Labs | CC BY 4.0 |
| Liar's Bench: Alpaca | Cadenza Labs (from [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)) | MIT |
| Liar's Bench: Harm-Pressure Choice | Cadenza Labs | CC BY 4.0 |
| Liar's Bench: Harm-Pressure Knowledge | Cadenza Labs | CC BY 4.0 |

**Due to the CC BY-NC-ND 4.0 license on the Azaria & Mitchell data (used in Apollo Probe Pairs and Controlled Taxonomy), this dataset as a whole should be treated as non-commercial use only.**

## Models & Layers

| Model | HF ID | Layers Available | Hidden Dim | Data |
|-------|-------|-----------------|------------|------|
| Gemma 3 27B IT | `google/gemma-3-27b-it` | 31 | 5376 | train, eval |
| Llama 3.3 70B Instruct | `meta-llama/Llama-3.3-70B-Instruct` | 20 | 8192 | train, val, eval |

## Datasets

### Training Data (`train/`)

#### Apollo Probe Pairs

Contrastive activations from the [Apollo Research](https://arxiv.org/abs/2407.15285) methodology. 12 prompt pairs with honest/dishonest system instructions applied to 306 factual statements from the [Azaria & Mitchell](https://arxiv.org/abs/2304.13734) dataset. **Statement tokens only** (the factual claim, not the system prompt).

| Model | Layer | Examples | Truthful | Deceptive |
|-------|-------|----------|----------|-----------|
| Gemma 3 27B | 31 | 7,344 | 3,672 | 3,672 |
| Llama 3.3 70B | 20 | 7,344 | 3,672 | 3,672 |

#### Controlled Taxonomy

Confound-controlled training data with 16 deception types × 2 conditions (honest/dishonest). Facts are distributed round-robin across prompt variants so each fact appears exactly once per deception type, eliminating content as a confound. **Statement tokens only**, placed in the pre-filled assistant turn.

| Model | Layer | Split | Examples |
|-------|-------|-------|----------|
| Llama 3.3 70B | 20 | train | ~9,760 |
| Llama 3.3 70B | 20 | val | ~4,896 |

### Evaluation Data (`eval/`)

Activations from [Liar's Bench](https://huggingface.co/datasets/Cadenza-Labs/liars-bench) (Cadenza Labs). Each subset uses **on-policy completions only** (filtered by model). **Response tokens only** (the model's reply, not the prompt).

| Subset | Description | Llama Examples | Gemma Examples |
|--------|-------------|---------------|----------------|
| Convincing Game | Social deception: convince an interrogator | 888 | 621 |
| Instructed Deception | Explicitly told to lie or tell the truth | 5,494 | 5,196 |
| Insider Trading | Strategic deception in a financial scenario | 1,080 | 3,557 |
| Alpaca | Non-deceptive calibration data (all neutral) | 2,000 | 2,000 |
| Harm-Pressure Choice | Deception under pressure | 2,134 | n/a |
| Harm-Pressure Knowledge | Deception under pressure | 2,139 | n/a |

### Deprecated Data (`deprecated/`)

Collections with a known system prompt bug. Preserved for reproducibility. See [`deprecated/README.md`](deprecated/README.md) for details.
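The round-robin distribution described under Controlled Taxonomy can be illustrated with a minimal sketch. This is not the dataset author's code: the function name `assign_round_robin`, the placeholder facts and type names, the number of prompt variants, and the per-type offset are all assumptions made for illustration. The sketch only shows the property the card states, namely that each fact appears exactly once per deception type while the prompt variants are cycled.

```python
from collections import Counter


def assign_round_robin(facts, deception_types, variants_per_type):
    """Pair every fact with every deception type, cycling through prompt variants.

    Hypothetical helper: the real number of variants per type is not stated
    in the dataset card.
    """
    assignments = []
    for t_idx, dtype in enumerate(deception_types):
        for f_idx, fact in enumerate(facts):
            # Offset by the type index so a fact is not always given variant 0.
            variant = (f_idx + t_idx) % variants_per_type
            assignments.append(
                {"fact": fact, "deception_type": dtype, "variant": variant}
            )
    return assignments


facts = [f"fact_{i}" for i in range(6)]   # placeholder facts
types = ["type_a", "type_b"]              # placeholder deception types
rows = assign_round_robin(facts, types, variants_per_type=3)

# Count how often each (fact, deception type) pair and each
# (deception type, variant) pair occurs.
pair_counts = Counter((r["fact"], r["deception_type"]) for r in rows)
variant_counts = Counter((r["deception_type"], r["variant"]) for r in rows)
print(len(rows), set(pair_counts.values()), set(variant_counts.values()))
```

Because every fact is used under every deception type, a probe cannot separate types by memorising which facts appear where; the variant cycling spreads the facts evenly over the prompt wordings.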
## Directory Structure

```
├── train/                          # Probe training data
│   ├── apollo_probe_pairs/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   └── controlled_taxonomy/
│       └── llama-3.3-70b-instruct/layer_20/
│
├── val/                            # Validation data
│   └── controlled_taxonomy/
│       └── llama-3.3-70b-instruct/layer_20/
│
├── eval/                           # Evaluation / test data
│   ├── liars_bench_convincing/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_instructed/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_insider_trading/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_alpaca/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_harm_pressure_choice/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   └── liars_bench_harm_pressure_knowledge/
│       └── llama-3.3-70b-instruct/layer_20/
│
└── deprecated/                     # Buggy collections (preserved)
    ├── v0_gemma_l31_liars_bench/
    ├── v0_llama_l20_apollo/
    ├── v0_llama_l20_liars_bench/
    ├── v0_llama_l22_apollo/
    └── v0_llama_l22_liars_bench/
```

### Path Pattern

```
{split}/{dataset_name}/{model}/{layer_N}/activations/*.safetensors
{split}/{dataset_name}/{model}/{layer_N}/metadata.jsonl
```

## File Format

### Safetensors

Each safetensors file contains multiple examples, keyed by `example_id`. Each tensor has shape `(n_tokens, hidden_dim)` in bfloat16.

### Metadata (JSONL)

One JSON object per example with fields:

| Field | Description |
|-------|-------------|
| `dataset` | Dataset name (e.g. `"apollo_probe_pairs"`, `"liars_bench_instructed"`) |
| `model` | Model short name (`"gemma-3-27b-it"` or `"llama-3.3-70b-instruct"`) |
| `layer` | Layer index |
| `split` | `"train"`, `"val"`, or `"test"` |
| `example_id` | Unique ID, also the tensor key in the safetensors file |
| `label` | `"truthful"`, `"deceptive"`, or `"neutral"` |
| `text` | The input text (statement or model response) |
| `token_info` | `{"type": "statement_tokens" or "response_tokens", "n_tokens": int, "hidden_dim": int}` |
| `activation_file` | Relative path to the safetensors file containing this example |

Apollo examples also include `pair_key`, `side`, and `system_prompt`. Controlled taxonomy examples also include `deception_type` and `condition`.

## Quick Start

```python
import json
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

repo_id = "xycoord/deception-probes-activations"

# Download metadata for Llama Apollo training data
meta_path = hf_hub_download(
    repo_id,
    "train/apollo_probe_pairs/llama-3.3-70b-instruct/layer_20/metadata.jsonl",
    repo_type="dataset",
)
with open(meta_path) as f:
    examples = [json.loads(line) for line in f]

# Download and load activations
act_path = hf_hub_download(
    repo_id,
    "train/apollo_probe_pairs/llama-3.3-70b-instruct/layer_20/activations/baseline_apollo_0_honest.safetensors",
    repo_type="dataset",
)
tensors = load_file(act_path)
# tensors["baseline_apollo_0_honest_0"].shape == (n_tokens, 8192)
```
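Since each safetensors file holds many examples keyed by `example_id`, it can help to group metadata rows by `activation_file` so that every file is downloaded and opened only once. The following is a minimal sketch under stated assumptions: the helper name `group_by_activation_file` and the sample rows are made up for illustration, and only the three documented fields `example_id`, `label`, and `activation_file` are used.

```python
import json
from collections import defaultdict


def group_by_activation_file(jsonl_lines):
    """Map each activation file to the (example_id, label) pairs stored in it."""
    groups = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        groups[record["activation_file"]].append(
            (record["example_id"], record["label"])
        )
    return dict(groups)


# Synthetic rows for illustration; real rows come from metadata.jsonl
# and carry more fields than the three used here.
sample_lines = [
    json.dumps({"example_id": "ex_0", "label": "truthful",
                "activation_file": "activations/a.safetensors"}),
    json.dumps({"example_id": "ex_1", "label": "deceptive",
                "activation_file": "activations/a.safetensors"}),
    json.dumps({"example_id": "ex_2", "label": "truthful",
                "activation_file": "activations/b.safetensors"}),
]

groups = group_by_activation_file(sample_lines)
for path, items in groups.items():
    print(path, [example_id for example_id, _ in items])
```

With real data, each key would be resolved to a repository path and passed to `hf_hub_download` and `load_file` as in the Quick Start, after which the grouped `example_id` values index into the loaded tensors. The card describes `activation_file` only as a relative path, so whether it is relative to the layer directory or to the repository root should be checked against an actual metadata row.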
