Atlas3D/character-steering-research
Source: Hugging Face · Updated: 2026-02-28 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Atlas3D/character-steering-research
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- personality-steering
- activation-engineering
- interpretability
- mechanistic-interpretability
- sarcasm
- connectome
- neuron-probing
- debate
- qwen
pretty_name: Character Steering Research Data
size_categories:
- 100K<n<1M
---
# Character Steering Research Data
Research datasets from a two-week investigation into personality steering in large language models. We mapped how personality traits (sarcasm, character voice, reasoning style) are represented, and how they can be steered, in Qwen3-VL-8B, Qwen3.5-27B, and GPT-OSS-20B.
**GitHub**: [Atlas3DSS/Character-Creation](https://github.com/Atlas3DSS/Character-Creation)
## Datasets
### 1. `prompts/` — Evaluation & Spectral Analysis Prompts
| File | Count | Description |
|------|-------|-------------|
| `math_prompts_10k.json` | 10,001 | Math problems with verified answers across 9 categories (arithmetic, algebra, geometry, combinatorics, modular arithmetic, division, sequences, percentages, word problems) |
| `sarc_prompts_10k.json` | 10,001 | Sarcasm-eliciting prompts across 20 categories (naive help requests, opinion questions, workplace humor, tech support, provocations, etc.) |
| `prompts_100k.jsonl` | 100,000 | Multi-category prompts (math_reasoning, general_knowledge, provocations, casual_conversation, family_interactions) |
**Usage:**
```python
import json

# Each entry: {"prompt": "What is 17 × 23?", "answer": "391", "category": "arithmetic"}
with open("prompts/math_prompts_10k.json", encoding="utf-8") as f:
    math_prompts = json.load(f)

# List of prompt strings designed to elicit sarcastic responses
with open("prompts/sarc_prompts_10k.json", encoding="utf-8") as f:
    sarc_prompts = json.load(f)
```
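The third file, `prompts_100k.jsonl`, uses the JSON Lines format (one JSON object per line) and is not covered by the snippet above. The following is a minimal sketch of streaming it record by record. The field name `category` is an assumption borrowed from the math prompt schema, not a documented field of this file, so inspect a few lines before relying on it.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_field(records, field="category"):
    """Count records per value of `field` (an assumed name).

    Records that are not dicts, or that lack the field, are counted under None.
    """
    return Counter(
        rec.get(field) if isinstance(rec, dict) else None for rec in records
    )


# Example usage (path taken from the table above):
# counts = count_by_field(iter_jsonl("prompts/prompts_100k.jsonl"))
# print(counts.most_common())
```

Streaming keeps memory use flat, which matters more for the 100,000-line file than for the two 10k JSON files.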
### 2. `markers/` — Sarcasm & Assistant Behavior Detection
| File | Description |
|------|-------------|
| `sarcasm_markers.json` | 1,328 sarcasm markers across 17 categories + 208 assistant behavior markers |
Categories include: direct_insults, sarcastic_hedges, false_agreement, rhetorical_questions, understatement, hyperbole, condescension, dark_humor, and more. Useful for automated evaluation of model personality.
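One simple way to use the markers for automated evaluation is to count how many markers from each category appear in a model response. The sketch below assumes the file maps category names to lists of marker strings. That layout is a guess, since the schema of `sarcasm_markers.json` is not described here, and the sample markers are invented for illustration.

```python
import re


def compile_markers(markers_by_category):
    """Build one case-insensitive regex per category from literal marker strings."""
    compiled = {}
    for category, markers in markers_by_category.items():
        if not markers:
            continue
        # Longest first so longer phrases win over their own prefixes.
        ordered = sorted(markers, key=len, reverse=True)
        alternation = "|".join(re.escape(m) for m in ordered)
        # The lookarounds keep "sure" from matching inside "pressure".
        compiled[category] = re.compile(
            rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE
        )
    return compiled


def marker_hits(text, compiled):
    """Return {category: number of non-overlapping marker matches in text}."""
    return {cat: len(pattern.findall(text)) for cat, pattern in compiled.items()}


# Hypothetical sample data; the real markers live in markers/sarcasm_markers.json.
sample_markers = {
    "false_agreement": ["oh sure", "yeah right"],
    "hyperbole": ["literally the worst"],
}
compiled = compile_markers(sample_markers)
hits = marker_hits("Oh sure, that is literally the worst plan. Yeah right.", compiled)
print(hits)
```

Raw counts favor longer responses, so dividing by response length in tokens or words may be worth considering before comparing conditions.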
### 3. `connectome/` — Neural Activation Maps
#### `connectome/qwen3vl_8b/` — Qwen3-VL-8B (36 layers, 4096 hidden)
Full connectome mapping across 20 semantic categories (identity, emotions, tone, domain knowledge, reasoning, safety, roles).
| File | Size | Description |
|------|------|-------------|
| `connectome_zscores.pt` | 12 MB | Z-score tensor: 20 categories x 36 layers x 4096 dimensions |
| `hub_neurons.json` | 12 MB | Per-neuron analysis: active categories, peak layer, peak z-score |
| `layer_importance.json` | 21 KB | Per-layer importance scores for each category |
| `known_neuron_profiles.json` | 37 KB | Named neurons (identity, sarcasm, etc.) with activation signatures |
**Key finding**: Dimension 994 is the identity neuron (z=-13.96 at layer 9). Identity is nearly orthogonal to sarcasm (cosine=-0.0002).
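Both kinds of number in this finding (a peak z-score and a cosine between two trait vectors) can be computed from the z-score tensor. The sketch below is plain Python over nested lists with toy values. It assumes the tensor is indexed as `[category][layer][dimension]`, matching the 20 x 36 x 4096 shape in the table, and that it has been converted to lists first, for example with `torch.load(...)` followed by `.tolist()`. How the identity and sarcasm directions were actually derived is not stated here, so the cosine below is only the generic definition.

```python
import math


def peak_zscore(layers):
    """Return (layer, dimension, z) for the entry with the largest |z|.

    `layers` is one category's slice: a list of layers, each a list of z-scores.
    """
    best = (0, 0, layers[0][0])
    for layer_idx, row in enumerate(layers):
        for dim_idx, z in enumerate(row):
            if abs(z) > abs(best[2]):
                best = (layer_idx, dim_idx, z)
    return best


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-ins (2 layers x 4 dimensions), not real data.
identity = [[0.1, -0.3, 0.2, 0.0], [0.4, -13.96, 0.1, 0.2]]
sarcasm = [[2.0, 0.0, -1.0, 0.5], [1.5, 0.0, 0.3, -2.0]]

print(peak_zscore(identity))
print(cosine(identity[1], sarcasm[1]))
```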
#### `connectome/qwen35_27b/` — Qwen3.5-27B Dense (64 layers, 5120 hidden)
| File | Size | Description |
|------|------|-------------|
| `connectome_stats.json` | 3.3 KB | Summary: peak z-scores, top dimensions per category |
| `fast_scan_results.json` | 3.9 KB | Per-layer steering effectiveness (20 layers) |
**Key finding**: Dimension 2028 is a super-hub (Code z=6.67, Math z=6.19, Sadness z=5.84 — all at layer 50). The 27B model is a "fortress" — no clear generator/suppressor structure for personality.
### 4. `debate_arena/` — Dual-Model Personality Debates
Five complete debate rounds between two copies of Qwen3-VL-8B, each given a different personality prompt. Each round has 20 turns, a fresh personality pair, and a fresh topic.
**30 personalities** including: chinese_only_nationalist, socratic_philosopher, flat_earther, devout_christian, libertarian_purist, eco_activist, conspiracy_theorist, helpful_assistant, cold_scientist, and more.
**Per round:**
- `transcript.json` — Full dialogue with per-layer cosine similarity between models
- `config.json` — Personality assignments, topic, temperature settings
- `analysis/per_turn_cosine.json` — Activation-space similarity trajectories (36 layers)
- `analysis/personality_fingerprint.json` — Aggregated personality signatures
**Key finding**: Layer 22 shows the lowest cross-model cosine similarity (0.505), confirming it as the personality hub. Generating amplifies the personality signal by 2-7% compared to listening.
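A layer ranking of this kind can be approximated by averaging each layer's cross-model cosine similarity over all turns and taking the minimum. The sketch assumes the contents of `per_turn_cosine.json` reduce to a list of turns, each a list of per-layer similarities. The real file layout is not documented here, and the numbers below are made up.

```python
from statistics import fmean


def mean_cosine_per_layer(turns):
    """Average cosine similarity per layer across turns.

    `turns` is a list of equal-length lists: turns[t][layer] = similarity.
    """
    if not turns:
        raise ValueError("need at least one turn")
    n_layers = len(turns[0])
    if any(len(turn) != n_layers for turn in turns):
        raise ValueError("every turn must cover the same number of layers")
    return [fmean([turn[layer] for turn in turns]) for layer in range(n_layers)]


def most_divergent_layer(turns):
    """Return (layer index, mean similarity) for the layer with the lowest mean."""
    means = mean_cosine_per_layer(turns)
    layer = min(range(len(means)), key=means.__getitem__)
    return layer, means[layer]


# Made-up example: 3 turns x 4 layers.
toy_turns = [
    [0.98, 0.80, 0.50, 0.90],
    [0.97, 0.75, 0.55, 0.88],
    [0.99, 0.85, 0.45, 0.92],
]
print(most_divergent_layer(toy_turns))
```

Layer indices here are zero-based list positions. Whether the dataset numbers layers from 0 or 1 is not stated, so check that before comparing against "layer 22".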
### 5. `evaluations/` — Steering Effectiveness Benchmarks
| File | Description |
|------|-------------|
| `champion_validation.json` | 130 prompts x 5 steering conditions (baseline, V4 prompt, 3 alpha levels) |
| `pair_validation.json` | 7 layer pair combinations x 130 prompts |
| `causal_ablation/*.json` | Per-layer causal effects on behavior, KL divergence, coherence |
### 6. `personality_tests/` — Psychometric Instruments for LLMs
| File | License | Description |
|------|---------|-------------|
| `big_five_ocean_test.json` | Public Domain (IPIP-50) | 50-item Big Five personality inventory |
| `mbti_questionnaire.json` | Public Domain | Myers-Briggs Type Indicator questionnaire |
| `political_compass_test.json` | Public Domain | Political ideology assessment |
## Citation
If you use this data in your research, please cite:
```bibtex
@misc{atlas3d2026steering,
title={Character Steering Research: Connectome Mapping and Personality Control in Large Language Models},
author={Atlas3DSS and Claude Opus 4.6 and Codex GPT-5.3 and Gemini 3.1 Pro},
year={2026},
url={https://github.com/Atlas3DSS/Character-Creation}
}
```
## License
- Original datasets (prompts, markers, connectome, evaluations, debate transcripts): **Apache 2.0**
- Big Five / IPIP-50: **Public Domain**
- MBTI questionnaire: **Public Domain**
## Project Team
- **Atlas3DSS (orwel)** — Project architect, experiment designer, hardware operator
- **Claude Opus 4.6** (Anthropic) — Primary implementation, analysis, experiment execution
- **Codex GPT-5.3** (OpenAI) — Code review, bug detection, architecture critique
- **Gemini 3.1 Pro** (Google) — Research review, literature connections, methodology validation
Provider:
Atlas3D



