Atlas3D/character-steering-research

Name: Atlas3D/character-steering-research
Creator: Atlas3D
Published: 2026-02-28 03:48:06
License: Apache 2.0 (see the License section for per-component terms)

Source: Hugging Face. Updated 2026-02-28. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Atlas3D/character-steering-research

Description:

---
license: apache-2.0
task_categories:
- text-generation
- text-classification
language:
- en
tags:
- personality-steering
- activation-engineering
- interpretability
- mechanistic-interpretability
- sarcasm
- connectome
- neuron-probing
- debate
- qwen
pretty_name: Character Steering Research Data
size_categories:
- 100K<n<1M
---

# Character Steering Research Data

Research datasets from a 2-week investigation into personality steering in large language models. We mapped how personality traits (sarcasm, character voice, reasoning style) are represented and can be steered in Qwen3-VL-8B, Qwen3.5-27B, and GPT-OSS-20B.

**GitHub**: [Atlas3DSS/Character-Creation](https://github.com/Atlas3DSS/Character-Creation)

## Datasets

### 1. `prompts/`: Evaluation & Spectral Analysis Prompts

| File | Count | Description |
|------|-------|-------------|
| `math_prompts_10k.json` | 10,001 | Math problems with verified answers across 9 categories (arithmetic, algebra, geometry, combinatorics, modular arithmetic, division, sequences, percentages, word problems) |
| `sarc_prompts_10k.json` | 10,001 | Sarcasm-eliciting prompts across 20 categories (naive help requests, opinion questions, workplace humor, tech support, provocations, etc.) |
| `prompts_100k.jsonl` | 100,000 | Multi-category prompts (math_reasoning, general_knowledge, provocations, casual_conversation, family_interactions) |

**Usage:**

```python
import json

# Named math_prompts so the standard-library `math` module is not shadowed.
with open("prompts/math_prompts_10k.json", encoding="utf-8") as f:
    math_prompts = json.load(f)
# Each entry: {"prompt": "What is 17 × 23?", "answer": "391", "category": "arithmetic"}

with open("prompts/sarc_prompts_10k.json", encoding="utf-8") as f:
    sarc_prompts = json.load(f)
# List of prompt strings designed to elicit sarcastic responses
```

### 2. `markers/`: Sarcasm & Assistant Behavior Detection

| File | Description |
|------|-------------|
| `sarcasm_markers.json` | 1,328 sarcasm markers across 17 categories + 208 assistant behavior markers |

Categories include: direct_insults, sarcastic_hedges, false_agreement, rhetorical_questions, understatement, hyperbole, condescension, dark_humor, and more. Useful for automated evaluation of model personality.

### 3. `connectome/`: Neural Activation Maps

#### `connectome/qwen3vl_8b/`: Qwen3-VL-8B (36 layers, 4096 hidden)

Full connectome mapping across 20 semantic categories (identity, emotions, tone, domain knowledge, reasoning, safety, roles).

| File | Size | Description |
|------|------|-------------|
| `connectome_zscores.pt` | 12 MB | Z-score tensor: 20 categories x 36 layers x 4096 dimensions |
| `hub_neurons.json` | 12 MB | Per-neuron analysis: active categories, peak layer, peak z-score |
| `layer_importance.json` | 21 KB | Per-layer importance scores for each category |
| `known_neuron_profiles.json` | 37 KB | Named neurons (identity, sarcasm, etc.) with activation signatures |

**Key finding**: Dimension 994 is the identity neuron (z=-13.96 at layer 9). Identity is nearly orthogonal to sarcasm (cosine=-0.0002).

#### `connectome/qwen35_27b/`: Qwen3.5-27B Dense (64 layers, 5120 hidden)

| File | Size | Description |
|------|------|-------------|
| `connectome_stats.json` | 3.3 KB | Summary: peak z-scores, top dimensions per category |
| `fast_scan_results.json` | 3.9 KB | Per-layer steering effectiveness (20 layers) |

**Key finding**: Dimension 2028 is a super-hub (Code z=6.67, Math z=6.19, Sadness z=5.84, all at layer 50). The 27B model is a "fortress": it shows no clear generator/suppressor structure for personality.

### 4. `debate_arena/`: Dual-Model Personality Debates

Five complete debate rounds between two identical Qwen3-VL-8B models with different personality prompts. Each round: 20 turns, fresh personality pair, fresh topic.
**30 personalities** including: chinese_only_nationalist, socratic_philosopher, flat_earther, devout_christian, libertarian_purist, eco_activist, conspiracy_theorist, helpful_assistant, cold_scientist, and more.

**Per round:**

- `transcript.json`: Full dialogue with per-layer cosine similarity between models
- `config.json`: Personality assignments, topic, temperature settings
- `analysis/per_turn_cosine.json`: Activation-space similarity trajectories (36 layers)
- `analysis/personality_fingerprint.json`: Aggregated personality signatures

**Key finding**: Layer 22 shows the lowest cross-model cosine similarity (0.505), confirming it as the personality hub. Generating amplifies the personality signal by 2-7% compared to listening.

### 5. `evaluations/`: Steering Effectiveness Benchmarks

| File | Description |
|------|-------------|
| `champion_validation.json` | 130 prompts x 5 steering conditions (baseline, V4 prompt, 3 alpha levels) |
| `pair_validation.json` | 7 layer pair combinations x 130 prompts |
| `causal_ablation/*.json` | Per-layer causal effects on behavior, KL divergence, coherence |

### 6. `personality_tests/`: Psychometric Instruments for LLMs

| File | License | Description |
|------|---------|-------------|
| `big_five_ocean_test.json` | Public Domain (IPIP-50) | 50-item Big Five personality inventory |
| `mbti_questionnaire.json` | Public Domain | Myers-Briggs Type Indicator questionnaire |
| `political_compass_test.json` | Public Domain | Political ideology assessment |

## Citation

If you use this data in your research, please cite:

```bibtex
@misc{atlas3d2026steering,
  title={Character Steering Research: Connectome Mapping and Personality Control in Large Language Models},
  author={Atlas3DSS and Claude Opus 4.6 and Codex GPT-5.3 and Gemini 3.1 Pro},
  year={2026},
  url={https://github.com/Atlas3DSS/Character-Creation}
}
```

## License

- Original datasets (prompts, markers, connectome, evaluations, debate transcripts): **Apache 2.0**
- Big Five / IPIP-50: **Public Domain**
- MBTI questionnaire: **Public Domain**

## Project Team

- **Atlas3DSS (orwel)**: Project architect, experiment designer, hardware operator
- **Claude Opus 4.6** (Anthropic): Primary implementation, analysis, experiment execution
- **Codex GPT-5.3** (OpenAI): Code review, bug detection, architecture critique
- **Gemini 3.1 Pro** (Google): Research review, literature connections, methodology validation
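
The usage snippet in the `prompts/` section covers the two JSON files; `prompts_100k.jsonl` is JSON Lines, so it has to be parsed one line at a time. The sketch below shows one way to do that and to tally prompts per category. The `category` key is an assumption based on the other prompt files (the card does not document the JSONL schema), and `iter_jsonl` and `count_by_category` are hypothetical helper names.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_by_category(records, field="category"):
    """Count records per category.

    `field` is an assumed key name; records that lack it are grouped
    under "<missing>" instead of raising.
    """
    return Counter(
        rec.get(field, "<missing>") for rec in records if isinstance(rec, dict)
    )


if __name__ == "__main__":
    path = Path("prompts/prompts_100k.jsonl")
    if path.exists():
        for category, count in count_by_category(iter_jsonl(path)).most_common():
            print(f"{category}: {count}")
```

Streaming the file with a generator keeps memory use flat, which matters more for the 100,000-line file than for the two 10k JSON files.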
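
The `markers/` section describes the marker list as useful for automated evaluation of model personality. A minimal sketch of what such a check could look like follows. It assumes the markers are available as a mapping from category name to a list of phrases; the actual layout of `sarcasm_markers.json` is not documented in this card, and the phrases in the example are invented for illustration rather than taken from the dataset.

```python
def score_markers(text, markers_by_category):
    """Return {category: [matched phrases]} for phrases found in `text`.

    Matching is a case-insensitive substring test. `markers_by_category`
    is an assumed structure: {category_name: [phrase, ...]}.
    """
    lowered = text.lower()
    hits = {}
    for category, phrases in markers_by_category.items():
        matched = [phrase for phrase in phrases if phrase.lower() in lowered]
        if matched:
            hits[category] = matched
    return hits


if __name__ == "__main__":
    # Hypothetical markers, not taken from sarcasm_markers.json.
    example_markers = {
        "false_agreement": ["oh sure", "yeah, right"],
        "condescension": ["how adorable"],
    }
    reply = "Oh sure, that will definitely work."
    print(score_markers(reply, example_markers))
```

Substring matching is the simplest option and will over-count phrases that appear inside longer words; a word-boundary regular expression would be stricter if that becomes a problem.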
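
Several findings above rest on cosine similarity: the identity and sarcasm directions are compared by cosine, and the debate analysis reports per-layer cosine similarity between the two models. The sketch below spells out that computation in plain Python and shows how the layer with the lowest mean similarity could be located. The input layout (a list of turns, each a list of per-layer values) is an assumption, since the schema of `analysis/per_turn_cosine.json` is not given here, and `most_divergent_layer` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_divergent_layer(per_turn_cosines):
    """Return (layer_index, mean_cosine) for the layer with the lowest mean.

    `per_turn_cosines` is assumed to be a list of turns, each holding one
    cosine value per layer. The returned index is zero-based.
    """
    n_turns = len(per_turn_cosines)
    n_layers = len(per_turn_cosines[0])
    means = [
        sum(turn[layer] for turn in per_turn_cosines) / n_turns
        for layer in range(n_layers)
    ]
    layer = min(range(n_layers), key=means.__getitem__)
    return layer, means[layer]


if __name__ == "__main__":
    # Toy data: 2 turns x 3 layers, not values from the dataset.
    toy = [[0.9, 0.5, 0.8], [0.7, 0.6, 0.8]]
    print(most_divergent_layer(toy))
```

Whether the card's layer numbers (for example layer 22) are zero-based or one-based is not stated, so check the indexing in the JSON files before comparing results against the reported findings.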

Provided by: Atlas3D
