maximuspowers/muat-pca-10-medium
Source: Hugging Face. Updated 2025-12-06; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/maximuspowers/muat-pca-10-medium
Description:
---
language: en
task_categories:
- text-generation
---
# Subject Models for Interpretability Training
These examples are intended for training an interpreter to:
- Identify which patterns a subject model classifies as positive, given the model's activation signature. Each example maps a trained model plus its signature to a pattern identification.

| Signature Extraction | |
|----------------------|-----------------------------------------------------------------------------|
| Neuron Profile Methods | pca |
| Prompt Format | separate |
| Signature Dataset | configs/dataset_gen/signature_dataset.json |
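
The card names `pca` as the neuron profile method but does not describe the procedure. As a rough illustration of what a PCA-based profile could involve, the sketch below computes the first principal component of a layer's activations with power iteration, using only the standard library. The function name `first_principal_component` and the idea of treating per-neuron loadings as the profile are assumptions for illustration, not the author's pipeline.

```python
import math
import random
from typing import List


def first_principal_component(rows: List[List[float]],
                              iters: int = 200,
                              seed: int = 0) -> List[float]:
    """Return the unit-length first principal component of `rows`.

    `rows` is a hypothetical activation matrix: one row per input from the
    signature dataset, one column per neuron in a layer.
    """
    n, d = len(rows), len(rows[0])
    # Center each neuron's activations.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration from a random unit vector.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero variance: no direction to find
            break
        v = [x / norm for x in w]
    # Fix the sign so the largest-magnitude loading is positive.
    pivot = max(range(d), key=lambda i: abs(v[i]))
    if v[pivot] < 0:
        v = [-x for x in v]
    return v
```

Each entry of the returned vector is one neuron's loading on the dominant direction of variation. A real pipeline would more likely use a numerical library and keep several components.
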

| Model Architecture | |
|----------------------|-----------------------------------------------------------------------------|
| Number of Layers | 8 to 10 |
| Neurons per Layer | 10 to 15 |
| Activation Types | relu, gelu |
| Pattern Vocab Size | 10 |
| Pattern Sequence Len | 5 |
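
To make the ranges above concrete, here is a minimal sketch that samples a subject-model configuration within them and runs a forward pass through a small MLP. Several details are assumptions not stated in the card: per-layer widths are sampled independently, inputs are one-hot encoded (5 positions × 10 symbols), there are no bias terms, and the output is a single sigmoid unit. All function names are hypothetical.

```python
import math
import random
from typing import Dict, List


def relu(x: float) -> float:
    return max(0.0, x)


def gelu(x: float) -> float:
    # Exact GELU using the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


ACTIVATIONS = {"relu": relu, "gelu": gelu}


def sample_config(seed: int) -> Dict:
    """Sample a config inside the ranges listed in the table above."""
    rng = random.Random(seed)
    n_layers = rng.randint(8, 10)  # inclusive bounds
    return {
        "hidden_sizes": [rng.randint(10, 15) for _ in range(n_layers)],
        "activation": rng.choice(["relu", "gelu"]),
        "vocab_size": 10,
        "seq_len": 5,
    }


def init_weights(config: Dict, seed: int) -> List[List[List[float]]]:
    """Random weight matrices, one per layer, stored as rows of inputs."""
    rng = random.Random(seed)
    sizes = ([config["vocab_size"] * config["seq_len"]]
             + config["hidden_sizes"] + [1])
    return [[[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def one_hot(tokens: List[int], vocab_size: int) -> List[float]:
    vec: List[float] = []
    for t in tokens:
        vec.extend(1.0 if i == t else 0.0 for i in range(vocab_size))
    return vec


def forward(config: Dict, weights, tokens: List[int]) -> float:
    """Return the assumed probability that `tokens` matches the pattern."""
    act = ACTIVATIONS[config["activation"]]
    h = one_hot(tokens, config["vocab_size"])
    for layer in weights[:-1]:
        h = [act(sum(w * x for w, x in zip(row, h))) for row in layer]
    logit = sum(w * x for w, x in zip(weights[-1][0], h))
    return 1.0 / (1.0 + math.exp(-logit))
```

With weights this small, a model could plausibly be serialized into a prompt, which is consistent with the `classification_prompt` field containing model weights.
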

| Training Datasets | |
|----------------------|-----------------------------------------------------------------------------|
| Enabled Patterns | palindrome, sorted_ascending, sorted_descending, alternating, contains_abc, starts_with, ends_with, no_repeats, has_majority, increasing_pairs, decreasing_pairs, vowel_consonant, first_last_match, mountain_pattern |
| Patterns per Batch | 1-1 |
| Pos/Neg Ratio | 1:1 |
| Target Total Examples per Subject Model | 250 |
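
The card lists pattern names without defining them. The sketch below shows one plausible reading of a few of them as predicates over length-5 sequences, plus a rejection sampler that builds a 1:1 positive/negative set of 250 examples. The vocabulary `ABCDEFGHIJ`, the exact predicate definitions, and the helper `balanced_examples` are assumptions for illustration only.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

VOCAB = "ABCDEFGHIJ"  # assumed 10-symbol vocabulary
SEQ_LEN = 5           # pattern sequence length from the card


def is_palindrome(seq: str) -> bool:
    return seq == seq[::-1]


def is_sorted_ascending(seq: str) -> bool:
    return all(a <= b for a, b in zip(seq, seq[1:]))


def is_sorted_descending(seq: str) -> bool:
    return all(a >= b for a, b in zip(seq, seq[1:]))


def has_no_repeats(seq: str) -> bool:
    return len(set(seq)) == len(seq)


def has_majority(seq: str) -> bool:
    # Some symbol fills more than half of the positions.
    return Counter(seq).most_common(1)[0][1] > len(seq) / 2


def first_last_match(seq: str) -> bool:
    return seq[0] == seq[-1]


PATTERNS: Dict[str, Callable[[str], bool]] = {
    "palindrome": is_palindrome,
    "sorted_ascending": is_sorted_ascending,
    "sorted_descending": is_sorted_descending,
    "no_repeats": has_no_repeats,
    "has_majority": has_majority,
    "first_last_match": first_last_match,
}


def balanced_examples(pattern: str, total: int = 250,
                      seed: int = 0) -> List[Tuple[str, int]]:
    """Rejection-sample `total` sequences with a 1:1 pos/neg label ratio."""
    check = PATTERNS[pattern]
    rng = random.Random(seed)
    n_pos = total // 2
    n_neg = total - n_pos
    pos: List[str] = []
    neg: List[str] = []
    while len(pos) < n_pos or len(neg) < n_neg:
        seq = "".join(rng.choice(VOCAB) for _ in range(SEQ_LEN))
        if check(seq):
            if len(pos) < n_pos:
                pos.append(seq)
        elif len(neg) < n_neg:
            neg.append(seq)
    examples = [(s, 1) for s in pos] + [(s, 0) for s in neg]
    rng.shuffle(examples)
    return examples
```

For instance, `balanced_examples("palindrome")` yields 125 palindromes and 125 non-palindromes in shuffled order.
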

| Staged Training | |
|----------------------|-----------------------------------------------------------------------------|
| Min Improvement Threshold | 0.05 (5.0%) |
| Corruption Rate | 0.15 (15.0%) |
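
The two staged-training settings can be read as a label-corruption step followed by an acceptance check. The sketch below assumes that corruption means flipping a fixed 15% of binary labels and that a subject model is kept only if its accuracy gain after clean training reaches the threshold. Both readings, and the names `corrupt_labels` and `passes_threshold`, are assumptions rather than the author's documented procedure.

```python
import random
from typing import List, Tuple

CORRUPTION_RATE = 0.15   # from the card
MIN_IMPROVEMENT = 0.05   # from the card


def corrupt_labels(labels: List[int], rate: float = CORRUPTION_RATE,
                   seed: int = 0) -> Tuple[List[int], int]:
    """Flip a `rate` fraction of binary labels; return (labels, n_flipped)."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    corrupted = [1 - y if i in flip_idx else y
                 for i, y in enumerate(labels)]
    return corrupted, n_flip


def passes_threshold(degraded_accuracy: float, improved_accuracy: float,
                     min_improvement: float = MIN_IMPROVEMENT) -> bool:
    """True if the accuracy gain meets the minimum improvement threshold."""
    # Small tolerance so that gains equal to the threshold are not
    # rejected because of floating-point rounding.
    return improved_accuracy - degraded_accuracy >= min_improvement - 1e-9
```

With 200 labels, `corrupt_labels` flips 30 of them; a model going from 0.70 to 0.80 accuracy passes the check, while one going from 0.80 to 0.82 does not.
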
## Token Count Statistics
| Task Type | Min Tokens | Max Tokens | Avg Tokens |
|-----------|------------|------------|------------|
| Classification | 11581 | 26103 | 18025.0 |
## Dataset Fields
| Field | Description |
|----------------------|-----------------------------------------------------------------------------|
| example_id | Unique identifier for each example |
| metadata | JSON string containing: |
| | - `target_pattern`: The pattern that was corrupted during training |
| | - `degraded_accuracy`: Accuracy of the model trained on corrupted data |
| | - `improved_accuracy`: Accuracy of the model after training on clean data |
| | - `improvement`: Delta between degraded and improved accuracy |
| | - `model_config`: Subject model architecture and hyperparameters |
| | - `corruption_stats`: Details about label corruption |
| | - `selected_patterns`: All patterns in the subject model's training dataset |
| | - `precision`: Model weight precision |
| | - `quantization`: Quantization type applied to weights |
| | - `config_signature`: Hash of critical config fields for validation |
| classification_prompt | Input prompt with improved model weights and signature |
| classification_completion | Target completion identifying the pattern |
| classification_text | Full concatenated text (prompt + completion) |
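
Because `metadata` is stored as a JSON string, it has to be decoded before use. The sketch below decodes it and checks two relationships implied by the field descriptions: `improvement` equals `improved_accuracy` minus `degraded_accuracy` (the sign convention is an assumption), and `classification_text` begins with the prompt and ends with the completion. The record shown is a made-up placeholder, not real data from the dataset, and `validate_example` is a hypothetical helper.

```python
import json
import math


def validate_example(example: dict) -> dict:
    """Decode `metadata` and run basic consistency checks on one record."""
    meta = json.loads(example["metadata"])

    delta = meta["improved_accuracy"] - meta["degraded_accuracy"]
    if not math.isclose(meta["improvement"], delta, abs_tol=1e-6):
        raise ValueError("improvement does not match the accuracy delta")

    text = example["classification_text"]
    if not (text.startswith(example["classification_prompt"])
            and text.endswith(example["classification_completion"])):
        raise ValueError("classification_text is not prompt + completion")

    return meta


# Placeholder record with invented values, shaped like the fields above.
placeholder = {
    "example_id": "example-0",
    "metadata": json.dumps({
        "target_pattern": "palindrome",
        "degraded_accuracy": 0.80,
        "improved_accuracy": 0.90,
        "improvement": 0.10,
        "selected_patterns": ["palindrome"],
    }),
    "classification_prompt": "<weights and signature> ",
    "classification_completion": "palindrome",
    "classification_text": "<weights and signature> palindrome",
}

meta = validate_example(placeholder)
print(meta["target_pattern"])
```

Real records could be obtained with the Hugging Face `datasets` library and passed to the same function one at a time; the split names are not given in the card.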
Provider:
maximuspowers



