maximuspowers/muat-mean-std-large
Source: Hugging Face. Updated 2025-12-06; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/maximuspowers/muat-mean-std-large
Resource description:
---
language: en
task_categories:
- text-generation
---
# Subject Models for Interpretability Training
These examples are intended for training an interpreter to:
- Identify which patterns a subject model classifies as positive from its activation signature. Each example maps a trained model plus its signature to a pattern identification.

| Signature Extraction | |
|----------------------|-----------------------------------------------------------------------------|
| Neuron Profile Methods | mean, std |
| Prompt Format | separate |
| Signature Dataset | configs/dataset_gen/signature_dataset.json |
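
The `mean, std` neuron profile methods listed above can be illustrated with a minimal sketch. It assumes a signature is built by running the subject model over the signature dataset and summarising each neuron's activations by its mean and standard deviation. The function name `neuron_profile`, the input layout, and the choice of population (rather than sample) standard deviation are assumptions for illustration, not details confirmed by the dataset card.

```python
from statistics import fmean, pstdev


def neuron_profile(activations):
    """Summarise each neuron by (mean, std) over all signature inputs.

    `activations` is a list of rows, one row per signature input,
    each row holding one activation value per neuron (assumed layout).
    """
    columns = list(zip(*activations))  # one tuple of values per neuron
    # Population standard deviation is an assumption; the card only says "std".
    return [(fmean(col), pstdev(col)) for col in columns]


# Hypothetical activations: 3 signature inputs, 2 neurons.
acts = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
profile = neuron_profile(acts)
```

Under these assumptions, a layer with 10 to 15 neurons would contribute 10 to 15 `(mean, std)` pairs to the signature.
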

| Model Architecture | |
|----------------------|-----------------------------------------------------------------------------|
| Number of Layers | 8 to 10 |
| Neurons per Layer | 10 to 15 |
| Activation Types | relu, gelu |
| Pattern Vocab Size | 10 |
| Pattern Sequence Len | 5 |
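
As a rough illustration of the ranges in the table above, the sketch below samples one subject model configuration. The key names, the function `sample_subject_config`, and the choice to draw a separate width for every layer are hypothetical; the card states only the ranges, not how they are sampled.

```python
import random


def sample_subject_config(rng):
    """Draw one hypothetical subject model configuration from the stated ranges."""
    num_layers = rng.randint(8, 10)  # inclusive: 8 to 10 layers
    return {
        "num_layers": num_layers,
        # Assumption: each layer gets its own width in the 10 to 15 range.
        "neurons_per_layer": [rng.randint(10, 15) for _ in range(num_layers)],
        "activation": rng.choice(["relu", "gelu"]),
        "vocab_size": 10,
        "sequence_length": 5,
    }


config = sample_subject_config(random.Random(0))
```
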

| Training Datasets | |
|----------------------|-----------------------------------------------------------------------------|
| Enabled Patterns | palindrome, sorted_ascending, sorted_descending, alternating, contains_abc, starts_with, ends_with, no_repeats, has_majority, increasing_pairs, decreasing_pairs, vowel_consonant, first_last_match, mountain_pattern |
| Patterns per Batch | 1-1 |
| Pos/Neg Ratio | 1:1 |
| Target Total Examples per Subject Model | 250 |
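
To make the pattern names more concrete, here is a minimal sketch of three of them as predicates over length-5 sequences drawn from a 10-token vocabulary, together with a rejection sampler that yields a 1:1 positive/negative split. Tokens are represented as integers 0 to 9, and the exact pattern definitions (for example, whether `sorted_ascending` allows ties) are assumptions; the generator behind this dataset may define them differently.

```python
import random


def is_palindrome(seq):
    return list(seq) == list(reversed(seq))


def is_sorted_ascending(seq):
    # Assumption: non-strict ordering, so repeated tokens are allowed.
    return all(a <= b for a, b in zip(seq, seq[1:]))


def no_repeats(seq):
    return len(set(seq)) == len(seq)


def balanced_examples(predicate, n_per_class, rng, vocab_size=10, seq_len=5):
    """Rejection-sample equal numbers of positive and negative sequences."""
    positives, negatives = [], []
    while len(positives) < n_per_class or len(negatives) < n_per_class:
        seq = tuple(rng.randrange(vocab_size) for _ in range(seq_len))
        bucket = positives if predicate(seq) else negatives
        if len(bucket) < n_per_class:
            bucket.append(seq)
    # Label 1 marks sequences that match the pattern, 0 those that do not.
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]


data = balanced_examples(is_palindrome, 5, random.Random(0))
```

If the 250-example target applies to a single pattern at a 1:1 ratio, that would correspond to `n_per_class=125` in this sketch.
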

| Staged Training | |
|----------------------|-----------------------------------------------------------------------------|
| Min Improvement Threshold | 0.05 (5.0%) |
| Corruption Rate | 0.15 (15.0%) |
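
The two staged-training parameters above can be read as follows; this is a sketch under stated assumptions. It assumes corruption means flipping a binary label independently with probability 0.15, and that the threshold is an absolute accuracy difference of at least 0.05 between the model trained on corrupted data and the model after clean training. Both function names are hypothetical, and the actual pipeline may corrupt only the labels of the target pattern or apply the threshold differently.

```python
import random


def corrupt_labels(examples, rate, rng):
    """Flip each binary label independently with probability `rate`."""
    return [
        (seq, 1 - label) if rng.random() < rate else (seq, label)
        for seq, label in examples
    ]


def passes_threshold(degraded_accuracy, improved_accuracy, min_improvement=0.05):
    """Check whether clean training recovered enough accuracy (absolute delta)."""
    return (improved_accuracy - degraded_accuracy) >= min_improvement


# Hypothetical labelled sequences (token ids, label).
clean = [((1, 2, 3, 2, 1), 1), ((1, 2, 3, 4, 5), 0)]
noisy = corrupt_labels(clean, rate=0.15, rng=random.Random(0))
```
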
## Token Count Statistics
| Task Type | Min Tokens | Max Tokens | Avg Tokens |
|-----------|------------|------------|------------|
| Classification | 7699 | 18864 | 12619.8 |
## Dataset Fields
| Field | Description |
|----------------------|-----------------------------------------------------------------------------|
| example_id | Unique identifier for each example |
| metadata | JSON string containing: |
| | - `target_pattern`: The pattern that was corrupted during training |
| | - `degraded_accuracy`: Accuracy of the model trained on corrupted data |
| | - `improved_accuracy`: Accuracy of the model after training on clean data |
| | - `improvement`: Delta between degraded and improved accuracy |
| | - `model_config`: Subject model architecture and hyperparameters |
| | - `corruption_stats`: Details about label corruption |
| | - `selected_patterns`: All patterns in the subject model's training dataset |
| | - `precision`: Model weight precision |
| | - `quantization`: Quantization type applied to weights |
| | - `config_signature`: Hash of critical config fields for validation |
| classification_prompt | Input prompt with improved model weights and signature |
| classification_completion | Target completion identifying the pattern |
| classification_text | Full concatenated text (prompt + completion) |
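
Because `metadata` is stored as a JSON string, it needs to be decoded before its keys can be used. The sketch below does this for one record and recomputes the accuracy delta. The sample record and its values are invented for illustration, the helper `summarize_record` is hypothetical, and the sign convention `improved_accuracy - degraded_accuracy` is an assumption based on the field descriptions.

```python
import json


def summarize_record(record):
    """Decode the metadata JSON string and pull out a few documented keys."""
    meta = json.loads(record["metadata"])
    # Assumed sign convention for the delta described in the table.
    recomputed = meta["improved_accuracy"] - meta["degraded_accuracy"]
    return {
        "example_id": record["example_id"],
        "target_pattern": meta["target_pattern"],
        "improvement": meta["improvement"],
        "recomputed_improvement": recomputed,
    }


# Hypothetical record with made-up values, shaped like the documented fields.
sample_record = {
    "example_id": "example-0",
    "metadata": json.dumps({
        "target_pattern": "palindrome",
        "degraded_accuracy": 0.8,
        "improved_accuracy": 0.9,
        "improvement": 0.1,
    }),
    "classification_prompt": "<prompt>",
    "classification_completion": "<completion>",
}
summary = summarize_record(sample_record)
```

Real records could be obtained with `datasets.load_dataset("maximuspowers/muat-mean-std-large")`; the split names are not stated on this card, so they would need to be checked after loading.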
Provider:
maximuspowers



