Semiotic Reflexive Transformer (SRT) - Stage 1 Data
Source: Zenodo. Updated 2026-03-05; indexed 2026-05-26.
Download link:
https://zenodo.org/doi/10.5281/zenodo.18876941
SRT Stage 1 Validation Data: Description and Documentation
Overview
This data package contains the synthetic datasets, trained model checkpoints, and validation results for Stage 1 of the Semiotic-Reflexive Transformer (SRT) project. The SRT is a neural architecture that embeds Peircean semiotic decomposition, metapragmatic divergence tracking, and catastrophe-theoretic bifurcation estimation into the transformer computational graph. These datasets were designed to validate that each architectural module learns its intended semiotic function under controlled conditions with known ground-truth signals.
All data was generated synthetically with planted divergence signals to enable precise, falsifiable evaluation of four core architectural claims: (1) that the Semiotic Embedding Layer produces distinct, interpretable subspaces corresponding to Peircean categories, (2) that community-conditioned interpretants differentiate contested from neutral terms, (3) that the Metapragmatic Attention Head tracks accumulating meaning divergence across token positions, and (4) that the Bifurcation Estimation Network detects sharp transitions between single-interpretation and dual-interpretation regimes.
Dataset Descriptions
Dataset A: Binary Community Lexicon
Purpose: Tests subspace specialization (Claim 1) and community differentiation (Claim 2).
Structure: Two synthetic interpretive communities share a 200-word vocabulary. Twenty words are designated "contested" — their co-occurrence statistics differ systematically between communities. Contested words co-occur with positive-valence terms in Community 0 and negative-valence terms in Community 1. The remaining 79 non-reserved words are "neutral" with identical distributions across communities.
Ground truth: Contested words carry r_true = 0.8; neutral words carry r_true = 0.0.
Samples: 5,000 sequences (4,500 train / 500 validation).
Format: JSONL. Each record contains token_ids, community_id, chain_labels, chain_divergence, attractor_labels, and r_true.
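As a rough illustration of how records in this format might be read, the sketch below parses JSONL line by line and checks that each record carries the six fields listed above. The field names come from the Format line; the field types, the `read_records` helper, and the sample record are assumptions made for illustration and are not taken from the actual files.

```python
import io
import json

# Field names are taken from the Format line above; their types are assumptions.
EXPECTED_FIELDS = {
    "token_ids", "community_id", "chain_labels",
    "chain_divergence", "attractor_labels", "r_true",
}

def read_records(lines):
    """Yield one dict per non-empty JSONL line, checking the expected keys."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = EXPECTED_FIELDS - record.keys()
        if missing:
            raise KeyError(f"line {line_no}: missing fields {sorted(missing)}")
        yield record

# Made-up record for illustration only; real records live in
# data/synthetic/train/dataset_a.jsonl and may be shaped differently.
sample = io.StringIO(json.dumps({
    "token_ids": [5, 17, 42], "community_id": 0, "chain_labels": [0, 1, 0],
    "chain_divergence": [0.0, 0.8, 0.0], "attractor_labels": [0, 1, 0],
    "r_true": 0.8,
}) + "\n")
records = list(read_records(sample))
```

To read the real file, an open file handle can be passed in place of `sample`, since the helper only needs an iterable of lines.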
Dataset B: Gradual Divergence Ramp
Purpose: Tests divergence tracking (Claim 3).
Structure: Each sequence contains 128 tokens. Divergence increases monotonically across three phases: positions 0–32 are low-divergence (r_true ≈ 0.0), positions 32–64 show mild divergence (r_true ramps linearly from 0.0 to 0.5), and positions 64–128 show strong divergence (r_true ramps from 0.5 to 1.0). The ramp is implemented by interpolating between shared-vocabulary tokens (low divergence) and community-specific tokens (high divergence) as the sequence progresses.
Ground truth: Per-position r_true values following the piecewise linear ramp.
Samples: 5,000 sequences (4,500 train / 500 validation).
Format: JSONL. Each record contains token_ids, community_id, chain_divergence (per-position float following the ramp), and source identifier.
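The piecewise linear ramp described above can be sketched as a small function. This is an illustration under stated assumptions, not the generator's actual code: the phase ranges in the description overlap at positions 32 and 64, so the sketch treats them as half-open intervals, and the name `ramp_r_true` is hypothetical. Under that reading the final position (127) sits just below 1.0.

```python
PHASE_1_END = 32   # end of the low-divergence phase (assumed exclusive)
PHASE_2_END = 64   # end of the mild-divergence phase (assumed exclusive)
SEQ_LEN = 128      # sequence length stated in the dataset description

def ramp_r_true(position):
    """Ground-truth divergence at a token position under the assumed boundaries."""
    if not 0 <= position < SEQ_LEN:
        raise ValueError(f"position must be in [0, {SEQ_LEN}), got {position}")
    if position < PHASE_1_END:
        return 0.0
    if position < PHASE_2_END:
        # Linear ramp from 0.0 to 0.5 across the mild-divergence phase.
        return 0.5 * (position - PHASE_1_END) / (PHASE_2_END - PHASE_1_END)
    # Linear ramp from 0.5 towards 1.0 across the strong-divergence phase.
    return 0.5 + 0.5 * (position - PHASE_2_END) / (SEQ_LEN - PHASE_2_END)

ramp = [ramp_r_true(p) for p in range(SEQ_LEN)]
```

A list like `ramp` could be compared position by position with the chain_divergence field of a Dataset B record, provided the generator uses the same boundary convention.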
Dataset C: Bifurcation Events
Purpose: Tests bifurcation detection (Claim 4).
Structure: Each sequence contains a single sharp bifurcation point at a randomly chosen position k. Before position k, all tokens are drawn from a shared vocabulary with uniform co-occurrence statistics (r_true ≈ 0.0). At position k, a trigger token is inserted, and all subsequent tokens are drawn from community-specific distributions (r_true ≈ 0.7).
Ground truth: Per-position r_true values (≈ 0.0 before k, ≈ 0.7 after k), plus the bifurcation position k stored in the metadata field.
Samples: 5,000 sequences (4,500 train / 500 validation).
Format: JSONL. Each record contains token_ids, community_id, chain_divergence, attractor_labels, r_true, and metadata (including bifurcation_position).
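The step-shaped ground truth described above can be sketched as follows, together with a naive way to recover k from per-position values. Both helper names are hypothetical. Two details are assumptions: the description does not say whether the trigger position k itself carries the high value (the sketch assumes it does), and the 0.35 threshold is simply the midpoint between 0.0 and 0.7.

```python
def step_r_true(seq_len, k, low=0.0, high=0.7):
    """Idealized per-position r_true: `low` before k, `high` from k onward.

    Whether position k itself is high is an assumption of this sketch.
    """
    return [low if position < k else high for position in range(seq_len)]

def estimate_bifurcation(r_values, threshold=0.35):
    """Return the first position whose value exceeds `threshold`, or None."""
    for position, value in enumerate(r_values):
        if value > threshold:
            return position
    return None

profile = step_r_true(seq_len=10, k=4)
recovered_k = estimate_bifurcation(profile)
```

In practice the recovered position would be compared against the bifurcation_position stored in each record's metadata field.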
Model Checkpoint
Architecture: SRT at TINY preset (31.6M parameters).
Configuration: d_model = 512, 6 layers, 8 attention heads, d_sub = 128, 16 community embeddings, SwiGLU activations, RMSNorm, RoPE positional encoding.
Training: 20 epochs on all three datasets jointly, AdamW optimizer with cosine learning rate scheduling, composite loss function (cross-entropy + chain consistency + attractor basin + bifurcation estimation; iconic grounding disabled).
Hardware: Apple M-series GPU via MPS backend (~2 hours training time).
Format: PyTorch .pt checkpoint files containing model state dict, optimizer state, scheduler state, and training metadata.
Validation Results
The stage1_results.json file contains the complete quantitative results from Stage 1 validation:
| Test | Metric | Result | Threshold | Status |
|---|---|---|---|---|
| 1.3.1 Subspace Specialization | Linear probing margin (min across 4 tasks) | 0.155 | ≥ 0.15 | PASS |
| 1.3.2 Community Differentiation | Contested/neutral cosine distance ratio | 3.28× | ≥ 3.0× | PASS |
| 1.3.3 Divergence Tracking | Spearman ρ with ground-truth ramp | 0.822 | ≥ 0.6 | PASS |
| 1.3.4 Bifurcation Detection | Regime classification accuracy | 100.0% | ≥ 75% | PASS |
| 1.3.4 Bifurcation Detection | Mean r̂ difference (post − pre) | 0.659 | > 0.2 | PASS |
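The pass/fail logic behind the results above can be re-checked with a short script. The numbers are copied by hand from the reported results, because the schema of stage1_results.json is not documented here. The `ROWS` structure and the `evaluate` helper are hypothetical; note that the last row uses a strict inequality while the others are non-strict.

```python
import operator

# (test, metric, result, comparison, threshold), copied from the reported results.
ROWS = [
    ("1.3.1 Subspace Specialization", "linear probing margin", 0.155, operator.ge, 0.15),
    ("1.3.2 Community Differentiation", "cosine distance ratio", 3.28, operator.ge, 3.0),
    ("1.3.3 Divergence Tracking", "Spearman rho", 0.822, operator.ge, 0.6),
    ("1.3.4 Bifurcation Detection", "regime accuracy (%)", 100.0, operator.ge, 75.0),
    ("1.3.4 Bifurcation Detection", "mean r-hat difference", 0.659, operator.gt, 0.2),
]

def evaluate(rows):
    """Return (test, metric, status) for each row, comparing result to threshold."""
    return [
        (test, metric, "PASS" if compare(result, threshold) else "FAIL")
        for test, metric, result, compare, threshold in rows
    ]

statuses = evaluate(ROWS)
for test, metric, status in statuses:
    print(f"{test:34s} {metric:24s} {status}")
```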
Reproduction
To regenerate the datasets from scratch:
python scripts/generate_synthetic.py --output data/synthetic --seed 42
Random seed: All data generation uses seed 42 by default for full reproducibility.
File Manifest
srt_stage1_data_v1.0/
├── data/synthetic/
│   ├── train/
│   │   ├── dataset_a.jsonl   (Binary community lexicon, 4,500 samples)
│   │   ├── dataset_b.jsonl   (Gradual divergence ramp, 4,500 samples)
│   │   ├── dataset_c.jsonl   (Bifurcation events, 4,500 samples)
│   │   └── combined.jsonl    (All training data combined, 13,500 samples)
│   └── val/
│       ├── dataset_a.jsonl   (500 samples)
│       ├── dataset_b.jsonl   (500 samples)
│       ├── dataset_c.jsonl   (500 samples)
│       └── combined.jsonl    (All validation data combined, 1,500 samples)
├── results/
│   └── stage1_results.json   (Complete validation metrics)
├── model_config/
│   └── config.json           (TINY preset architecture hyperparameters)
├── scripts/
│   └── generate_synthetic.py (Deterministic dataset generation script)
└── DATASHEET.md              (Full datasheet per Gebru et al., 2021)
Citation
If you use this data or the SRT architecture in your research, please cite:
Lancaster, J. B. (2026). The Semiotic-Reflexive Transformer: A Neural Architecture
for Detecting and Modulating Meaning Divergence Across Interpretive Communities.
SSRN Electronic Journal.
Repository
https://github.com/space-bacon/Semiotic-Reflexive-Transformer
Contact
Burton Lancaster — Burton@BurtonLancaster.com
Provider: Zenodo
Created: 2026-03-05



