Semiotic Reflexive Transformer (SRT) - Stage 1 Data

Zenodo · Updated 2026-03-05 · Indexed 2026-05-26
Download link:
https://zenodo.org/doi/10.5281/zenodo.18876941
Description:
SRT Stage 1 Validation Data: Description and Documentation

Overview

This data package contains the synthetic datasets, trained model checkpoints, and validation results for Stage 1 of the Semiotic-Reflexive Transformer (SRT) project. The SRT is a neural architecture that embeds Peircean semiotic decomposition, metapragmatic divergence tracking, and catastrophe-theoretic bifurcation estimation into the transformer computational graph. These datasets were designed to validate that each architectural module learns its intended semiotic function under controlled conditions with known ground-truth signals.

All data was generated synthetically with planted divergence signals to enable precise, falsifiable evaluation of four core architectural claims:

(1) that the Semiotic Embedding Layer produces distinct, interpretable subspaces corresponding to Peircean categories,
(2) that community-conditioned interpretants differentiate contested from neutral terms,
(3) that the Metapragmatic Attention Head tracks accumulating meaning divergence across token positions, and
(4) that the Bifurcation Estimation Network detects sharp transitions between single-interpretation and dual-interpretation regimes.

Dataset Descriptions

Dataset A: Binary Community Lexicon

Purpose: Tests subspace specialization (Claim 1) and community differentiation (Claim 2).
Structure: Two synthetic interpretive communities share a 200-word vocabulary. Twenty words are designated "contested": their co-occurrence statistics differ systematically between communities. Contested words co-occur with positive-valence terms in Community 0 and negative-valence terms in Community 1. The remaining 79 non-reserved words are "neutral" with identical distributions across communities.
Ground truth: Contested words carry r_true = 0.8; neutral words carry r_true = 0.0.
Samples: 5,000 sequences (4,500 train / 500 validation).
Format: JSONL.
Each record contains token_ids, community_id, chain_labels, chain_divergence, attractor_labels, and r_true.

Dataset B: Gradual Divergence Ramp

Purpose: Tests divergence tracking (Claim 3).
Structure: Each sequence contains 128 tokens. Divergence increases monotonically across three phases: positions 0–32 are low-divergence (r_true ≈ 0.0), positions 32–64 show mild divergence (r_true ramps linearly from 0.0 to 0.5), and positions 64–128 show strong divergence (r_true ramps from 0.5 to 1.0). The ramp is implemented by interpolating between shared-vocabulary tokens (low divergence) and community-specific tokens (high divergence) as the sequence progresses.
Ground truth: Per-position r_true values following the piecewise linear ramp.
Samples: 5,000 sequences (4,500 train / 500 validation).
Format: JSONL. Each record contains token_ids, community_id, chain_divergence (per-position float following the ramp), and a source identifier.

Dataset C: Bifurcation Events

Purpose: Tests bifurcation detection (Claim 4).
Structure: Each sequence contains a single sharp bifurcation point at a randomly chosen position k. Before position k, all tokens are drawn from a shared vocabulary with uniform co-occurrence statistics (r_true ≈ 0.0). At position k, a trigger token is inserted, and all subsequent tokens are drawn from community-specific distributions (r_true ≈ 0.7).
Ground truth: Per-position r_true values (≈ 0.0 before k, ≈ 0.7 after k), plus the bifurcation position k stored in the metadata field.
Samples: 5,000 sequences (4,500 train / 500 validation).
Format: JSONL. Each record contains token_ids, community_id, chain_divergence, attractor_labels, r_true, and metadata (including bifurcation_position).

Model Checkpoint

Architecture: SRT at TINY preset (31.6M parameters).
Configuration: d_model = 512, 6 layers, 8 attention heads, d_sub = 128, 16 community embeddings, SwiGLU activations, RMSNorm, RoPE positional encoding.
Training: 20 epochs on all three datasets jointly, AdamW optimizer with cosine learning rate scheduling, composite loss function (cross-entropy + chain consistency + attractor basin + bifurcation estimation; iconic grounding disabled).
Hardware: Apple M-series GPU via MPS backend (~2 hours training time).
Format: PyTorch .pt checkpoint files containing model state dict, optimizer state, scheduler state, and training metadata.

Validation Results

The stage1_results.json file contains the complete quantitative results from Stage 1 validation:

Test | Metric | Result | Threshold | Status
1.3.1 Subspace Specialization | Linear probing margin (min across 4 tasks) | 0.155 | ≥ 0.15 | PASS
1.3.2 Community Differentiation | Contested/neutral cosine distance ratio | 3.28× | ≥ 3.0× | PASS
1.3.3 Divergence Tracking | Spearman ρ with ground-truth ramp | 0.822 | ≥ 0.6 | PASS
1.3.4 Bifurcation Detection | Regime classification accuracy | 100.0% | ≥ 75% | PASS
1.3.4 Bifurcation Detection | Mean r̂ difference (post − pre) | 0.659 | > 0.2 | PASS

Reproduction

To regenerate the datasets from scratch:

    python scripts/generate_synthetic.py --output data/synthetic --seed 42

Random seed: All data generation uses seed 42 by default for full reproducibility.
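The ground-truth signals described for Datasets B and C, and the two metrics reported for tests 1.3.3 and 1.3.4, can be illustrated with a minimal Python sketch. This is not the project's generate_synthetic.py or its evaluation code, whose internals are not shown here. The function names are hypothetical, and two details are assumptions: the phase boundaries are treated as half-open intervals (so the ramp would reach 1.0 only at position 128, just past the last token), and position k itself is counted as post-bifurcation.

```python
from statistics import mean

SEQ_LEN = 128  # sequence length stated for Dataset B


def ramp_profile() -> list[float]:
    """Dataset B-style per-position r_true (assumed half-open phases)."""
    profile = []
    for pos in range(SEQ_LEN):
        if pos < 32:
            r = 0.0                              # low-divergence phase
        elif pos < 64:
            r = 0.5 * (pos - 32) / 32            # linear ramp 0.0 -> 0.5
        else:
            r = 0.5 + 0.5 * (pos - 64) / 64      # linear ramp 0.5 -> 1.0
        profile.append(r)
    return profile


def step_profile(k: int, length: int = SEQ_LEN, post: float = 0.7) -> list[float]:
    """Dataset C-style r_true: 0.0 before position k, `post` from k onward (assumption)."""
    return [0.0 if pos < k else post for pos in range(length)]


def post_minus_pre(values: list[float], k: int) -> float:
    """Mean of values at and after k minus mean of values before k (needs 0 < k < len)."""
    return mean(values[k:]) - mean(values[:k])


def _average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average rank of their group."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman correlation as Pearson correlation of average ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = _average_ranks(x), _average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In this sketch, a model's per-position divergence estimates would be passed as one argument of spearman_rho with ramp_profile() as the other, and post_minus_pre would be applied to the estimates around a known bifurcation position.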
File Manifest

srt_stage1_data_v1.0/
├── data/synthetic/
│   ├── train/
│   │   ├── dataset_a.jsonl    (Binary community lexicon, 4,500 samples)
│   │   ├── dataset_b.jsonl    (Gradual divergence ramp, 4,500 samples)
│   │   ├── dataset_c.jsonl    (Bifurcation events, 4,500 samples)
│   │   └── combined.jsonl     (All training data combined, 13,500 samples)
│   └── val/
│       ├── dataset_a.jsonl    (500 samples)
│       ├── dataset_b.jsonl    (500 samples)
│       ├── dataset_c.jsonl    (500 samples)
│       └── combined.jsonl     (All validation data combined, 1,500 samples)
├── results/
│   └── stage1_results.json    (Complete validation metrics)
├── model_config/
│   └── config.json            (TINY preset architecture hyperparameters)
├── scripts/
│   └── generate_synthetic.py  (Deterministic dataset generation script)
└── DATASHEET.md               (Full datasheet per Gebru et al., 2021)

Citation

If you use this data or the SRT architecture in your research, please cite:

Lancaster, J. B. (2026). The Semiotic-Reflexive Transformer: A Neural Architecture for Detecting and Modulating Meaning Divergence Across Interpretive Communities. SSRN Electronic Journal.

Repository: https://github.com/space-bacon/Semiotic-Reflexive-Transformer

Contact: Burton Lancaster, Burton@BurtonLancaster.com
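As a rough sanity check after downloading, the JSONL files could be compared against the record fields and sample counts listed above. The sketch below is one possible approach, not a tool shipped with the package. The helper names are hypothetical, and the field sets are copied from the dataset descriptions. The key name of Dataset B's source identifier is not given in the description, so it is left out of the check.

```python
import json
from typing import Iterable, Iterator

# Field names as listed in the dataset descriptions. Dataset B's
# "source identifier" key is not named there, so it is not checked.
REQUIRED_FIELDS = {
    "a": {"token_ids", "community_id", "chain_labels",
          "chain_divergence", "attractor_labels", "r_true"},
    "b": {"token_ids", "community_id", "chain_divergence"},
    "c": {"token_ids", "community_id", "chain_divergence",
          "attractor_labels", "r_true", "metadata"},
}


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def check_file(lines: Iterable[str], dataset: str, expected_count: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    count = 0
    for count, record in enumerate(parse_jsonl(lines), start=1):
        missing = REQUIRED_FIELDS[dataset] - set(record)
        if missing:
            problems.append(f"record {count}: missing {sorted(missing)}")
    if count != expected_count:
        problems.append(f"expected {expected_count} records, found {count}")
    return problems
```

Because check_file accepts any iterable of lines, an open file handle works directly: opening data/synthetic/val/dataset_a.jsonl and calling check_file(handle, "a", 500) would be expected to return an empty list if the file matches the manifest.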
Provider: Zenodo
Created: 2026-03-05