Universal Approximation and Optimization Theory for Multi-Head Self-Attention: Theoretical Foundations and Scaling Laws
Source: DataCite Commons | Updated: 2025-07-28 | Indexed: 2025-09-08
Download link:
https://figshare.com/articles/dataset/Universal_Approximation_and_Optimization_Theory_for_Multi-Head_Self-Attention_Theoretical_Foundations_and_Scaling_Laws/29658509
Description:
We establish theoretical foundations for understanding multi-head self-attention mechanisms through universal approximation theory and convex optimization. Our analysis shows that self-attention networks can approximate continuous functions on compact sets under specific conditions, potentially with better parameter efficiency than feed-forward networks. We derive novel generalization bounds using PAC-Bayesian analysis and provide a rigorous theoretical basis for empirically observed scaling laws. Our results also offer theoretical explanations, through information-theoretic analysis, for emergent phenomena such as in-context learning. These foundations provide actionable insights for architecture design, training procedures, and model scaling decisions.
Index Terms: transformer theory, self-attention, universal approximation, optimization theory, scaling laws, generalization bounds
Provider:
figshare
Created:
2025-07-28



