Universal Approximation and Optimization Theory for Multi-Head Self-Attention: Theoretical Foundations and Scaling Laws

Name: Universal Approximation and Optimization Theory for Multi-Head Self-Attention: Theoretical Foundations and Scaling Laws
Creator: figshare
Published: 2025-07-28 17:44:01
License: No description available

DataCite Commons: updated 2025-07-28, indexed 2025-09-08

Download link:

https://figshare.com/articles/dataset/Universal_Approximation_and_Optimization_Theory_for_Multi-Head_Self-Attention_Theoretical_Foundations_and_Scaling_Laws/29658509

Description:

We have established comprehensive theoretical foundations for understanding multi-head self-attention mechanisms through universal approximation theory and convex optimization. Our analysis reveals that self-attention networks can approximate continuous functions on compact sets under specific conditions, potentially with favorable parameter efficiency compared to feed-forward networks. We derive novel generalization bounds using PAC-Bayesian analysis and provide rigorous theoretical foundations for understanding empirically observed scaling laws. Our results offer theoretical explanations for emergent phenomena including in-context learning capabilities through information-theoretic analysis. These theoretical foundations provide actionable insights for architecture design, training procedures, and model scaling decisions.

Index Terms: transformer theory, self-attention, universal approximation, optimization theory, scaling laws, generalization bounds

Provider:

figshare

Created:

2025-07-28
