Generalization Bound for a Shallow Transformer Trained Using Gradient Descent

Dasgupta, Anirban

Source

Transactions on Machine Learning Research

Date Issued

2026-01-01

Author(s)

Mwigo, Brian

Dasgupta, Anirban

Volume

2026 December

Abstract

In this work, we establish a norm-based generalization bound for a shallow Transformer model trained via gradient descent under the bounded-drift (lazy training) regime, where model parameters remain close to their initialization throughout training. Our analysis proceeds in three stages: (a) we formally define a hypothesis class of Transformer models constrained to remain within a small neighborhood of their initialization; (b) we derive an upper bound on the Rademacher complexity of this class, quantifying its effective capacity; and (c) we establish an upper bound on the empirical loss achieved by gradient descent under suitable assumptions on model width, learning rate, and data structure. Combining these results, we obtain a high-probability bound on the true loss that decays sublinearly with the number of training samples N and depends explicitly on model and data parameters. The resulting bound demonstrates that, in the lazy regime, wide and shallow Transformers generalize similarly to their linearized (NTK) counterparts. Empirical evaluations on both text and image datasets support the theoretical findings.
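
The three stages combine through the standard Rademacher-complexity argument, sketched below. This is the textbook form of such a bound, not the paper's theorem: the symbols $\theta_0$, $R$, $\mathcal{F}_R$, $\delta$, $\varepsilon_{\mathrm{opt}}$, the choice of norm, and the assumption that the loss $\ell$ takes values in $[0,1]$ are all illustrative and do not appear in the abstract.

```latex
% (a) Hypothetical bounded-drift class: parameters within radius R of initialization
\mathcal{F}_R = \bigl\{\, f_\theta \;:\; \|\theta - \theta_0\| \le R \,\bigr\}

% (b)+(c) Standard uniform bound for a loss \ell with values in [0,1]:
% with probability at least 1 - \delta over N i.i.d. samples, for all f in F_R,
L(f) \;\le\; \widehat{L}_N(f)
      \;+\; 2\,\mathfrak{R}_N(\ell \circ \mathcal{F}_R)
      \;+\; \sqrt{\frac{\log(1/\delta)}{2N}}

% If the gradient-descent iterate \theta_T stays in F_R and attains
% empirical loss at most \varepsilon_opt, the same event gives
L(f_{\theta_T}) \;\le\; \varepsilon_{\mathrm{opt}}
      \;+\; 2\,\mathfrak{R}_N(\ell \circ \mathcal{F}_R)
      \;+\; \sqrt{\frac{\log(1/\delta)}{2N}}
```

Here $L$ is the true (population) loss, $\widehat{L}_N$ the empirical loss on $N$ samples, and $\mathfrak{R}_N$ the Rademacher complexity of the loss class. The paper's own bound states the dependence on model and data parameters explicitly; the sketch only shows how a capacity term and an optimization term add up to a bound on the true loss.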

URI

https://repository.iitgn.ac.in/handle/IITG2025/34656