
Regression is All You Need

8.28.2025
Dhruv Pai

Summary

There are many derivations of attention, but my favorite comes from nonparametric regression. In this vignette, we show how the Nadaraya-Watson (NW) kernel regressor leads directly to softmax attention under mild assumptions, and pose a few interesting follow-up questions from a kernel smoothing perspective. 1

Back to Basics

To derive attention, we turn to a more classical problem: regression. The objective in vector-valued regression is to fit a function that maps input $x$ vectors to response $y$ vectors. Once fit, this function can be queried at new input vectors to predict new response vectors.

To fit our function, we will impose some prior about its form. Let's assume that our model is constant, or formally $f(x) = \beta_0$. In other words, for every input we query, the response should be a constant.

The least-squares problem is then:

$$\hat{\beta}_0 = \operatorname{arg\,min}_\beta \sum_{i=1}^{n} (y_i - \beta)^2$$

This is solved trivially by $\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^{n} y_i$, the mean response value. This is not a particularly interesting regression - and indeed a relatively poor fit, as shown below. For most points, a fit in the local region is much more successful than the global fit.

[Figure: Constant vs. KNN] Constant regression does significantly worse than a naive local KNN.
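As a quick illustration of the gap, here is a toy NumPy comparison of the global constant fit against a local k-NN average; the data, noise level, and choice of $k$ are illustrative assumptions, not the setup behind the figure.

```python
# Toy comparison: global constant least-squares fit vs. a naive local k-NN average.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)      # noisy scalar responses

# Global constant fit: the least-squares solution is just the mean response.
const_pred = np.full_like(y, y.mean())

# Local fit: average the k nearest neighbors of each query point.
def knn_predict(x_train, y_train, x_query, k=10):
    dists = np.abs(x_train[None, :] - x_query[:, None])    # (n_query, n_train)
    idx = np.argsort(dists, axis=1)[:, :k]                 # k nearest per query
    return y_train[idx].mean(axis=1)

knn_pred = knn_predict(x, y, x, k=10)
print("constant fit MSE:", np.mean((y - const_pred) ** 2))
print("k-NN fit MSE:    ", np.mean((y - knn_pred) ** 2))
```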

This motivates us to modify our model slightly, and instead assume our model is locally constant under some weighting scheme. We will employ a kernel $K$ to upweight datapoints close to the queried input and downweight faraway datapoints, making our approach akin to smoothing. We now solve the kernel-weighted least squares problem:

$$\hat{\beta}_0 = \operatorname{arg\,min}_\beta \sum_{i=1}^{n} K_h(x_i, x_0)(y_i - \beta)^2$$

The kernel weight is defined as $K_h(x_i, x_0) = K\left(\frac{\|x_i - x_0\|}{h}\right)$ with $h>0$, for some kernel function $K$. 2

Minimizing with respect to $\beta$ gives

$$\begin{aligned} -2\sum_{i=1}^{n} K_h(x_i, x_0)(y_i-\hat{\beta}_0) &= 0 \\ \Rightarrow \hat{\beta}_0 &= \frac{\sum_{i=1}^{n} K_h(x_i, x_0)\, y_i}{\sum_{i=1}^{n} K_h(x_i, x_0)} \\ &= \boxed{\sum_{i=1}^n \frac{K_h(x_i, x_0)}{\sum_{j=1}^n K_h(x_j, x_0)}\, y_i} \end{aligned}$$

The RHS of the above equation is referred to as the Nadaraya-Watson (NW) estimator[6][7]. The NW estimator enjoys a rich history in econometrics, statistical modeling, and nonparametric regression theory.

The most common choice of kernel is the Gaussian kernel

$$K(u)=e^{-u^2/2} \implies K_h(x_i, x_0)=\exp\left(-\frac{\|x_i - x_0\|^2}{2h^2}\right)$$

which has the virtue of being isotropic, continuous, and simple.
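For concreteness, here is a minimal NumPy sketch of the Gaussian-kernel NW estimator for vector-valued responses; the toy data, bandwidth, and function name are illustrative choices, not part of the derivation.

```python
# Nadaraya-Watson regression with a Gaussian kernel (vector-valued responses).
import numpy as np

def nw_gaussian(x_train, y_train, x_query, h=0.5):
    """x_train: (n, d_x), y_train: (n, d_y), x_query: (m, d_x); returns (m, d_y)."""
    # Pairwise squared distances between query points and training inputs.
    sq_dists = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(-1)   # (m, n)
    weights = np.exp(-sq_dists / (2 * h ** 2))                              # K_h(x_i, x_0)
    weights /= weights.sum(axis=1, keepdims=True)                           # normalize over i
    return weights @ y_train                                                # weighted average of responses

# Usage: fit a noisy vector-valued curve and query it at new inputs.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(300, 1))
y = np.hstack([np.sin(2 * x), np.cos(3 * x)]) + 0.05 * rng.standard_normal((300, 2))
x_new = np.linspace(-2, 2, 50)[:, None]
print(nw_gaussian(x, y, x_new, h=0.3).shape)   # (50, 2)
```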

Now we make the connection to attention explicit by changing our notation. Let us denote our input vectors as keys $k$ and our response vectors as values $v$. We want to query our regression problem at a query point $q$, also a vector living in key space.

To connect with modern attention mechanisms, we make two additional assumptions. First, we employ QK-norm, whereby the queries and keys are normalized to unit norm. Second, we set the kernel bandwidth to the square root of the temperature, $h=\sqrt{\tau}$.

With these substitutions, our kernel becomes:

$$\begin{aligned} K_\tau(q,k) &= \exp\left(-\frac{\|q-k\|_2^2}{2\tau}\right) \\ &= \exp\left(-\frac{2-2qk^T}{2\tau}\right) \quad \text{(unit-norm } q, k\text{)} \\ &= \exp\left(\frac{qk^T-1}{\tau}\right) \\ &= \exp\left(-\frac{1}{\tau}\right) \exp\left(\frac{qk^T}{\tau}\right) \\ &= \alpha\, \exp\left(\frac{qk^T}{\tau}\right) \end{aligned}$$

Plugging this into the NW estimator, we obtain a familiar culprit.

$$\boxed{f(q) = \sum_{i=1}^n \frac{\exp\left(\frac{qk_i^T}{\tau}\right)}{\sum_{j=1}^n \exp\left(\frac{qk_j^T}{\tau}\right)}\, v_i = \mathrm{softmax}\!\left(\frac{QK^T}{\tau}\right)V}$$
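As a sanity check, here is a small numerical sketch (the shapes and temperature are illustrative) confirming that Gaussian NW over unit-norm keys, with bandwidth $h=\sqrt{\tau}$, reproduces softmax attention with temperature $\tau$: the constant $\alpha$ cancels in the normalization.

```python
# Check: Gaussian NW on unit-norm queries/keys == softmax attention with temperature tau.
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v, tau = 16, 8, 4, 1.0

K = rng.standard_normal((n, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)          # QK-norm: unit-norm keys
V = rng.standard_normal((n, d_v))
q = rng.standard_normal(d_k)
q /= np.linalg.norm(q)                                  # unit-norm query

# Nadaraya-Watson estimate with Gaussian kernel, bandwidth h = sqrt(tau).
w_nw = np.exp(-np.sum((q - K) ** 2, axis=1) / (2 * tau))
nw_out = (w_nw / w_nw.sum()) @ V

# Softmax attention with temperature tau (numerically stabilized).
logits = K @ q / tau
a = np.exp(logits - logits.max())
attn_out = (a / a.sum()) @ V

print(np.allclose(nw_out, attn_out))   # True
```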

Thus, we have shown that attention arises as the simple solution to an even simpler nonparametric kernel regression problem - a fascinating result, given how intuitively it falls out of this classical setting. Gaussian-kernel NW regression has been employed for decades in econometrics and related disciplines - truly nothing in ML is ever "new".

It's all you need!

Despite being a simple model, Gaussian NW is quite powerful for fitting arbitrary vector-valued functions: local kernel fits can approximate complex nonlinear functions, with some examples shown below. This flexibility is precisely what makes attention powerful.

[Figure: Gaussian NW 2D] Gaussian NW can approximate a wide variety of functional forms, including sinusoidal, polynomial, and even piecewise.

[Figure: Contour] Gaussian NW fitting a complex 3D surface. Error stabilizes after relatively few samples.

However, attention has some difficulty when the distribution of keys/values suddenly shifts. The old regression points are still embedded in the landscape, and the kernel smoothing weights them accordingly.

Sliding window and positional encoding both help to overcome the issue, by offering a natural recency bias. Here's an example of NW under a distribution shift, with and without sliding window:

[Figure: Fixed NW] Gaussian NW fitting a spline surface under in-context distribution shifts. Old keys/values interfere with the fit.

[Figure: Sliding NW] Sliding-window Gaussian NW is more robust to distribution shifts, at the cost of poorer expressivity.

As seen above, sliding window is worse at fitting the complex surface but is significantly more robust to distribution shifts. It's intriguing to think about the kinds of complex, high-dimensional surfaces transformers could be fitting in-context and how different architectural choices may impact the learned geometry. As we saw in Sparsity is Cool, the emergent geometry of key manifolds is strongly influenced by the choice of sequence mixer and key parametrization.
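For concreteness, here is a minimal sketch of the sliding-window variant: at each step the Gaussian NW estimate only sees the most recent keys and values. The window size, bandwidth, and shapes are illustrative assumptions, not the settings behind the figures above.

```python
# Sliding-window Nadaraya-Watson: restrict the estimator to the last `window` key/value pairs.
import numpy as np

def sliding_nw(keys, values, query, t, window=64, tau=1.0):
    """Gaussian NW estimate at step t using only keys/values from the trailing window."""
    lo = max(0, t - window + 1)
    K, V = keys[lo:t + 1], values[lo:t + 1]
    w = np.exp(-np.sum((query - K) ** 2, axis=1) / (2 * tau))   # Gaussian kernel weights
    return (w / w.sum()) @ V                                     # NW estimate over the window
```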

Kernels on kernels

We may ask what functional forms emerge when we apply other nonparametric kernel smoothers. Another common choice is the Epanechnikov kernel[7], which simplifies as follows under the unit norm assumption.

$$K_\tau(q,k)=\max\left(0,\ 1-\frac{\|q-k\|_2^2}{\tau}\right) = \max\left(0,\ 1-\frac{2}{\tau}+\frac{2}{\tau}qk^T\right)$$

The Epanechnikov kernel is often referred to as the optimal kernel in kernel density estimation (KDE) theory because it is proven to be asymptotically MSE-optimal among second-order kernels. Epanechnikov attention (NW with the Epanechnikov kernel) has a number of favorable properties, including:

  • Compact support - a key only contributes if its similarity is above a threshold, allowing higher selectivity. This can reduce interference, such as the GQA noise we identified in Sparsity is Cool. The kernel also naturally introduces sparsity in the attention map. Furthermore, the compact support reduces the effect of outliers.
  • Stable attention logits - by doing away with $\exp$, we avoid exploding attention logits and the numerical underflow/overflow issues that come with exponentials.

We can simplify the kernel by constraining the domain of the bandwidth. For the kernel argument to be nonnegative for every pair, we need $1-\frac{\|q-k\|_2^2}{\tau} \geq 0$; since $\|q-k\|_2^2 \leq 4$ for unit-norm vectors, this holds whenever $\tau \geq 4$. If the temperature is above this threshold, the $\max$ never activates and Epanechnikov attention is simply:

$$\begin{aligned} f(q) &= \sum_{i=1}^n \frac{1-\frac{2}{\tau}+\frac{2}{\tau}qk_i^T}{\sum_{j=1}^n \left(1-\frac{2}{\tau}+\frac{2}{\tau}qk_j^T\right)}\, v_i \\ &= \sum_{i=1}^n \frac{\frac{\tau}{2}-1+qk_i^T}{n\left(\frac{\tau}{2}-1\right)+\sum_{j=1}^n qk_j^T}\, v_i \\ &= \frac{1}{n\left(\frac{\tau}{2}-1\right)+\sum_{j=1}^n qk_j^T}\sum_{i=1}^n \left(\left(\frac{\tau}{2}-1\right)v_i + (qk_i^T)\,v_i\right) \\ &= \frac{1}{Z}\left(\left(\frac{\tau}{2}-1\right)\sum_{i=1}^n v_i + \sum_{i=1}^n (v_ik_i^T)\,q\right) \end{aligned}$$

A linear attention variant! After all, by eliminating the $\max$ we have removed the nonlinearity.
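Before writing this as a recurrence, here is a minimal sketch of Epanechnikov attention computed directly from the NW weights; the unit-norm toy data and $\tau = 4$ are illustrative assumptions.

```python
# Epanechnikov attention: NW regression with the Epanechnikov kernel on unit-norm q/k.
import numpy as np

def epanechnikov_attention(q, K, V, tau=4.0):
    """q: (d_k,) unit-norm query; K: (n, d_k) unit-norm keys; V: (n, d_v) values."""
    sims = K @ q                                          # q k_i^T
    w = np.maximum(0.0, 1 - 2 / tau + (2 / tau) * sims)   # kernel weights; max is inactive for tau >= 4
    return (w / w.sum()) @ V                              # NW estimate

rng = np.random.default_rng(0)
K = rng.standard_normal((32, 8)); K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((32, 4))
q = rng.standard_normal(8); q /= np.linalg.norm(q)
print(epanechnikov_attention(q, K, V).shape)   # (4,)
```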

Indeed, if we had instead defined the feature map

$$\phi(x) = \begin{bmatrix} \sqrt{1-\frac{2}{\tau}} \\ \sqrt{\frac{2}{\tau}}\,x \end{bmatrix} \in \mathbb{R}^{1+d_k} \quad\Rightarrow\quad K_\tau(q,k) = \phi(q)^\top \phi(k)$$

then we recover the standard linear attention recurrence:

$$\begin{aligned} S_t &= S_{t-1} + \phi(k_t)v_t^\top, \quad S_0=0 \\ S_t &= \sum_{i \le t} \phi(k_i)v_i^\top \in \mathbb{R}^{(1+d_k)\times d_v} \\ y_t &= \phi(q_t)^\top S_t \end{aligned}$$

The Epanechnikov feature map is a special first-order case of the polynomial feature map, which has been developed and applied previously in the linear attention literature[8][9][10]. Under the bandwidth constraint, Epanechnikov attention admits a slick linear form with a known feature map.
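Here is a sketch of that recurrence using the feature map $\phi$, carrying the NW normalizer $Z_t$ alongside the state (the normalizer is the $Z$ from the derivation above; the toy shapes are illustrative). Under the $\tau \geq 4$ assumption it matches the direct Epanechnikov computation.

```python
# Epanechnikov attention as a linear-attention recurrence with feature map phi.
import numpy as np

def phi(x, tau):
    # phi(x) = [sqrt(1 - 2/tau), sqrt(2/tau) * x], so phi(q) . phi(k) = 1 - 2/tau + (2/tau) q.k
    return np.concatenate([[np.sqrt(1 - 2 / tau)], np.sqrt(2 / tau) * x])

def epanechnikov_linear(Q, K, V, tau=4.0):
    """Causal recurrence: S_t = sum_{i<=t} phi(k_i) v_i^T, z_t = sum_{i<=t} phi(k_i)."""
    S = np.zeros((K.shape[1] + 1, V.shape[1]))   # running sum of phi(k_i) v_i^T
    z = np.zeros(K.shape[1] + 1)                 # running sum of phi(k_i), for the normalizer Z_t
    outputs = []
    for q_t, k_t, v_t in zip(Q, K, V):
        fk = phi(k_t, tau)
        S += np.outer(fk, v_t)
        z += fk
        fq = phi(q_t, tau)
        outputs.append(fq @ S / (fq @ z))        # y_t = phi(q_t)^T S_t / Z_t
    return np.array(outputs)

# Check against the direct causal Epanechnikov NW computation.
rng = np.random.default_rng(0)
n, d_k, d_v, tau = 16, 8, 4, 4.0
Q = rng.standard_normal((n, d_k)); Q /= np.linalg.norm(Q, axis=1, keepdims=True)
K = rng.standard_normal((n, d_k)); K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((n, d_v))

direct = []
for t in range(n):
    w = np.maximum(0.0, 1 - 2 / tau + (2 / tau) * (K[:t + 1] @ Q[t]))
    direct.append((w / w.sum()) @ V[:t + 1])
print(np.allclose(epanechnikov_linear(Q, K, V, tau), np.array(direct)))   # True
```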

Takeaways

A broad class of test-time learning algorithms can be expressed as vector-valued regressors on key-value space; the main difference is that the entire regression problem is programmed in-context. This general framework allows for the fruitful cross-pollination of concepts from elementary regression theory into the development of hardware-aligned sequence mixers.

There is a rich literature on the perspective of test-time learning with in-context learned key-value maps[2]. Test-time learners can be described simply by a choice of loss function, optimizer, and regularization (e.g. here we have the kernel-weighted least squares loss, an analytical solver, and no regularization respectively)[3].

References

  1. Sun, Yu and Li, Xinhao and Dalal, Karan and Xu, Jiarui and Vikram, Arjun and Zhang, Genghan and Dubois, Yann and Chen, Xinlei and Wang, Xiaolong and Koyejo, Sanmi and Hashimoto, Tatsunori and Guestrin, Carlos (2024).
  2. Behrouz, Ali and Razaviyayn, Meisam and Zhong, Peilin and Mirrokni, Vahab (2025).
  3. Tsai, Yao-Hung Hubert and Bai, Shaojie and Yamada, Makoto and Morency, Louis-Philippe and Salakhutdinov, Ruslan (2019).
  4. Choromanski, Krzysztof and Likhosherstov, Valerii and Dohan, David and Song, Xingyou and Gane, Andreea and Sarlós, Tamás and Hawkins, Peter and Davis, Jared and Mohiuddin, Afroz and Kaiser, Lukasz and others (2021).
  5. Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, François (2020).
  6. Nadaraya, E. A. (1964).
  7. Watson, G. S. (1964). Smooth Regression Analysis.
  8. Arora, Simran and Eyuboglu, Sabri and Zhang, Michael and Timalsina, Aman and Alberti, Silas and Zou, James and Rudra, Atri and Re, Christopher (2024).
  9. Kacham, Praneeth and Mirrokni, Vahab and Zhong, Peilin (2024).
  10. Nauen, Tobias Christian and Palacio, Sebastian and Dengel, Andreas (2024).

Footnotes

  1. The connection of the Nadaraya-Watson estimator with attention was first observed in Sun et al.[1], though the notion of attention as kernel smoothing existed beforehand and has been studied quite extensively[4][5][6].

  2. $h$ is referred to as the kernel bandwidth in the regression literature.