
Gram-Space Manifold Muon

10.13.2025
Ben Keigwin*,  Dhruv Pai,  Nathan Chen
* Core Contributor; Correspondence to ben@tilderesearch.com

Summary

Recently, Bernstein and Thinking Machines introduced a Muon variant that constrains weights to the Stiefel manifold. We recap that construction and propose two variants derived by relaxing the Gram-matrix constraint in two different ways. We argue that designing many manifold-constrained optimizers is naturally phrased in terms of the Gram matrix.

[Animation: manifold geometry and tangent flow]

Introduction

Modern first-order optimizers can let weight matrices drift into poorly conditioned regimes, amplifying gradient noise and forcing conservative step sizes. Muon mitigates this by orthogonalizing updates, but the weights themselves remain unconstrained. Manifold Muon takes the next step by constraining each linear layer to a geometry—e.g., the Stiefel manifold—so singular values of both the update matrix and the weight matrix are controlled by construction.

In this post, we explore whether this strict constraint can be relaxed. We begin by recapping the original manifold Muon optimizer, then derive two variants by selectively relaxing the constraints on the weight matrix's Gram matrix $W^\top W$. Finally, we propose a single, unified framework for designing Gram-space optimizers.

We show how this framework can systematically generate a family of related manifolds—including the Stiefel manifold and its simple relaxations—which inherit the same efficient dual solution and fast $\mathrm{msign}$ computation.

(Manifold) Muon Recap

Muon[1][2] orthogonalizes weight updates to keep linear layers well-conditioned. Recently, Jeremy Bernstein and Thinking Machines released an insightful post[3] in which they introduce a version of the Muon optimizer where the weights of the network are constrained to the Stiefel manifold:

$$\text{Stiefel}(m,n) = \lbrace W \in \mathbb{R}^{m \times n} \mid W^\top W = I_n \rbrace.$$

Given a gradient matrix $G$, "manifold Muon" solves

$$\min_{A \in \mathbb{R}^{m \times n}} \operatorname{tr}(G^\top A) \quad \text{s.t.}\quad \|A\|_{\text{spectral}} \le \eta,\qquad A^\top W + W^\top A = 0, \tag{1}$$

where the spectral-norm bound controls the update size and the second constraint forces $A$ to lie in the tangent space

$$T_W \text{Stiefel}(m,n) = \lbrace A \in \mathbb{R}^{m \times n} \mid A^\top W + W^\top A = 0 \rbrace.$$

Note in particular that both the Stiefel constraint and the tangent constraint are natural generalizations of the hyperspherical constraint and its tangent constraint.

As a smooth manifold, $\text{Stiefel}(m,n)$ has dimension

$$\dim_{\mathbb{R}}\text{Stiefel}(m,n) = mn-\frac{n(n+1)}{2},$$

and one can think of $\text{Stiefel}(m,n)$ as the set of orthonormal $n$-frames in $\mathbb{R}^m$.

As nicely detailed in [4], to solve problem (1) one introduces a matrix $\Lambda\in \mathbb{R}^{n\times n}$ of Lagrange multipliers and forms the Lagrangian

$$\begin{aligned} \mathcal{L}(A,\Lambda) &= \operatorname{tr}(G^\top A) + \operatorname{tr}\big[\Lambda^\top (A^\top W + W^\top A)\big] \\ &= \langle A,\ G + W(\Lambda + \Lambda^\top)\rangle, \end{aligned}$$

where the angle brackets denote the Frobenius inner product $\langle X, Y \rangle := \operatorname{tr}(X^\top Y)$. Using a minimax swap together with the fact that

$$\operatorname*{arg\,min}_{\|A\|_{\text{spectral}}\le \eta} \mathcal{L}(A,\Lambda) = -\,\eta\,\mathrm{msign}\big(G + W(\Lambda + \Lambda^\top)\big),$$

where $\mathrm{msign}(M)$ denotes the matrix polar factor (if $M = U \Sigma V^\top$ is an SVD, then $\mathrm{msign}(M) := U V^\top$, acting as zero on null singular directions¹), one then shows that (1) admits the dual problem²:

$$\max_{\Lambda}\; -\,\eta\, \|G + W(\Lambda + \Lambda^\top)\|_{\text{nuclear}},$$

where the nuclear norm is the sum of the singular values. This dual problem is solved by gradient ascent, with a subgradient³ of the dual objective given by

$$H_{\text{Stiefel}}(\Lambda) = -\,\eta\,\nabla_{\Lambda}\|G + W(\Lambda + \Lambda^\top)\|_{\text{nuclear}} = -\,\eta\,\big(W^\top Z + Z^\top W\big),$$

where $Z := \mathrm{msign}\big(G + W(\Lambda + \Lambda^\top)\big)$.

A couple of remarks about this derivation:

  • Although we introduce a full matrix of Lagrange multipliers $\Lambda\in \mathbb{R}^{n\times n}$, only $\operatorname{sym}(\Lambda) = (\Lambda+\Lambda^\top)/2$ actually enters the Lagrangian, so it would have been sufficient to take $\Lambda$ symmetric.
  • When formulating the manifold Muon problem, there are two size constraints: an explicit spectral-norm bound on the update matrix $A$, and an implicit constraint on the weight matrix $W$ (namely, that it has unit condition number).

In the next section, we will obtain a different manifold Muon-style optimization problem (though the process of solving it will be remarkably similar to the above). To do so, we will relax the unit condition number requirement.
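To make the recipe concrete, here is a minimal NumPy sketch of one dual-ascent update in the Stiefel case. It is illustrative rather than authoritative: $\mathrm{msign}$ is computed with an explicit SVD for clarity (a practical implementation would use Newton–Schulz or Polar-Express iterations instead), and the dual step size and iteration count are arbitrary placeholders, not tuned values.

```python
import numpy as np

def msign(M):
    """Polar factor U V^T from the SVD M = U S V^T (SVD used for clarity;
    Newton-Schulz / Polar-Express iterations would avoid it in practice)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def stiefel_muon_update(W, G, eta=0.1, dual_lr=0.1, dual_steps=20):
    """Approximately solve problem (1) by gradient ascent on the dual.

    W: (m, n) weights with W^T W = I_n;  G: (m, n) gradient.
    Returns A with ||A||_spectral <= eta and A^T W + W^T A ~= 0.
    """
    n = W.shape[1]
    Lam = np.zeros((n, n))  # Lagrange multipliers; only sym(Lam) matters
    for _ in range(dual_steps):
        Z = msign(G + W @ (Lam + Lam.T))
        # Ascent step along H_Stiefel(Lam) = -eta * (W^T Z + Z^T W)
        Lam += dual_lr * (-eta) * (W.T @ Z + Z.T @ W)
    return -eta * msign(G + W @ (Lam + Lam.T))

# Sanity check at a random point of Stiefel(8, 4)
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 4)))  # W^T W = I_4
A = stiefel_muon_update(W, rng.standard_normal((8, 4)))
print(np.linalg.norm(A.T @ W + W.T @ A))  # tangency residual; shrinks with more dual steps
```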

Manifold Muon in Gram Matrix Space

The Stiefel constraint $W^\top W = I$ is in fact just one constraint that one can put on the Gram matrix of a collection of vectors. Let $w_1,\ldots, w_n$ be vectors in $\mathbb{R}^m$ arranged as the columns of $W$. The Gram matrix is then

$$\text{Gram}(W) = W^\top W = \begin{bmatrix} w_1^\top w_1 & w_1^\top w_2 & \cdots & w_1^\top w_n \\ w_2^\top w_1 & w_2^\top w_2 & \cdots & w_2^\top w_n \\ \vdots & \vdots & \ddots & \vdots \\ w_n^\top w_1 & w_n^\top w_2 & \cdots & w_n^\top w_n \end{bmatrix}.$$

The Gram matrix $\text{Gram}(W)$ encodes the geometry of the column vectors of $W$. Its entries tell us everything about their lengths and the angles between them:

  • The diagonal entries $w_i^\top w_i$ are the squared lengths of each column vector.
  • The off-diagonal entries $w_i^\top w_j$ are the dot products between different columns, measuring their orthogonality.

Using this view, we can equivalently define the Stiefel manifold as the set of (ordered) collections of $n$ vectors in $\mathbb{R}^m$ whose Gram matrix is the $n\times n$ identity matrix.

With this view in mind, we introduce two relaxations of the Stiefel constraint, each obtained by relaxing one aspect of the identity requirement $W^\top W = I$ (minimal sketches of the corresponding projectors follow the list):

  • Diagonal Gram: Require off-diagonal entries to vanish (orthogonality), but allow diagonal entries to vary (non-unit norms).
  • Oblique: Require diagonal entries to equal 1 (unit norms), but allow non-zero off-diagonal entries (non-orthogonality).
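Both relaxations are most easily stated through operators on symmetric matrices. Here is a minimal sketch of the two projectors and the corresponding membership checks (the helper names `off`, `diag_part`, `in_dgram`, and `in_oblique` are ours, not from the original post); both operators are self-adjoint projectors with respect to the Frobenius inner product, which is exactly what the dual derivations below rely on.

```python
import numpy as np

def off(S):
    """Off(S): zero out the diagonal, keep the off-diagonal part."""
    return S - np.diag(np.diag(S))

def diag_part(S):
    """Diag(S): keep the diagonal, zero out the off-diagonal part."""
    return np.diag(np.diag(S))

def in_dgram(W, tol=1e-8):
    """Is W^T W diagonal with strictly positive diagonal entries?"""
    gram = W.T @ W
    return np.linalg.norm(off(gram)) < tol and bool(np.all(np.diag(gram) > 0))

def in_oblique(W, tol=1e-8):
    """Does every column of W have unit norm, i.e. Diag(W^T W) = I?"""
    return np.allclose(np.diag(W.T @ W), 1.0, atol=tol)
```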

Diagonal Gram Manifold Muon

We first consider allowing the Gram matrix $W^\top W$ to be an arbitrary positive diagonal matrix. Define⁴

$$\text{DGram}(m,n) = \lbrace W \in \mathbb{R}^{m \times n} \mid W^\top W = \operatorname{diag}(\lambda_1,\dots,\lambda_n),\ \ \lambda_i > 0 \rbrace.$$

Strict positivity of the $\lambda_i$ is needed for $\text{DGram}(m,n)$ to possess the structure of a smooth manifold. In fact, we have an explicit diffeomorphism

$$\phi:\ \text{Stiefel}(m,n)\times (0,\infty)^n \to \text{DGram}(m,n),\qquad \phi(W,\lambda)= W\,\operatorname{diag}(\sqrt{\lambda}),$$

which is well-defined since $\phi(W,\lambda)^\top \phi(W,\lambda) = \operatorname{diag}(\sqrt{\lambda})\,W^\top W\,\operatorname{diag}(\sqrt{\lambda}) = \operatorname{diag}(\lambda)$.

Moreover, as

$$\dim_{\mathbb{R}}\text{Stiefel}(m,n)= mn - \frac{1}{2}n(n+1) \quad \text{and} \quad \dim_{\mathbb{R}} (0,\infty)^n=n,$$

we obtain

$$\dim_{\mathbb{R}} \text{DGram}(m,n) = mn - \frac{1}{2}n(n+1) + n = mn - \frac{1}{2}n(n-1),$$

so we gain $n$ additional degrees of freedom relative to $\text{Stiefel}(m,n)$.

One can then show that the tangent space at $W$ is

$$T_W \text{DGram}(m,n) = \lbrace A \in \mathbb{R}^{m \times n} \mid \operatorname{Off}(A^\top W + W^\top A) = 0 \rbrace,$$

where $\operatorname{Off}(\cdot)$ is the operator that projects onto the off-diagonal part of a matrix (zeroing the diagonal). In other words, $A^\top W + W^\top A$ must be diagonal.

For gradient $G$ and Lagrange multiplier $\Lambda$, the analogous Lagrangian is then

$$\begin{aligned} \mathcal{L}(A,\Lambda) &= \operatorname{tr}(G^\top A) + \operatorname{tr}\Big[\Lambda^\top\,\operatorname{Off}\big(A^\top W + W^\top A\big)\Big] \\ &= \langle G, A\rangle + \langle \Lambda,\ \operatorname{Off}(A^\top W + W^\top A)\rangle \\ &= \langle A,\ G + 2W\,\operatorname{Off}(\operatorname{sym}(\Lambda))\rangle. \end{aligned}$$

Note that by self-adjointness of the $\operatorname{Off}(\cdot)$ operator, it suffices to consider only $\Lambda$ symmetric with zero diagonal.

As in the Stiefel case, one then proceeds by solving the analogous dual formulation

$$\max_{\Lambda}\; -\,\eta\, \|G + 2W\,\operatorname{Off}(\operatorname{sym}(\Lambda))\|_{\text{nuclear}},$$

and one obtains the subgradient

$$H_{\mathrm{DGram}}(\Lambda) = -\eta\,\operatorname{Off}\big(W^\top Z + Z^\top W\big),\qquad Z := \mathrm{msign}\big(G+2W\,\operatorname{Off}(\operatorname{sym}(\Lambda))\big).$$

Note the similarity in form to the original manifold Muon solution.
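Concretely, only the projector changes relative to the Stiefel loop sketched earlier. A hedged sketch, reusing `msign` and `off` from the previous snippets (hyperparameters are again placeholders):

```python
def dgram_muon_update(W, G, eta=0.1, dual_lr=0.1, dual_steps=20):
    """Dual ascent for DGram-Muon: the Stiefel loop with Off(...) inserted."""
    n = W.shape[1]
    Lam = np.zeros((n, n))
    sym = lambda M: 0.5 * (M + M.T)
    for _ in range(dual_steps):
        Z = msign(G + 2 * W @ off(sym(Lam)))
        Lam += dual_lr * (-eta) * off(W.T @ Z + Z.T @ W)
    return -eta * msign(G + 2 * W @ off(sym(Lam)))
```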

Oblique Gram Manifold Muon

Alternatively, we relax the orthogonality requirement while maintaining unit norms. Define

$$\text{Oblique}(m,n) = \lbrace W \in \mathbb{R}^{m \times n} \mid \operatorname{Diag}(W^\top W) = I_n \rbrace,$$

where "oblique" comes from the fact that columns have unit norm but can be mutually oblique (not orthogonal).

In fact, $\text{Oblique}(m,n)$ is diffeomorphic to $(S^{m-1})^n$ by simply mapping each column to a point in $S^{m-1}$, and hence the manifold has dimension $(m-1)n$. In this case, the tangent space is given by

$$T_W \text{Oblique}(m,n) = \lbrace A \in \mathbb{R}^{m \times n} \mid \operatorname{Diag}(A^\top W + W^\top A) = 0 \rbrace.$$

The corresponding Lagrangian is then

$$\begin{aligned} \mathcal{L}(A,\Lambda) &= \operatorname{tr}(G^\top A) + \operatorname{tr}\Big[\Lambda^\top\,\operatorname{Diag}\big(A^\top W + W^\top A\big)\Big] \\ &= \langle A,\ G + 2W\,\operatorname{Diag}(\Lambda)\rangle. \end{aligned}$$

The dual problem takes the same form:

$$\max_{\Lambda}\; -\,\eta\, \|G + 2W\,\operatorname{Diag}(\Lambda)\|_{\text{nuclear}},$$

and we obtain the subgradient

$$H_{\mathrm{Oblique}}(\Lambda) = -\eta\,\operatorname{Diag}\big(W^\top Z + Z^\top W\big),\qquad Z := \mathrm{msign}\big(G + 2W\,\operatorname{Diag}(\Lambda)\big).$$
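As the identical structure suggests, the corresponding sketch differs from the DGram one only in which projector is applied (again reusing the helpers defined above; since $\operatorname{Diag}(\operatorname{sym}(\Lambda)) = \operatorname{Diag}(\Lambda)$, no explicit symmetrization is needed):

```python
def oblique_muon_update(W, G, eta=0.1, dual_lr=0.1, dual_steps=20):
    """Dual ascent for Oblique-Muon: same loop with Diag(...) in place of Off(...)."""
    n = W.shape[1]
    Lam = np.zeros((n, n))
    for _ in range(dual_steps):
        Z = msign(G + 2 * W @ diag_part(Lam))
        Lam += dual_lr * (-eta) * diag_part(W.T @ Z + Z.T @ W)
    return -eta * msign(G + 2 * W @ diag_part(Lam))
```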

[Animation: evolution of the singular-value spectra across all layers]

A Unifying Theme

In fact, each of the three examples above is an instance of the following more general setup.

Let $\text{Sym}_n$ denote the space of $n\times n$ symmetric matrices, let

$$P: \text{Sym}_n\to \text{Sym}_n$$

be a self-adjoint linear projector (i.e., $P^2=P$ and $P^\top=P$), and let $\mathcal{C}\subseteq\text{Sym}_n$ be some class of symmetric matrices.

Given $W\in\mathbb{R}^{m\times n}$, $\text{Gram}(W)$ lies in $\text{Sym}_n$, and hence $P(\text{Gram}(W))$ lies in $\text{Sym}_n$ as well.

Each of the constraint sets above is then a special case of the following family of manifolds:

$$\mathcal{M}_{P,\mathcal{C}} := \lbrace W\in\mathbb{R}^{m\times n} \mid P(W^\top W - C)=0 \text{ for some } C\in\mathcal{C} \rbrace.$$

Some observations:

  • It is sufficient to consider $P$ to have domain $\text{Sym}_n$, since we could otherwise just precompose with the $\operatorname{sym}$ map.
  • For any symmetric $X$ and $n\times n$ matrix $M$, we have $\langle M, X\rangle=\langle\operatorname{sym}(M), X\rangle$, since $\langle\operatorname{skew}(M), X\rangle=0$ for any $M$.

Moreover, each component of the above problem has a completely general form⁵:

Tangent space:

$$T_W \mathcal{M}_{P,\mathcal{C}} = \lbrace A \mid P(A^\top W + W^\top A)=0 \rbrace$$

Lagrangian:

$$\mathcal{L}(A,\Lambda) = \langle A,\ G + 2W\,P(\operatorname{sym}(\Lambda))\rangle$$

Dual problem:

$$\max_{\Lambda}\; -\,\eta\,\|G + 2W\,P(\operatorname{sym}(\Lambda))\|_{\text{nuclear}}$$

Subgradient:

$$H(\Lambda) = -\eta\,P\big(W^\top Z + Z^\top W\big),\quad\text{where}\quad Z:=\mathrm{msign}\big(G + 2W\,P(\operatorname{sym}(\Lambda))\big)$$

The derivation shows that for any self-adjoint projector $P$, the inner minimization over the spectral-norm ball always yields the same nuclear-norm dual with subgradient $Z = \mathrm{msign}(\cdot)$ (the polar factor), so the exact same Newton–Schulz/Polar-Express routine applies—there is no new solver to invent for each choice of $P$.
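For completeness, here is a sketch of an SVD-free `msign` via a quintic Newton–Schulz iteration. The coefficients below are taken from a widely used Muon implementation and are tuned for speed rather than exact polar convergence (singular values land near, not exactly at, 1), so treat them as one reasonable choice among several.

```python
def msign_newton_schulz(M, steps=10, eps=1e-7):
    """Approximate msign(M) without an SVD.

    Quintic Newton-Schulz iteration on X with ||X||_spectral <= 1; each
    step pushes the singular values of X toward 1 while preserving the
    singular vectors.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + eps)  # Frobenius norm bounds the spectral norm
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the short side so X @ X.T stays small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```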

For example, the three instances above correspond to specifying $(P,\mathcal{C})$ as follows (a parameterized sketch follows the list):

  • Stiefel: $P=\mathrm{Id}$, $\mathcal{C}=\lbrace I\rbrace$.
  • Diagonal-Gram: $P=\operatorname{Off}$, $\mathcal{C}=\lbrace 0\rbrace$.
  • Oblique: $P=\operatorname{Diag}$, $\mathcal{C}=\lbrace I\rbrace$.
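In code, the whole family collapses to a single routine taking the projector $P$ as an argument; the per-manifold updates sketched earlier are recovered by passing the identity, `off`, or `diag_part`. Again a hedged sketch reusing the earlier helpers:

```python
def gram_muon_update(W, G, P, eta=0.1, dual_lr=0.1, dual_steps=20):
    """Generic dual ascent for M_{P,C}: swapping P swaps the geometry,
    while the nuclear-norm dual and the msign solver stay identical."""
    n = W.shape[1]
    Lam = np.zeros((n, n))
    sym = lambda M: 0.5 * (M + M.T)
    for _ in range(dual_steps):
        Z = msign(G + 2 * W @ P(sym(Lam)))
        Lam += dual_lr * (-eta) * P(W.T @ Z + Z.T @ W)
    return -eta * msign(G + 2 * W @ P(sym(Lam)))

# The three instances: only (P, C) changes, never the solver.
stiefel_step = lambda W, G: gram_muon_update(W, G, P=lambda S: S)  # C = {I}
dgram_step   = lambda W, G: gram_muon_update(W, G, P=off)          # C = {0}
oblique_step = lambda W, G: gram_muon_update(W, G, P=diag_part)    # C = {I}
```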

In particular, we suggest that it may be easier to search for the ideal constraint manifold by instead contemplating what the correct $\mathcal{M}_{P,\mathcal{C}}$ is.

Results

We follow the experiment from the modular manifolds post[3]. We compare four optimizers—Adam[5], Stiefel–Muon, DGram–Muon, and Oblique–Muon—sweeping learning rates over {1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1, 1e1, 1e2}.

The figures below summarize train/test accuracy and show the singular-value spectra of the best run per method.

Figure 1: SVD statistics across optimizers.

Figure 2: Train accuracy across learning rates.

Figure 3: Test accuracy across learning rates.

The results reveal an interesting spectrum of conditioning behavior. As expected, Stiefel–Muon maintains condition number 1 by construction. The DGram and Oblique variants exhibit more spread in their singular value distributions, yet still maintain substantially better conditioning than Adam.

It would be very interesting to see whether this result holds at scale. That is, it is not obvious to us that one always wants all of the singular values of a weight matrix $W$ to be 1, as is the case for the Stiefel manifold. Adam likely induces far too much variance in the singular-value distribution, but allowing slightly more "wiggle room" centered about 1 seems plausible: it could give the model a bit more freedom to privilege certain directions while still maintaining good conditioning. For instance, one could even add a "weight decay"-like term that penalizes deviation from 1, as sketched below.
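As one concrete but untested way to realize that idea in the Gram-space notation above: instead of hard-constraining $P(W^\top W)$ to equal some $C$, add a soft penalty of strength $\beta$,

$$R(W) = \frac{\beta}{2}\,\big\|P(W^\top W - C)\big\|_F^2, \qquad \nabla_W R(W) = 2\beta\, W\, P(W^\top W - C).$$

For $P = \mathrm{Id}$ and $C = I$ this recovers the familiar soft-orthogonality regularizer $2\beta\,W(W^\top W - I)$, which penalizes the squared singular values for drifting away from 1.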

[Animation: loss landscape evolution]

Probing the loss landscape of our MLP over time through random projections suggests a similar picture. Unlike Adam, DGram and Oblique seem to find smoother, more stable valleys in the loss landscape; compared with the more constrained Stiefel–Muon, they descend into the minima faster. Slightly weakening the orthonormality constraint offers a strong balance of regularization (to find flatter basins) and power (to converge quickly once a basin is found).

Takeaways

Optimizer design involves two complementary choices: selecting the right geometry for gradient descent (via an appropriate norm), and choosing the right manifold to constrain weights to.

We showed that the Stiefel Muon optimization problem extends to a broader family of manifolds $\mathcal{M}_{P,\mathcal{C}}$, parameterized by self-adjoint projectors $P$ on the space of symmetric matrices and constraint sets $\mathcal{C}$. The solution method—passing to a dual problem involving the nuclear norm—remains structurally identical across this family.

This suggests that manifold selection might be more naturally approached through Gram-space design: rather than directly specifying geometric constraints on $W$, we can work in the space of Gram matrices and choose which geometric invariants to constrain via $P$ and $\mathcal{C}$.

Acknowledgments

In addition to the post this work is based on, there are some other very[6] nice[7] posts on the topic of manifold-constrained optimizers.

References

  1. Bernstein, Jeremy (2024).
  2. Bernstein, Jeremy (2025).
  3. Bernstein, Jeremy (2025). Modular Manifolds. Thinking Machines.
  4. Kingma, Diederik P. and Ba, Jimmy (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.

Footnotes

  1. Equivalently, $\mathrm{msign}(M)$ is the polar factor $Q$ in the decomposition $M = Q(M^\top M)^{1/2}$; via the SVD it reduces to $UV^\top$ (with zeros on null singular spaces). In practice we compute $\mathrm{msign}(\cdot)$ efficiently via Newton–Schulz or Polar-Express iterations, avoiding an explicit SVD.

  2. For the full details on the argument, see [4].

  3. One must use subgradients here because the nuclear norm is convex but not everywhere differentiable.

  4. In practice you may want bounds on the diagonal entries, e.g. $\lambda_i \in [\epsilon_{\min}, \epsilon_{\max}]$ with $0 < \epsilon_{\min} \le \epsilon_{\max} < \infty$, to avoid degeneracy and runaway scaling.

  5. One also has a closed form for retraction maps (which depend on $\mathcal{C}$) for this general setup.