Somesh Misra / ERP.ai
@MathproBro
Followers: 785
Following: 1K
Media: 148
Statuses: 2K
chief researcher at https://t.co/85QLNI0SE9 | working at the intersection of business processes, neural network topologies & machine learning
San Francisco, CA
Joined February 2013
Link: https://t.co/zZCaWljlkU One of the authors, @NeelNanda, also has an excellent walkthrough video on YouTube that builds strong mathematical intuition for this framework. IMHO this is the best material to understand transformers!
Transformers aren’t magic. They’re linear algebra with structure. “A Mathematical Framework for Transformer Circuits” by Anthropic’s research team shows how transformers can be decomposed into interpretable end-to-end paths from tokens to logits. Key insight: The residual stream is just a sum of the embedding and every layer’s output, so the whole token → logits map splits into a sum of paths.
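Here is a minimal numpy sketch of that decomposition; the one-layer, attention-only setup and the attention pattern frozen to the identity are simplifying assumptions of mine, not the paper's configuration. It just checks that the token → logits map splits into a direct path plus one OV path per head, because the residual stream is a sum.

```python
# Toy path decomposition for a one-layer, attention-only transformer.
# Shapes and the frozen identity attention pattern are illustrative choices,
# so each head's contribution reduces to its OV circuit applied to the embedding.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, d_vocab = 16, 4, 3, 50

W_E = rng.normal(size=(d_vocab, d_model))          # token embeddings
W_U = rng.normal(size=(d_model, d_vocab))          # unembedding
W_V = rng.normal(size=(n_heads, d_model, d_head))  # per-head value maps
W_O = rng.normal(size=(n_heads, d_head, d_model))  # per-head output maps

token = 7
x0 = W_E[token]                                    # initial residual stream

# End to end: residual stream = embedding + sum of head outputs, then unembed.
resid = x0 + sum(x0 @ W_V[h] @ W_O[h] for h in range(n_heads))
logits = resid @ W_U

# Path view: the direct path W_E W_U plus one OV path W_E W_V W_O W_U per head.
direct_path = W_E @ W_U
head_paths = [W_E @ W_V[h] @ W_O[h] @ W_U for h in range(n_heads)]
logits_from_paths = direct_path[token] + sum(p[token] for p in head_paths)

print(np.allclose(logits, logits_from_paths))      # True: logits are a sum of path contributions
```

With a real attention pattern the same bookkeeping goes through per (query, key) position; freezing it to the identity just keeps the sketch short.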
- CNNs use translation symmetry.
- Graph Neural Networks use permutation symmetry.
- Modern equivariant models extend this to rotations, scaling, and more.
Group theory is the reason these architectures work; it’s not the math obfuscation people think it is.
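A quick numerical check of those first two symmetries, using my own toy stand-ins (a circular convolution for the CNN layer, sum pooling for the graph readout):

```python
# Translation equivariance and permutation invariance, verified numerically.
import numpy as np

rng = np.random.default_rng(1)

def circular_conv(x, w):
    # 1-D circular convolution: a minimal stand-in for a CNN layer.
    n = len(x)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(len(w))) for i in range(n)])

x, w, shift = rng.normal(size=12), rng.normal(size=3), 4
# Convolving a shifted signal equals shifting the convolved signal.
print(np.allclose(circular_conv(np.roll(x, shift), w), np.roll(circular_conv(x, w), shift)))

def readout(node_feats):
    # Sum pooling: a minimal stand-in for a GNN aggregation step.
    return node_feats.sum(axis=0)

nodes = rng.normal(size=(6, 8))
perm = rng.permutation(6)
# Relabeling the nodes leaves the readout unchanged.
print(np.allclose(readout(nodes[perm]), readout(nodes)))
```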
Group theory is just a formal way to describe “what transformations should not change meaning”. Images do not change if you shift or rotate them. Graphs do not change if you relabel nodes. Physics does not change if you change coordinates. Neural networks that are built to obey these symmetries do not have to relearn them from data.
@behrouz_ali @mirrokni Why MIRAS matters for deep learning progress: It turns memory and learning into design primitives rather than emergent artifacts. By framing architecture choices as optimization objectives with retention mechanisms, MIRAS empowers new model designs that are more robust,
@behrouz_ali @mirrokni Diffusion models also fit the associative memory paradigm, but in a different way. Instead of discrete key-value stores, they learn a global score/energy landscape over the data.
- The cue is a noisy version of the data.
- The “memory” is a learned function that pulls samples back toward the data manifold.
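A sketch of that “noisy cue → recall” picture, under a strong simplifying assumption of mine: the learned function is the ideal denoiser for the empirical training set plus Gaussian noise. In that case it is literally a softmax-weighted lookup over the stored samples.

```python
# Associative recall with an idealized denoiser (illustrative, not a trained model).
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(size=(32, 8))        # the "stored" training samples
sigma = 0.3                             # noise level of the cue

def ideal_denoiser(x_noisy):
    # Posterior mean E[x0 | x_noisy] under empirical data + Gaussian noise:
    # a softmax-weighted average of the training points, keyed by the cue.
    w = np.exp(-np.sum((train - x_noisy) ** 2, axis=1) / (2 * sigma ** 2))
    return (w / w.sum()) @ train

cue = train[5] + sigma * rng.normal(size=8)     # noisy version of sample 5
recalled = ideal_denoiser(cue)

dists = np.linalg.norm(train - recalled, axis=1)
print(dists.argmin(), dists[5])                 # nearest stored sample is #5, at tiny distance
```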
Read the paper here: https://t.co/f9AKHf1Ffc Authors include Ali Behrouz @behrouz_ali, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni @mirrokni. They've reframed many sequence models (Transformers, Titans, etc.) as learned memory systems guided by an internal optimization objective.
arxiv.org: Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of...
Under MIRAS, sequence models are not special or ad-hoc. They are just online optimization routines over memory:
• Keys & values are stored and retrieved through an internal loss (attentional bias).
• “Forgetting” is a form of regularization (retention).
• Memory can be shallow (a single matrix) or deep (a small neural network).
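A minimal sketch of that loop in my own notation (not the MIRAS reference code): a matrix memory trained at test time by gradient steps on ‖Mk − v‖² (the attentional bias), with a decay factor playing the role of retention.

```python
# Online associative memory: write by gradient descent, forget by decay.
import numpy as np

rng = np.random.default_rng(3)
d_k, d_v = 16, 16
M = np.zeros((d_v, d_k))                    # the memory starts empty
lr, retain = 1.0, 0.995                     # step size and retention strength

keys = rng.normal(size=(8, d_k)) / np.sqrt(d_k)
vals = rng.normal(size=(8, d_v))

for t in range(400):
    k, v = keys[t % len(keys)], vals[t % len(keys)]   # stream of (key, value) pairs
    grad = np.outer(M @ k - v, k)           # gradient of 0.5 * ||M k - v||^2 w.r.t. M
    M = retain * M - lr * grad              # retention (shrink), then write

query = keys[3] + 0.01 * rng.normal(size=d_k)         # slightly corrupted key
print(np.linalg.norm(M @ query - vals[3]), np.linalg.norm(vals[3]))
# Retrieval error is a small fraction of the stored value's norm.
```

Setting retain below 1 is exactly the “forgetting as regularization” point: it keeps pulling the memory toward zero unless the incoming stream keeps rewriting it.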
Deep learning sequence models can be seen as associative memory systems that decide what to store, what to forget, and how to retrieve information efficiently. The MIRAS framework from Google Research makes this explicit by defining four core design choices: memory architecture, attentional bias (the internal objective), retention (how memory forgets), and the memory learning algorithm.
Here is my group-theoretic interpretation of Nested Learning: If each layer and optimizer is an associative memory learning a map x → δ, then the entire gradient flow becomes a compression of the group actions present in the data. Inputs live in a representation space V. Errors
3. Attention and RNNs are solutions to internal optimization problems
Softmax attention is the non-parametric minimizer, over the memory readout M, of
Σᵢ s(kᵢ, q) ||vᵢ − M||²
Linear recurrent models, RWKV, DeltaNet, Titans, etc. are gradient-descent solutions to associative memory objectives.
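The softmax-attention claim is easy to check numerically. Assuming the similarity is s(k, q) = exp(k·q) (my choice; other kernels give other estimators), the minimizer of that weighted least-squares objective is exactly the attention output:

```python
# Softmax attention as the solution of a weighted least-squares problem.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d_k, d_v = 10, 8, 5
K, V, q = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v)), rng.normal(size=d_k)

w = np.exp(K @ q)                          # unnormalized similarities s(k_i, q)
attn_out = (w / w.sum()) @ V               # softmax attention output for query q

# Minimize  sum_i s(k_i, q) * ||v_i - M||^2  over M directly.
objective = lambda M: np.sum(w * np.sum((V - M) ** 2, axis=1))
numeric = minimize(objective, x0=np.zeros(d_v)).x

print(np.allclose(attn_out, numeric, atol=1e-4))   # True: attention = the argmin
```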
2. Optimizers are also associative memories
Momentum, Adam, and AdaGrad are two-level memories:
- the inner level compresses the gradient sequence,
- the outer level updates the slow weights.
Adam is shown to be the optimal associative memory for L₂ regression on the gradient history. Optimization itself is a form of associative memory.
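A tiny check of the inner level, in a simplified form of my own (not the paper's derivation): the exponential moving average behind momentum and Adam's first moment is the exact solution of a one-step regression over the gradient stream.

```python
# Momentum's EMA as the minimizer of a small "memory" objective.
import numpy as np

rng = np.random.default_rng(5)
beta, d = 0.9, 6
m = np.zeros(d)                                  # memory of past gradients

for _ in range(50):
    g = rng.normal(size=d)                       # incoming gradient
    ema = beta * m + (1 - beta) * g              # familiar momentum / first-moment update
    # Inner objective: L(m') = beta*||m' - m||^2 + (1-beta)*||m' - g||^2.
    # Its gradient vanishes at m' = ema, and L is strictly convex,
    # so the EMA is the exact minimizer.
    inner_grad = 2 * beta * (ema - m) + 2 * (1 - beta) * (ema - g)
    assert np.allclose(inner_grad, 0.0, atol=1e-12)
    m = ema

print("EMA update = argmin of the inner gradient-memory objective")
```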
Paper link: https://t.co/Gke2eM8zLl
1. Backprop = associative memory
The weight update
Wₜ₊₁ = Wₜ − η ∇ᵧL(xₜ, yₜ) xₜᵀ
is shown to be the solution of the proximal problem
argmin_W ⟨W xₜ, δₜ⟩ + (1 / 2η) ||W − Wₜ||²
with δₜ = ∇ᵧL(xₜ, yₜ). This means each layer is learning a mapping xₜ → δₜ: it stores associations between its inputs and the error signals it receives.
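A quick numerical sanity check of that proximal claim (my own toy verification, not code from the paper): the one-step update attains a lower objective value than any nearby perturbation, as the closed-form argmin should.

```python
# The SGD/backprop step as the minimizer of  <W x, delta> + (1/(2*eta)) ||W - W_t||_F^2.
import numpy as np

rng = np.random.default_rng(6)
d_out, d_in, eta = 5, 7, 0.1
W_t = rng.normal(size=(d_out, d_in))
x_t = rng.normal(size=d_in)
delta_t = rng.normal(size=d_out)                 # delta_t = dL/dy evaluated at y = W_t x_t

def prox_objective(W):
    return (W @ x_t) @ delta_t + np.sum((W - W_t) ** 2) / (2 * eta)

W_next = W_t - eta * np.outer(delta_t, x_t)      # the usual backprop/SGD weight update

best = prox_objective(W_next)
perturbed = [prox_objective(W_next + 0.1 * rng.normal(size=W_t.shape)) for _ in range(1000)]
print(best < min(perturbed))                     # True: the SGD step is the proximal argmin
```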
The brilliant Google Research paper "Nested Learning: The Illusion of Deep Learning Architectures" proposes a striking idea: Deep learning is not a stack of layers. It is a hierarchy of nested optimization problems, each acting as an associative memory with its own time scale.
The wild part is that memorization is not an optimization failure. It is baked into the objective. With enough capacity, the lowest loss comes from copying the data exactly.
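A toy way to see it, under strong simplifying assumptions of mine (the empirical training distribution, a single noise level, the standard x0-prediction denoising loss): the denoiser that snaps back to the stored training points achieves a lower training objective than a smoother, capacity-limited one.

```python
# Memorizing vs. generalizing denoisers on the same denoising objective.
import numpy as np

rng = np.random.default_rng(7)
N, d, sigma, n_mc = 64, 10, 0.2, 100
train = rng.normal(size=(N, d))

def memorizing_denoiser(x_noisy):
    # Exact posterior mean E[x0 | x_noisy] under the empirical data distribution:
    # at low noise it snaps to the nearest training point (pure memorization).
    w = np.exp(-np.sum((train - x_noisy) ** 2, axis=1) / (2 * sigma ** 2))
    return (w / w.sum()) @ train

def linear_denoiser(x_noisy):
    # Capacity-limited alternative: roughly the optimal denoiser if the data are
    # modeled as one broad Gaussian instead of the specific training set.
    lam = sigma ** 2 / (1 + sigma ** 2)
    return lam * train.mean(axis=0) + (1 - lam) * x_noisy

def denoising_loss(denoiser):
    total = 0.0
    for x0 in train:
        noisy = x0 + sigma * rng.normal(size=(n_mc, d))
        preds = np.array([denoiser(z) for z in noisy])
        total += np.mean(np.sum((preds - x0) ** 2, axis=1))
    return total / N

print(denoising_loss(memorizing_denoiser), denoising_loss(linear_denoiser))
# The memorizing denoiser wins on the training objective: copying the data is the minimum.
```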
The striking result: memorization is not an accident or a bug in optimization. It is the true minimum of the loss once the model is sufficiently large. This explains replication, privacy leakage, and why creativity only appears when the model does not have enough capacity to memorize the training set.
Here is the link: https://t.co/4kUtF2i8Vp Another thing that the paper might imply is that the crossover point behaves like a measure of the dataset’s intrinsic information dimension. It even implies that small datasets force memorization, and that duplicates collapse the effective dataset size.
arxiv.org: When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between...
The paper “On the Edge of Memorization in Diffusion Models” uncovers something deeper than it explicitly says. It shows that diffusion models have a built-in data compression limit. Generalization happens only while the model is too small to store all training samples. Once it is large enough to store them, memorization takes over.
The paper pulls from serious math: Cantor-style diagonalization to prove inevitable hallucinations, Turing’s halting limits for infinite failure sets, Kolmogorov complexity to show finite models can’t encode high-complexity facts, PAC/VC bounds for long-tail data, and attention geometry to explain why long contexts get compressed.
arxiv.org: Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning...
New paper drops a reality check: even trillion-parameter LLMs hit hard mathematical ceilings.
- Diagonalization means hallucinations are inevitable.
- Finite information means long-tail facts stay fragile.
- Attention geometry means context isn’t really context.
Scaling helps, but it doesn’t remove these ceilings.