Somesh Misra / ERP.ai
@MathproBro
Followers: 785
Following: 1K
Media: 148
Statuses: 2K
chief researcher at https://t.co/85QLNI0SE9 | working at the intersection of business processes, neural network topologies & machine learning
San Francisco, CA
Joined February 2013
Link: https://t.co/zZCaWljlkU One of the authors, @NeelNanda, also has an excellent walkthrough video on YouTube that builds strong mathematical intuition for this framework. IMHO this is the best material to understand transformers!
Transformers aren’t magic. They’re linear algebra with structure. “A Mathematical Framework for Transformer Circuits” by Anthropic’s research team shows how transformers can be decomposed into interpretable end-to-end paths from tokens to logits. Key insight: The residual stream is just a sum of the embedding and every layer’s output, so the whole token → logits map splits into a sum of paths.
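Here is a minimal numpy sketch of that decomposition; the one-layer, attention-only setup and the attention pattern frozen to the identity are simplifying assumptions of mine, not the paper's configuration. It just checks that the token → logits map splits into a direct path plus one OV path per head, because the residual stream is a sum.

```python
# Toy path decomposition for a one-layer, attention-only transformer.
# Shapes and the frozen identity attention pattern are illustrative choices,
# so each head's contribution reduces to its OV circuit applied to the embedding.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, d_vocab = 16, 4, 3, 50

W_E = rng.normal(size=(d_vocab, d_model))          # token embeddings
W_U = rng.normal(size=(d_model, d_vocab))          # unembedding
W_V = rng.normal(size=(n_heads, d_model, d_head))  # per-head value maps
W_O = rng.normal(size=(n_heads, d_head, d_model))  # per-head output maps

token = 7
x0 = W_E[token]                                    # initial residual stream

# End to end: residual stream = embedding + sum of head outputs, then unembed.
resid = x0 + sum(x0 @ W_V[h] @ W_O[h] for h in range(n_heads))
logits = resid @ W_U

# Path view: the direct path W_E W_U plus one OV path W_E W_V W_O W_U per head.
direct_path = W_E @ W_U
head_paths = [W_E @ W_V[h] @ W_O[h] @ W_U for h in range(n_heads)]
logits_from_paths = direct_path[token] + sum(p[token] for p in head_paths)

print(np.allclose(logits, logits_from_paths))      # True: logits are a sum of path contributions
```

With a real attention pattern the same bookkeeping goes through per (query, key) position; freezing it to the identity just keeps the sketch short.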
- CNNs use translation symmetry.
- Graph Neural Networks use permutation symmetry.
- Modern equivariant models extend this to rotations, scaling, and more.
Group theory is the reason these architectures work; it’s not the math obfuscation people think it is.
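A quick numerical check of those first two symmetries, using my own toy stand-ins (a circular convolution for the CNN layer, sum pooling for the graph readout):

```python
# Translation equivariance and permutation invariance, verified numerically.
import numpy as np

rng = np.random.default_rng(1)

def circular_conv(x, w):
    # 1-D circular convolution: a minimal stand-in for a CNN layer.
    n = len(x)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(len(w))) for i in range(n)])

x, w, shift = rng.normal(size=12), rng.normal(size=3), 4
# Convolving a shifted signal equals shifting the convolved signal.
print(np.allclose(circular_conv(np.roll(x, shift), w), np.roll(circular_conv(x, w), shift)))

def readout(node_feats):
    # Sum pooling: a minimal stand-in for a GNN aggregation step.
    return node_feats.sum(axis=0)

nodes = rng.normal(size=(6, 8))
perm = rng.permutation(6)
# Relabeling the nodes leaves the readout unchanged.
print(np.allclose(readout(nodes[perm]), readout(nodes)))
```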
Group theory is just a formal way to describe “what transformations should not change meaning”. Images do not change if you shift or rotate them. Graphs do not change if you relabel nodes. Physics does not change if you change coordinates. Neural networks that are built to obey these symmetries do not have to relearn them from data.
@behrouz_ali @mirrokni Why MIRAS matters for deep learning progress: It turns memory and learning into design primitives rather than emergent artifacts. By framing architecture choices as optimization objectives with retention mechanisms, MIRAS empowers new model designs that are more robust,
@behrouz_ali @mirrokni Diffusion models also fit the associative memory paradigm, but in a different way. Instead of discrete key-value stores, they learn a global score/energy landscape over the data.
- The cue is a noisy version of the data.
- The “memory” is a learned function that pulls samples back toward the data manifold.
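A sketch of that “noisy cue → recall” picture, under a strong simplifying assumption of mine: the learned function is the ideal denoiser for the empirical training set plus Gaussian noise. In that case it is literally a softmax-weighted lookup over the stored samples.

```python
# Associative recall with an idealized denoiser (illustrative, not a trained model).
import numpy as np

rng = np.random.default_rng(2)
train = rng.normal(size=(32, 8))        # the "stored" training samples
sigma = 0.3                             # noise level of the cue

def ideal_denoiser(x_noisy):
    # Posterior mean E[x0 | x_noisy] under empirical data + Gaussian noise:
    # a softmax-weighted average of the training points, keyed by the cue.
    w = np.exp(-np.sum((train - x_noisy) ** 2, axis=1) / (2 * sigma ** 2))
    return (w / w.sum()) @ train

cue = train[5] + sigma * rng.normal(size=8)     # noisy version of sample 5
recalled = ideal_denoiser(cue)

dists = np.linalg.norm(train - recalled, axis=1)
print(dists.argmin(), dists[5])                 # nearest stored sample is #5, at tiny distance
```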
Read the paper here: https://t.co/f9AKHf1Ffc Authors include Ali Behrouz @behrouz_ali, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni @mirrokni. They've reframed many sequence models (Transformers, Titans, etc.) as learned memory systems guided by an internal optimization objective.
arxiv.org: Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of...
Under MIRAS, sequence models are not special or ad-hoc. They are just online optimization routines over memory:
• Keys & values are stored and retrieved through an internal loss (attentional bias).
• “Forgetting” is a form of regularization (retention).
• Memory can be shallow (a single matrix) or deep (a small neural network).
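A minimal sketch of that loop in my own notation (not the MIRAS reference code): a matrix memory trained at test time by gradient steps on ‖Mk − v‖² (the attentional bias), with a decay factor playing the role of retention.

```python
# Online associative memory: write by gradient descent, forget by decay.
import numpy as np

rng = np.random.default_rng(3)
d_k, d_v = 16, 16
M = np.zeros((d_v, d_k))                    # the memory starts empty
lr, retain = 1.0, 0.995                     # step size and retention strength

keys = rng.normal(size=(8, d_k)) / np.sqrt(d_k)
vals = rng.normal(size=(8, d_v))

for t in range(400):
    k, v = keys[t % len(keys)], vals[t % len(keys)]   # stream of (key, value) pairs
    grad = np.outer(M @ k - v, k)           # gradient of 0.5 * ||M k - v||^2 w.r.t. M
    M = retain * M - lr * grad              # retention (shrink), then write

query = keys[3] + 0.01 * rng.normal(size=d_k)         # slightly corrupted key
print(np.linalg.norm(M @ query - vals[3]), np.linalg.norm(vals[3]))
# Retrieval error is a small fraction of the stored value's norm.
```

Setting retain below 1 is exactly the “forgetting as regularization” point: it keeps pulling the memory toward zero unless the incoming stream keeps rewriting it.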
Deep learning sequence models can be seen as associative memory systems that decide what to store, what to forget, and how to retrieve information efficiently. The MIRAS framework from Google Research makes this explicit by defining four core design choices: memory architecture, attentional bias (the internal objective), retention (how memory forgets), and the memory learning algorithm.
Here is my group-theoretic interpretation of Nested Learning: If each layer and optimizer is an associative memory learning a map x → δ, then the entire gradient flow becomes a compression of the group actions present in the data. Inputs live in a representation space V. Errors
3. Attention and RNNs are solutions to internal optimization problems
Softmax attention is the non-parametric minimizer, over the memory readout M, of
Σᵢ s(kᵢ, q) ||vᵢ − M||²
Linear recurrent models, RWKV, DeltaNet, Titans, etc. are gradient-descent solutions to associative memory objectives.
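The softmax-attention claim is easy to check numerically. Assuming the similarity is s(k, q) = exp(k·q) (my choice; other kernels give other estimators), the minimizer of that weighted least-squares objective is exactly the attention output:

```python
# Softmax attention as the solution of a weighted least-squares problem.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n, d_k, d_v = 10, 8, 5
K, V, q = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v)), rng.normal(size=d_k)

w = np.exp(K @ q)                          # unnormalized similarities s(k_i, q)
attn_out = (w / w.sum()) @ V               # softmax attention output for query q

# Minimize  sum_i s(k_i, q) * ||v_i - M||^2  over M directly.
objective = lambda M: np.sum(w * np.sum((V - M) ** 2, axis=1))
numeric = minimize(objective, x0=np.zeros(d_v)).x

print(np.allclose(attn_out, numeric, atol=1e-4))   # True: attention = the argmin
```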
2. Optimizers are also associative memories
Momentum, Adam, and AdaGrad are two-level memories:
- the inner level compresses the gradient sequence,
- the outer level updates the slow weights.
Adam is shown to be the optimal associative memory for L₂ regression on the gradient history. Optimization itself is a form of associative memory.
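A tiny check of the inner level, in a simplified form of my own (not the paper's derivation): the exponential moving average behind momentum and Adam's first moment is the exact solution of a one-step regression over the gradient stream.

```python
# Momentum's EMA as the minimizer of a small "memory" objective.
import numpy as np

rng = np.random.default_rng(5)
beta, d = 0.9, 6
m = np.zeros(d)                                  # memory of past gradients

for _ in range(50):
    g = rng.normal(size=d)                       # incoming gradient
    ema = beta * m + (1 - beta) * g              # familiar momentum / first-moment update
    # Inner objective: L(m') = beta*||m' - m||^2 + (1-beta)*||m' - g||^2.
    # Its gradient vanishes at m' = ema, and L is strictly convex,
    # so the EMA is the exact minimizer.
    inner_grad = 2 * beta * (ema - m) + 2 * (1 - beta) * (ema - g)
    assert np.allclose(inner_grad, 0.0, atol=1e-12)
    m = ema

print("EMA update = argmin of the inner gradient-memory objective")
```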
Paper link: https://t.co/Gke2eM8zLl
1. Backprop = associative memory
The weight update
Wₜ₊₁ = Wₜ − η ∇ᵧL(xₜ, yₜ) xₜᵀ
is shown to be the solution of the proximal problem
argmin_W ⟨W xₜ, δₜ⟩ + (1 / 2η) ||W − Wₜ||²
with δₜ = ∇ᵧL(xₜ, yₜ). This means each layer is learning a mapping xₜ → δₜ: it stores associations between its inputs and the error signals it receives.
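A quick numerical sanity check of that proximal claim (my own toy verification, not code from the paper): the one-step update attains a lower objective value than any nearby perturbation, as the closed-form argmin should.

```python
# The SGD/backprop step as the minimizer of  <W x, delta> + (1/(2*eta)) ||W - W_t||_F^2.
import numpy as np

rng = np.random.default_rng(6)
d_out, d_in, eta = 5, 7, 0.1
W_t = rng.normal(size=(d_out, d_in))
x_t = rng.normal(size=d_in)
delta_t = rng.normal(size=d_out)                 # delta_t = dL/dy evaluated at y = W_t x_t

def prox_objective(W):
    return (W @ x_t) @ delta_t + np.sum((W - W_t) ** 2) / (2 * eta)

W_next = W_t - eta * np.outer(delta_t, x_t)      # the usual backprop/SGD weight update

best = prox_objective(W_next)
perturbed = [prox_objective(W_next + 0.1 * rng.normal(size=W_t.shape)) for _ in range(1000)]
print(best < min(perturbed))                     # True: the SGD step is the proximal argmin
```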
The brilliant Google Research paper "Nested Learning: The Illusion of Deep Learning Architectures" proposes a striking idea: Deep learning is not a stack of layers. It is a hierarchy of nested optimization problems, each acting as an associative memory with its own time scale.
The wild part is that memorization is not an optimization failure. It is baked into the objective. With enough capacity, the lowest loss comes from copying the data exactly.
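A toy way to see it, under strong simplifying assumptions of mine (the empirical training distribution, a single noise level, the standard x0-prediction denoising loss): the denoiser that snaps back to the stored training points achieves a lower training objective than a smoother, capacity-limited one.

```python
# Memorizing vs. generalizing denoisers on the same denoising objective.
import numpy as np

rng = np.random.default_rng(7)
N, d, sigma, n_mc = 64, 10, 0.2, 100
train = rng.normal(size=(N, d))

def memorizing_denoiser(x_noisy):
    # Exact posterior mean E[x0 | x_noisy] under the empirical data distribution:
    # at low noise it snaps to the nearest training point (pure memorization).
    w = np.exp(-np.sum((train - x_noisy) ** 2, axis=1) / (2 * sigma ** 2))
    return (w / w.sum()) @ train

def linear_denoiser(x_noisy):
    # Capacity-limited alternative: roughly the optimal denoiser if the data are
    # modeled as one broad Gaussian instead of the specific training set.
    lam = sigma ** 2 / (1 + sigma ** 2)
    return lam * train.mean(axis=0) + (1 - lam) * x_noisy

def denoising_loss(denoiser):
    total = 0.0
    for x0 in train:
        noisy = x0 + sigma * rng.normal(size=(n_mc, d))
        preds = np.array([denoiser(z) for z in noisy])
        total += np.mean(np.sum((preds - x0) ** 2, axis=1))
    return total / N

print(denoising_loss(memorizing_denoiser), denoising_loss(linear_denoiser))
# The memorizing denoiser wins on the training objective: copying the data is the minimum.
```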
The striking result: memorization is not an accident or a bug in optimization. It is the true minimum of the loss once the model is sufficiently large. This explains replication, privacy leakage, and why creativity only appears when the model does not have enough capacity to memorize the training set.
Here is the link: https://t.co/4kUtF2i8Vp Another thing that the paper might imply is that the crossover point behaves like a measure of the dataset’s intrinsic information dimension. It even implies that small datasets force memorization, and that duplicates collapse the effective dataset size.
arxiv.org: When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between...
The paper “On the Edge of Memorization in Diffusion Models” uncovers something deeper than it explicitly says. It shows that diffusion models have a built-in data compression limit. Generalization happens only while the model is too small to store all training samples. Once it is large enough to store them, memorization takes over.
The paper pulls from serious math: Cantor-style diagonalization to prove inevitable hallucinations, Turing’s halting limits for infinite failure sets, Kolmogorov complexity to show finite models can’t encode high-complexity facts, PAC/VC bounds for long-tail data, and attention geometry to explain why long contexts get compressed.
arxiv.org: Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning...
New paper drops a reality check: even trillion-parameter LLMs hit hard mathematical ceilings.
- Diagonalization means hallucinations are inevitable.
- Finite information means long-tail facts stay fragile.
- Attention geometry means context isn’t really context.
Scaling helps, but it doesn’t remove these ceilings.