
Gabriel Mongaras
@gmongaras
Followers: 90 · Following: 13 · Media: 12 · Statuses: 102
Some guy trying to teach rocks how to do things. I also post videos https://t.co/5KpBZvragu
Joined December 2024
Distributed Muon requires a data-parallel (DP) gather so that each node can orthogonalize the full weight update. I wonder if a local orthogonalization would be good enough.
arxiv.org
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We...
1
0
0
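For context, a minimal sketch of the two options in the post: Muon orthogonalizes each update matrix (commonly via a Newton-Schulz iteration), which in a sharded setup means gathering the full matrix first; the "local" alternative would orthogonalize each shard independently. The shard split and shapes below are illustrative, not taken from the paper.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (compute its polar factor) with the
    Newton-Schulz-style iteration commonly used in Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from public Muon code
    X = G / (G.norm() + 1e-7)               # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # keep the X @ X.T product small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Gathered version: every node orthogonalizes the full 1024x1024 update.
full_update = torch.randn(1024, 1024)
ortho_full = newton_schulz(full_update)

# "Local" alternative: each node orthogonalizes only its own row shard
# (a 4-way split here, standing in for the DP sharding).
shards = full_update.chunk(4, dim=0)
ortho_local = torch.cat([newton_schulz(s) for s in shards], dim=0)

# The results differ; the open question is whether the local version is still
# a good enough preconditioner in practice.
print((ortho_full - ortho_local).norm() / ortho_full.norm())
```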
OpenAI finally open 😳.
openai.com
gpt-oss-120b and gpt-oss-20b push the frontier of open-weight reasoning models
1
0
0
Threw a paper I've been working on onto arXiv. Trying to get a little closer to understanding why softmax in attention works so well compared to other activation functions.
arxiv.org
Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main...
0
4
18
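For reference, a minimal sketch of the comparison the paper is about: softmax attention versus a drop-in elementwise activation on the score matrix. The ReLU-with-1/T-scale baseline below is an illustrative choice, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, score_fn):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # scaled (..., T, T) logits
    return score_fn(scores) @ v

B, T, D = 2, 16, 32
q, k, v = (torch.randn(B, T, D) for _ in range(3))

# Standard softmax attention: each row of the score matrix is a distribution.
out_softmax = attention(q, k, v, lambda s: F.softmax(s, dim=-1))

# A common alternative baseline: elementwise ReLU scores, scaled by 1/T so the
# output magnitude stays comparable.
out_relu = attention(q, k, v, lambda s: F.relu(s) / s.shape[-1])

print(out_softmax.shape, out_relu.shape)
```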
RT @nrehiew_: Since this figure is going around, the sudden drop corresponds to a very standard learning rate decay and has nothing to do w….
0
8
0
Hype! New optimizer that prevents unstable attention logits, a problem that good ol' QK-norm couldn't handle with Muon at scale. Also, that loss curve is quite interesting 🤔. Excited to see what the paper includes when it comes out.
The Kimi team just trained a state-of-the-art open-source model (32B active parameters / 1T total) with 0 training instabilities, thanks to MuonClip. This is amazing.
1
0
3
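Since the MuonClip details weren't out yet, here's only a minimal sketch of the QK-norm baseline it's being compared against. L2-normalizing queries and keys is one common variant; real implementations often use RMSNorm with a learnable gain instead.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale: float = 10.0):
    """Attention with QK-norm: normalize queries and keys per head so the
    logits are bounded cosine similarities times a fixed/learnable scale."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = scale * (q @ k.transpose(-2, -1))   # every entry lies in [-scale, scale]
    return F.softmax(logits, dim=-1) @ v

B, H, T, D = 1, 8, 64, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)
```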
Surprised this was never tried before. Basically LoRA, except the up projection is the transpose of the down projection. It halves the optimizer state size and apparently it's better than LoRA in terms of accuracy.
arxiv.org
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two...
0
1
0
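My reading of the description above, as a rough sketch: a single trainable matrix A with ΔW = A Aᵀ, so the up projection is literally the down projection transposed. The class name, init scale, and the square-weight restriction are assumptions for illustration; the paper's exact init and scaling may differ.

```python
import torch
import torch.nn as nn

class TiedLoRALinear(nn.Module):
    """LoRA-style adapter where the up projection is the transpose of the down
    projection: delta_W = A @ A.T (square base weight assumed here). Only A is
    trained, so adapter params and optimizer state are halved vs. standard LoRA."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 1.0):
        super().__init__()
        assert base.in_features == base.out_features, "sketch assumes a square weight"
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze the pretrained weight
        # small init so the adapter starts close to zero (the paper may do this differently)
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.alpha = alpha

    def forward(self, x):
        # down-project with A, up-project with its transpose
        return self.base(x) + self.alpha * (x @ self.A) @ self.A.T

layer = TiedLoRALinear(nn.Linear(512, 512), rank=16)
print(layer(torch.randn(4, 512)).shape)                   # torch.Size([4, 512])
```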
IMO the future of LLMs is sliding (large) window softmax attn (short context, high expressiveness) mixed with RNNs/linear attn (long context, lower expressiveness). When two paradigms exist, something in the middle usually prevails.
huggingface.co
1
0
1
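A toy sketch of the hybrid being predicted: a stack that is mostly linear-attention/RNN layers, with an occasional sliding-window softmax attention layer for local expressiveness. The 3:1 ratio and window size are guesses, not taken from any particular model.

```python
import torch

def sliding_window_softmax_attn(q, k, v, window: int):
    """Causal softmax attention restricted to the last `window` keys:
    short context, but full softmax expressiveness at O(T * window) cost."""
    T = q.shape[0]
    i = torch.arange(T)
    dist = i[:, None] - i[None, :]                 # how far key j lies behind query i
    mask = (dist < 0) | (dist >= window)           # future keys, or keys outside the window
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

# A plausible layer schedule: cheap long-context mixers most of the time,
# sliding-window softmax attention every 4th layer.
layer_plan = ["linear_attn" if i % 4 != 3 else "sliding_window(w=512)" for i in range(12)]
print(layer_plan)

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(sliding_window_softmax_attn(q, k, v, window=32).shape)
```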
Just to be that guy 🤓☝️: simplices have an orientation and include all lower-order terms (edges and vertices for a 2-simplex). Bodnar has a cool paper on this:
arxiv.org
Graph representation learning methods have mostly been limited to the modelling of node-wise interactions. Recently, there has been an increased interest in understanding how higher-order...
1
1
1
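Spelling out the point above for a 2-simplex:

```latex
% A 2-simplex [v_0, v_1, v_2] is ordered (oriented) and contains all of its faces.
% Its boundary is the alternating sum of its 1-simplices (edges):
\[
\partial_2 [v_0, v_1, v_2] = [v_1, v_2] - [v_0, v_2] + [v_0, v_1],
\]
% and each edge in turn has the vertices as its boundary, e.g.
\[
\partial_1 [v_0, v_1] = [v_1] - [v_0],
\]
% so that \(\partial_1 \circ \partial_2 = 0\). Swapping two vertices flips the orientation:
\[
[v_0, v_1, v_2] = -\,[v_1, v_0, v_2].
\]
```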
Hype! Simplicial attention is finally catching on and is actually computable. Title is fire, though I do think it is a little misleading.
arxiv.org
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count...
1
0
2
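A toy version of what simplicial attention looks like: each query scores *pairs* of positions through a trilinear form and softmaxes over all pairs. The elementwise value combination and the naive O(T³) computation below are assumptions for illustration, not the paper's exact (computable) construction.

```python
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Toy 2-simplicial attention: trilinear scores over (query, key, key) triples,
    softmax over all (j, k) pairs for each query i."""
    # l[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d]
    logits = torch.einsum("id,jd,kd->ijk", q, k1, k2) / q.shape[-1] ** 0.5
    T = q.shape[0]
    attn = logits.reshape(T, -1).softmax(dim=-1).reshape(T, T, T)
    # combine the pair of values; the elementwise product is an assumption
    return torch.einsum("ijk,jd,kd->id", attn, v1, v2)

T, D = 32, 16
q, k1, k2, v1, v2 = (torch.randn(T, D) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)   # torch.Size([32, 16])
```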
Cool paper that explores mostly-full-FP4 training of an LLM. They run into noise issues late in training: as the gradients get small, the relative gradient noise gets too high, which stalls training and makes them switch to BF16 gradients for the final stretch.
arxiv.org
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients...
0
0
0
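A rough numerical illustration of the failure mode above: fake-quantizing a tensor to an E2M1-style FP4 grid with a per-tensor scale and measuring the rounding noise. The grid and scaling scheme are simplified assumptions, not the paper's recipe.

```python
import torch

# Positive magnitudes representable in an E2M1-style FP4 format (up to a per-tensor scale).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: torch.Tensor) -> torch.Tensor:
    """Round each element to the nearest representable FP4 magnitude after per-tensor scaling."""
    scale = x.abs().max() / FP4_GRID.max() + 1e-12
    dist = ((x.abs() / scale).unsqueeze(-1) - FP4_GRID).abs()
    return x.sign() * FP4_GRID[dist.argmin(dim=-1)] * scale

g = torch.randn(100_000)                       # stand-in for a gradient tensor
rel_err = (fake_quant_fp4(g) - g).norm() / g.norm()
print(f"relative FP4 rounding noise: {rel_err:.2%}")   # sizable, with only 8 magnitude levels

# With noise like this injected into every gradient, the update direction degrades once
# the true gradient signal shrinks late in training, hence the fallback to BF16 grads.
```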
Log-linear attn is cool, but seems like too much of a hassle for a slight improvement. Rather than adding various features like in old RNNs, I do think higher-order linear attention is the way to go for getting close to softmax (something I'm also looking into)!
arxiv.org
The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant...
0
0
2
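One possible reading of "higher-order linear attention" (my guess, not something from the linked paper): second-order, Taylor-style feature maps, so the linear-attention kernel better approximates exp(q·k) at the cost of a D²-sized recurrent state.

```python
import torch

def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    """phi(x) = [1, x, vec(x x^T)/sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, a 2nd-order Taylor approx of exp(q.k)."""
    T, D = x.shape
    quad = (x.unsqueeze(-1) * x.unsqueeze(-2)).reshape(T, D * D) / 2 ** 0.5
    return torch.cat([torch.ones(T, 1), x, quad], dim=-1)

def causal_linear_attention(q, k, v, phi):
    """Causal linear attention with feature map phi: running state of size |phi| x D."""
    q, k = phi(q), phi(k)
    kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=0)   # running sum of phi(k_t) v_t^T
    z = torch.cumsum(k, dim=0)                                     # running normalizer
    out = torch.einsum("tf,tfd->td", q, kv)
    return out / (torch.einsum("tf,tf->t", q, z).unsqueeze(-1) + 1e-6)

T, D = 64, 16
q, k, v = (torch.randn(T, D) * D ** -0.5 for _ in range(3))
print(causal_linear_attention(q, k, v, taylor_feature_map).shape)   # state is (1 + D + D^2) x D
```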
Seems Flash Attn makes it easy to train on native image resolution with different sized images in a batch. Should be much easier, more efficient, and better to do than masking and/or bucketing.
arxiv.org
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the...
1
0
1
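A sketch of the packing the post refers to: concatenate variable-resolution samples into one unpadded sequence and hand the kernel cumulative sequence lengths so attention stays within each image. The token counts are made up, and the flash_attn_varlen_func call is left in a comment since it assumes the flash-attn package is installed.

```python
from itertools import accumulate
import torch

# Three images at different native resolutions -> different patch-token counts.
token_counts = [32 * 32, 48 * 24, 16 * 16]                 # 1024, 1152, 256 patches
dim = 768

# Pack all samples into one long sequence with no padding.
packed = torch.cat([torch.randn(n, dim) for n in token_counts], dim=0)

# Cumulative sequence lengths mark where each sample starts/ends,
# so attention never crosses image boundaries.
cu_seqlens = torch.tensor([0, *accumulate(token_counts)], dtype=torch.int32)
max_seqlen = max(token_counts)
print(packed.shape, cu_seqlens)       # torch.Size([2432, 768]) tensor([0, 1024, 2176, 2432], ...)

# With flash-attn installed, q/k/v of shape (total_tokens, heads, head_dim) plus the
# cu_seqlens above go into its varlen kernel (shown for illustration only):
# from flash_attn import flash_attn_varlen_func
# out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen)
```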
I wanted to go over this model, but there was very little in the paper to talk about and it's not open-sourced :/. Could've been cool to show how the paradigm for speech is shifting away from feature engineering like the mel spectrogram to model go brrrrrr.
arxiv.org
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts...
0
1
1