Gabriel Mongaras

@gmongaras

Followers: 90 · Following: 13 · Media: 12 · Statuses: 102

Some guy trying to teach rocks how to do things. I also post videos https://t.co/5KpBZvragu

Joined December 2024
@gmongaras
Gabriel Mongaras
3 days
Distributed Muon requires a DP gather so that each node can orthogonalize the weight update. I wonder if a local orthogonalization would be good enough.
arxiv.org
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We...
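A minimal sketch of the orthogonalization step under discussion, assuming the quintic Newton-Schulz iteration from the public Muon implementation; the "local" variant that skips the data-parallel gather and orthogonalizes each row shard on its own rank is hypothetical, just to make the question concrete.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix, as Muon does, via the quintic
    Newton-Schulz iteration (coefficients from the public Muon code; the bf16 cast
    used there is dropped to keep this sketch CPU-friendly)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # bring the norm into the iteration's basin
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

# The question in the tweet, made concrete on random data (hypothetical setup):
# orthogonalize the full, gathered update vs. orthogonalize each DP rank's row
# shard locally and skip the gather entirely.
G = torch.randn(1024, 1024)
full = newton_schulz_orthogonalize(G)                                    # needs the whole matrix
local = torch.cat([newton_schulz_orthogonalize(s) for s in G.chunk(4)])  # per-shard, no gather
print((full - local).norm() / full.norm())   # how far apart the two updates end up
```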
@gmongaras
Gabriel Mongaras
8 days
Threw a paper I've been working on onto ArXiv. Trying to get a little closer to understanding why softmax in attention works so well compared to other activation functions.
arxiv.org
Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main...
@gmongaras
Gabriel Mongaras
28 days
RT @nrehiew_: Since this figure is going around, the sudden drop corresponds to a very standard learning rate decay and has nothing to do w….
@gmongaras
Gabriel Mongaras
29 days
Hype! New optimizer that prevents unstable attention logits, a problem that good ol' QK norm couldn't handle with Muon at scale. Also, that loss curve is quite interesting 🤔. Excited to see what the paper includes when it comes out.
@eliebakouch
elie
29 days
Kimi team just trained a state of the art open source model 32B active parameter/1T total with 0 training instabilities, thanks to MuonClip, this is amazing
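For context, a hedged sketch of the general QK-clip idea that MuonClip is reported to use: whenever the largest pre-softmax attention logit exceeds a threshold, the query and key projections are shrunk. The function name, threshold, and the even square-root split of the rescale are my assumptions, not the Kimi team's code.

```python
import torch

def qk_clip_(W_q: torch.Tensor, W_k: torch.Tensor, max_logit: float, tau: float = 100.0):
    """Hypothetical QK-clip step: if the largest attention logit seen in the forward
    pass exceeds tau, rescale the query/key projection weights in place so future
    logits are pulled back toward tau; the factor is split evenly via a square root."""
    if max_logit > tau:
        gamma = tau / max_logit
        W_q.mul_(gamma ** 0.5)
        W_k.mul_(gamma ** 0.5)

# usage sketch: max_logit would normally be tracked per attention head each step
W_q, W_k = torch.randn(128, 512), torch.randn(128, 512)
qk_clip_(W_q, W_k, max_logit=250.0)   # logits too hot, so both projections shrink
```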
@gmongaras
Gabriel Mongaras
1 month
Surprised this was never tried before. Basically LoRA, except the up projection is the transpose of the down projection. It halves the optimizer state size and apparently it's better than LoRA in terms of accuracy.
arxiv.org
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two...
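A rough sketch of the tied-projection idea as the tweet describes it, with the adapter delta being A^T A around a frozen base layer; the class name, init, and hyperparameters are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

class TiedLoRALinear(nn.Module):
    """LoRA-style adapter whose up projection is the transpose of its down projection,
    so only one rank x d matrix (and its optimizer state) is stored. Needs a square
    base weight (in_features == out_features), e.g. attention projections."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        assert base.in_features == base.out_features
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # frozen pretrained weight
        # unlike LoRA's B = 0 init, the delta A^T A can't start at exactly zero
        # without also zeroing its own gradient, so start small instead
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # plain LoRA would use a separate up-projection matrix B;
        # here the up projection is tied to be the transpose of the down projection
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.A)

layer = TiedLoRALinear(nn.Linear(512, 512), rank=8)
print(layer(torch.randn(2, 512)).shape)   # torch.Size([2, 512])
```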
@gmongaras
Gabriel Mongaras
1 month
IMO the future of LLMs is sliding (large) window softmax attn (short context, high expressiveness) mixed with RNNs/linear attn (long context, lower expressiveness). When two paradigms exist, something in the middle usually prevails.
huggingface.co
@gmongaras
Gabriel Mongaras
1 month
I think trilinear attention would be a better name (also sounds a lot cooler!). I wonder if flash attention for 2-simplicial attention also exists as an extension of this paper? If I recall correctly, one nice thing about higher-order simplicial attn is that it avoids oversmoothing.
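For anyone unfamiliar with the term, a hedged sketch of what "trilinear" / 2-simplicial attention computes: each query scores pairs of positions with a trilinear form and mixes pairwise values. The scaling and the way pair values are formed are my assumptions, and the naive O(n^3) logits tensor is only feasible at toy sizes.

```python
import torch
import torch.nn.functional as F

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Each query i scores a *pair* of positions (j, k) with the trilinear form
    <q_i, k1_j, k2_k>, softmaxes jointly over all pairs, and mixes pairwise values
    (taken here as an elementwise product, one of several choices in the literature).
    Shapes are (batch, seq, dim); the full (n, n, n) logits tensor is O(n^3)."""
    b, n, d = q.shape
    logits = torch.einsum('bid,bjd,bkd->bijk', q, k1, k2) / d ** 0.5
    attn = F.softmax(logits.reshape(b, n, n * n), dim=-1).reshape(b, n, n, n)
    pair_values = torch.einsum('bjd,bkd->bjkd', v1, v2)     # value attached to pair (j, k)
    return torch.einsum('bijk,bjkd->bid', attn, pair_values)

q = k1 = k2 = v1 = v2 = torch.randn(1, 16, 32)
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)   # torch.Size([1, 16, 32])
```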
@gmongaras
Gabriel Mongaras
1 month
Just to be that guy 🤓☝️. Simplices have an orientation and include all lower-order terms (edges and vertices for a 2-simplex). Bodnar has a cool paper on this:
arxiv.org
Graph representation learning methods have mostly been limited to the modelling of node-wise interactions. Recently, there has been an increased interest in understanding how higher-order...
@gmongaras
Gabriel Mongaras
1 month
Hype! Simplicial attention is finally catching on and is actually computable. Title is fire, though I do think it is a little misleading.
arxiv.org
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count...
@gmongaras
Gabriel Mongaras
1 month
Cool paper that explores mostly full FP4 training of an LLM. Seems they run into noise issues late in training where grad noise is too high since grads get small. This high noise ratio stalls training, making them switch to BF16 grads later in training.
arxiv.org
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients...
@gmongaras
Gabriel Mongaras
2 months
(or use a perceptual loss which equally seems like something that works, but isn't actually addressing the problem).
@gmongaras
Gabriel Mongaras
2 months
It's weird to me how we know L2 and L1 losses model low-frequency info, creating blurry images, and the solution is just to slap a discriminator on the output.
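To make the complaint concrete, a minimal sketch of the standard recipe being criticized: an L1 reconstruction term (biased toward low frequencies, hence the blur) with a discriminator term bolted on. The discriminator and the loss weighting are placeholders.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_img, real_img, disc, adv_weight=0.1):
    """L1 keeps the output close to the target on average (and therefore blurry);
    the adversarial term asks a discriminator whether the output looks real.
    `disc` is any model mapping images to logits; the weighting is a placeholder."""
    recon = F.l1_loss(fake_img, real_img)
    logits = disc(fake_img)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return recon + adv_weight * adv

# toy usage with a stand-in "discriminator"
disc = lambda img: img.flatten(1).mean(dim=1, keepdim=True)
fake, real = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
print(generator_loss(fake, real, disc))
```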
@gmongaras
Gabriel Mongaras
2 months
Log-linear attn is cool, but seems like too much of a hassle for a slight improvement. Rather than adding various features like in old RNNs, I do think higher-order linear attention is the way to go for nearing softmax (something I'm also looking into)!
arxiv.org
The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant...
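Since "higher-order linear attention" carries the weight of that claim, here is one hedged reading of it: ordinary causal linear attention, but with a second-order (outer-product) feature map so the recurrent state also tracks pairwise key interactions. This is my own illustration, not the linked paper's method, and normalization is omitted for brevity.

```python
import torch

def linear_attention(q, k, v, feature_map):
    """Causal linear attention with an arbitrary feature map phi: the state S
    accumulates phi(k_t) v_t^T and the output at step t is phi(q_t) S
    (the usual normalizer is omitted to keep the sketch short)."""
    b, n, d = q.shape
    phi_q, phi_k = feature_map(q), feature_map(k)
    S = torch.zeros(b, phi_k.size(-1), d)
    out = []
    for t in range(n):
        S = S + phi_k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)   # rank-1 state update
        out.append(phi_q[:, t].unsqueeze(1) @ S)                   # (b, 1, d)
    return torch.cat(out, dim=1)

first_order = lambda x: torch.relu(x)                                     # standard linear attn
second_order = lambda x: (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2)  # pairwise k_i * k_j features

q = k = v = torch.randn(1, 8, 16)
print(linear_attention(q, k, v, second_order).shape)   # torch.Size([1, 8, 16])
```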
@gmongaras
Gabriel Mongaras
2 months
Actually this may be really annoying to do if you add text embeddings. May experiment with this if I ever decide to train another diffusion model.
@gmongaras
Gabriel Mongaras
2 months
Seems Flash Attn makes it easy to train on native image resolution with different sized images in a batch. Should be much easier, more efficient, and better to do than masking and/or bucketing.
arxiv.org
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the...
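A hedged sketch of the packing trick being referred to: concatenate every image's patch tokens into one long sequence and restrict attention to within-image blocks, so no padding or resizing is needed. Real varlen FlashAttention kernels take cumulative sequence lengths rather than a materialized mask; this emulation with PyTorch SDPA and an explicit block-diagonal boolean mask is only practical at toy scale.

```python
import torch
import torch.nn.functional as F

def packed_attention(q, k, v, seq_lens):
    """q, k, v: (1, total_tokens, dim) holding several images' patch tokens back to back.
    seq_lens gives the token count per image. A block-diagonal boolean mask keeps each
    image attending only to itself, so differently sized images share one batch with
    no padding or resizing."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in seq_lens:
        mask[start:start + n, start:start + n] = True   # True = allowed to attend
        start += n
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# e.g. a 16x16-token image and a 24x8-token image packed into one sequence
seq_lens = [256, 192]
q = k = v = torch.randn(1, sum(seq_lens), 64)
print(packed_attention(q, k, v, seq_lens).shape)   # torch.Size([1, 448, 64])
```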
@gmongaras
Gabriel Mongaras
2 months
One thing about training SD3. It's interesting how increasing resolution makes images a lot better and allows it to add more detail. IMO this is an artifact of the VAE being trained to downsample images rather than being trained to extract the image features directly.
@gmongaras
Gabriel Mongaras
2 months
Went through my SD3 code here. Hoping to get back to normal paper readings next week!
@gmongaras
Gabriel Mongaras
2 months
I wanted to go over this model, but there was very little in the paper to talk about and it's not open sourced :/. Could've been cool to show how the paradigm for speech is shifting away from feature engineering like the Mel Spectrogram to model go brrrrrr.
arxiv.org
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts...