
Gabriel Mongaras
@gmongaras
Followers: 90 · Following: 13 · Media: 12 · Statuses: 102
Some guy trying to teach rocks how to do things. I also post videos https://t.co/5KpBZvragu
Joined December 2024
Distributed Muon requires a data-parallel (DP) gather so that each node can orthogonalize the full weight update. I wonder if a local orthogonalization would be good enough.
arxiv.org
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We...
1
0
0
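For context, a minimal sketch of the two options in the post: Muon orthogonalizes each update matrix (commonly via a Newton-Schulz iteration), which in a sharded setup means gathering the full matrix first; the "local" alternative would orthogonalize each shard independently. The shard split and shapes below are illustrative, not taken from the paper.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (compute its polar factor) with the
    Newton-Schulz-style iteration commonly used in Muon implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315      # quintic coefficients from public Muon code
    X = G / (G.norm() + 1e-7)               # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                           # keep the X @ X.T product small
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Gathered version: every node orthogonalizes the full 1024x1024 update.
full_update = torch.randn(1024, 1024)
ortho_full = newton_schulz(full_update)

# "Local" alternative: each node orthogonalizes only its own row shard
# (a 4-way split here, standing in for the DP sharding).
shards = full_update.chunk(4, dim=0)
ortho_local = torch.cat([newton_schulz(s) for s in shards], dim=0)

# The results differ; the open question is whether the local version is still
# a good enough preconditioner in practice.
print((ortho_full - ortho_local).norm() / ortho_full.norm())
```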
OpenAI finally open 😳.
openai.com
gpt-oss-120b and gpt-oss-20b push the frontier of open-weight reasoning models
1
0
0
Threw a paper I've been working on onto arXiv. Trying to get a little closer to understanding why softmax in attention works so well compared to other activation functions.
arxiv.org
Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main...
0
4
18
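For reference, a minimal sketch of the comparison the paper is about: softmax attention versus a drop-in elementwise activation on the score matrix. The ReLU-with-1/T-scale baseline below is an illustrative choice, not the paper's construction.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, score_fn):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # scaled (..., T, T) logits
    return score_fn(scores) @ v

B, T, D = 2, 16, 32
q, k, v = (torch.randn(B, T, D) for _ in range(3))

# Standard softmax attention: each row of the score matrix is a distribution.
out_softmax = attention(q, k, v, lambda s: F.softmax(s, dim=-1))

# A common alternative baseline: elementwise ReLU scores, scaled by 1/T so the
# output magnitude stays comparable.
out_relu = attention(q, k, v, lambda s: F.relu(s) / s.shape[-1])

print(out_softmax.shape, out_relu.shape)
```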
RT @nrehiew_: Since this figure is going around, the sudden drop corresponds to a very standard learning rate decay and has nothing to do w….
0
8
0
Hype! New optimizer that prevents unstable attention logits, a problem that good ol' QK-norm couldn't handle with Muon at scale. Also, that loss curve is quite interesting 🤔. Excited to see what the paper includes when it comes out.
The Kimi team just trained a state-of-the-art open-source model (32B active parameters / 1T total) with 0 training instabilities, thanks to MuonClip. This is amazing.
1
0
3
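Since the MuonClip details weren't out yet, here's only a minimal sketch of the QK-norm baseline it's being compared against. L2-normalizing queries and keys is one common variant; real implementations often use RMSNorm with a learnable gain instead.

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale: float = 10.0):
    """Attention with QK-norm: normalize queries and keys per head so the
    logits are bounded cosine similarities times a fixed/learnable scale."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = scale * (q @ k.transpose(-2, -1))   # every entry lies in [-scale, scale]
    return F.softmax(logits, dim=-1) @ v

B, H, T, D = 1, 8, 64, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)
```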
Surprised this was never tried before. Basically LoRA, except the up projection is the transpose of the down projection. It halves the optimizer state size and apparently it's better than LoRA in terms of accuracy.
arxiv.org
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. LoRA augments the pre-trained weights of a model by adding the product of two...
0
1
0
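My reading of the description above, as a rough sketch: a single trainable matrix A with ΔW = A Aᵀ, so the up projection is literally the down projection transposed. The class name, init scale, and the square-weight restriction are assumptions for illustration; the paper's exact init and scaling may differ.

```python
import torch
import torch.nn as nn

class TiedLoRALinear(nn.Module):
    """LoRA-style adapter where the up projection is the transpose of the down
    projection: delta_W = A @ A.T (square base weight assumed here). Only A is
    trained, so adapter params and optimizer state are halved vs. standard LoRA."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 1.0):
        super().__init__()
        assert base.in_features == base.out_features, "sketch assumes a square weight"
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # freeze the pretrained weight
        # small init so the adapter starts close to zero (the paper may do this differently)
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.alpha = alpha

    def forward(self, x):
        # down-project with A, up-project with its transpose
        return self.base(x) + self.alpha * (x @ self.A) @ self.A.T

layer = TiedLoRALinear(nn.Linear(512, 512), rank=16)
print(layer(torch.randn(4, 512)).shape)                   # torch.Size([4, 512])
```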
IMO the future of LLMs is sliding (large) window softmax attn (short context, high expressiveness) mixed with RNNs/linear attn (long context, lower expressiveness). When two paradigms exist, something in the middle usually prevails.
huggingface.co
1
0
1
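A toy sketch of the hybrid being predicted: a stack that is mostly linear-attention/RNN layers, with an occasional sliding-window softmax attention layer for local expressiveness. The 3:1 ratio and window size are guesses, not taken from any particular model.

```python
import torch

def sliding_window_softmax_attn(q, k, v, window: int):
    """Causal softmax attention restricted to the last `window` keys:
    short context, but full softmax expressiveness at O(T * window) cost."""
    T = q.shape[0]
    i = torch.arange(T)
    dist = i[:, None] - i[None, :]                 # how far key j lies behind query i
    mask = (dist < 0) | (dist >= window)           # future keys, or keys outside the window
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

# A plausible layer schedule: cheap long-context mixers most of the time,
# sliding-window softmax attention every 4th layer.
layer_plan = ["linear_attn" if i % 4 != 3 else "sliding_window(w=512)" for i in range(12)]
print(layer_plan)

q, k, v = (torch.randn(256, 64) for _ in range(3))
print(sliding_window_softmax_attn(q, k, v, window=32).shape)
```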
Just to be that guy 🤓☝️: simplices have an orientation and include all lower-order terms (edges and vertices for a 2-simplex). Bodnar has a cool paper on this:
arxiv.org
Graph representation learning methods have mostly been limited to the modelling of node-wise interactions. Recently, there has been an increased interest in understanding how higher-order...
1
1
1
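Spelling out the point above for a 2-simplex:

```latex
% A 2-simplex [v_0, v_1, v_2] is ordered (oriented) and contains all of its faces.
% Its boundary is the alternating sum of its 1-simplices (edges):
\[
\partial_2 [v_0, v_1, v_2] = [v_1, v_2] - [v_0, v_2] + [v_0, v_1],
\]
% and each edge in turn has the vertices as its boundary, e.g.
\[
\partial_1 [v_0, v_1] = [v_1] - [v_0],
\]
% so that \(\partial_1 \circ \partial_2 = 0\). Swapping two vertices flips the orientation:
\[
[v_0, v_1, v_2] = -\,[v_1, v_0, v_2].
\]
```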
Hype! Simplicial attention is finally catching on and is actually computable. Title is fire, though I do think it is a little misleading.
arxiv.org
Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count...
1
0
2
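A toy version of what simplicial attention looks like: each query scores *pairs* of positions through a trilinear form and softmaxes over all pairs. The elementwise value combination and the naive O(T³) computation below are assumptions for illustration, not the paper's exact (computable) construction.

```python
import torch

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Toy 2-simplicial attention: trilinear scores over (query, key, key) triples,
    softmax over all (j, k) pairs for each query i."""
    # l[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d]
    logits = torch.einsum("id,jd,kd->ijk", q, k1, k2) / q.shape[-1] ** 0.5
    T = q.shape[0]
    attn = logits.reshape(T, -1).softmax(dim=-1).reshape(T, T, T)
    # combine the pair of values; the elementwise product is an assumption
    return torch.einsum("ijk,jd,kd->id", attn, v1, v2)

T, D = 32, 16
q, k1, k2, v1, v2 = (torch.randn(T, D) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)   # torch.Size([32, 16])
```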
Cool paper that explores mostly-full-FP4 training of an LLM. They run into noise issues late in training: as the gradients get small, the relative gradient noise gets too high, which stalls training and makes them switch to BF16 gradients for the final stretch.
arxiv.org
We demonstrate, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for weights, activations, and gradients...
0
0
0
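A rough numerical illustration of the failure mode above: fake-quantizing a tensor to an E2M1-style FP4 grid with a per-tensor scale and measuring the rounding noise. The grid and scaling scheme are simplified assumptions, not the paper's recipe.

```python
import torch

# Positive magnitudes representable in an E2M1-style FP4 format (up to a per-tensor scale).
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4(x: torch.Tensor) -> torch.Tensor:
    """Round each element to the nearest representable FP4 magnitude after per-tensor scaling."""
    scale = x.abs().max() / FP4_GRID.max() + 1e-12
    dist = ((x.abs() / scale).unsqueeze(-1) - FP4_GRID).abs()
    return x.sign() * FP4_GRID[dist.argmin(dim=-1)] * scale

g = torch.randn(100_000)                       # stand-in for a gradient tensor
rel_err = (fake_quant_fp4(g) - g).norm() / g.norm()
print(f"relative FP4 rounding noise: {rel_err:.2%}")   # sizable, with only 8 magnitude levels

# With noise like this injected into every gradient, the update direction degrades once
# the true gradient signal shrinks late in training, hence the fallback to BF16 grads.
```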
Log-linear attn is cool, but seems like too much of a hassle for a slight improvement. Rather than adding various features like in old RNNs, I do think higher-order linear attention is the way to go for getting close to softmax (something I'm also looking into)!
arxiv.org
The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. Its quadratic-compute and linear-memory complexity however remain significant...
0
0
2
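One possible reading of "higher-order linear attention" (my guess, not something from the linked paper): second-order, Taylor-style feature maps, so the linear-attention kernel better approximates exp(q·k) at the cost of a D²-sized recurrent state.

```python
import torch

def taylor_feature_map(x: torch.Tensor) -> torch.Tensor:
    """phi(x) = [1, x, vec(x x^T)/sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, a 2nd-order Taylor approx of exp(q.k)."""
    T, D = x.shape
    quad = (x.unsqueeze(-1) * x.unsqueeze(-2)).reshape(T, D * D) / 2 ** 0.5
    return torch.cat([torch.ones(T, 1), x, quad], dim=-1)

def causal_linear_attention(q, k, v, phi):
    """Causal linear attention with feature map phi: running state of size |phi| x D."""
    q, k = phi(q), phi(k)
    kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=0)   # running sum of phi(k_t) v_t^T
    z = torch.cumsum(k, dim=0)                                     # running normalizer
    out = torch.einsum("tf,tfd->td", q, kv)
    return out / (torch.einsum("tf,tf->t", q, z).unsqueeze(-1) + 1e-6)

T, D = 64, 16
q, k, v = (torch.randn(T, D) * D ** -0.5 for _ in range(3))
print(causal_linear_attention(q, k, v, taylor_feature_map).shape)   # state is (1 + D + D^2) x D
```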
Seems Flash Attn makes it easy to train on native image resolution with different sized images in a batch. Should be much easier, more efficient, and better to do than masking and/or bucketing.
arxiv.org
We introduce native-resolution image synthesis, a novel generative modeling paradigm that enables the synthesis of images at arbitrary resolutions and aspect ratios. This approach overcomes the...
1
0
1
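A sketch of the packing the post refers to: concatenate variable-resolution samples into one unpadded sequence and hand the kernel cumulative sequence lengths so attention stays within each image. The token counts are made up, and the flash_attn_varlen_func call is left in a comment since it assumes the flash-attn package is installed.

```python
from itertools import accumulate
import torch

# Three images at different native resolutions -> different patch-token counts.
token_counts = [32 * 32, 48 * 24, 16 * 16]                 # 1024, 1152, 256 patches
dim = 768

# Pack all samples into one long sequence with no padding.
packed = torch.cat([torch.randn(n, dim) for n in token_counts], dim=0)

# Cumulative sequence lengths mark where each sample starts/ends,
# so attention never crosses image boundaries.
cu_seqlens = torch.tensor([0, *accumulate(token_counts)], dtype=torch.int32)
max_seqlen = max(token_counts)
print(packed.shape, cu_seqlens)       # torch.Size([2432, 768]) tensor([0, 1024, 2176, 2432], ...)

# With flash-attn installed, q/k/v of shape (total_tokens, heads, head_dim) plus the
# cu_seqlens above go into its varlen kernel (shown for illustration only):
# from flash_attn import flash_attn_varlen_func
# out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen)
```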
I wanted to go over this model, but there was very little in the paper to talk about and it's not open-sourced :/. Could've been cool to show how the paradigm for speech is shifting away from feature engineering like the mel spectrogram to model go brrrrrr.
arxiv.org
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts...
0
1
1