Roy Frostig
@froystig
2K Followers · 448 Following · 3 Media · 124 Statuses
research scientist at @googledeepmind. co-author of JAX (https://t.co/sS9COjJPsx)
sfba
Joined April 2008
Curious how to write SOTA performance Blackwell matmul kernels using MGPU? We just published a short step-by-step tutorial: https://t.co/XRVX34juEz At each step, we show exactly what (small) changes are necessary to refine the kernel and the final kernel is just under 150 lines.
4 replies · 67 reposts · 418 likes
Llama 4 inference in pure JAX! Expert/tensor parallelism with int8 quantization. Contributions welcome!
2 replies · 15 reposts · 135 likes
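For a rough sense of what tensor parallelism with int8 quantization looks like at the JAX level, here is a minimal sketch (not code from the repo, and expert parallelism is omitted): it shards an int8 weight plus its scales over a hypothetical "model" mesh axis and dequantizes on the fly inside a jitted matmul. Shapes are illustrative and assume the feature dimension divides evenly across devices.

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative setup: one "model" axis spanning all local devices.
mesh = Mesh(np.array(jax.devices()), ("model",))

# Quantize a weight to int8 with a per-output-channel scale.
w = jax.random.normal(jax.random.PRNGKey(0), (1024, 4096), jnp.bfloat16)
scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127
w_int8 = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)

# Tensor parallelism: shard the weight (and its scales) over output features.
w_int8 = jax.device_put(w_int8, NamedSharding(mesh, P(None, "model")))
scale = jax.device_put(scale, NamedSharding(mesh, P(None, "model")))

@jax.jit
def project(x):
    # Dequantize on the fly; the result typically comes out sharded over "model".
    return x @ (w_int8.astype(jnp.bfloat16) * scale)

x = jnp.ones((8, 1024), jnp.bfloat16)
print(project(x).shape)  # (8, 4096)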
A nice and concise R1 inference jax:tpu port by @rdyro128523. Good for both reading and running. Watch the repo for more.
Deepseek R1 inference in pure JAX! Currently on TPU, with GPU and distilled models in-progress. Features MLA-style attention, expert/tensor parallelism & int8 quantization. Contributions welcome!
0 replies · 7 reposts · 37 likes
Training our most capable Gemini models relies heavily on our JAX software stack + Google's TPU hardware platforms. If you want to learn more, see this awesome book "How to Scale Your Model": https://t.co/fddEg1OkHN It was put together by my @GoogleDeepMind colleagues
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
23 replies · 165 reposts · 991 likes
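In the spirit of the book's "scaling isn't magic, it's math" pitch, here is a toy back-of-the-envelope check of whether a bf16 matmul is compute- or bandwidth-bound; the hardware numbers are placeholders rather than any particular TPU's specs.

# Arithmetic intensity of C = A @ B in bf16 (2 bytes per element).
M, K, N = 4096, 8192, 28672
flops = 2 * M * K * N                        # one multiply-add per (m, k, n)
bytes_moved = 2 * (M * K + K * N + M * N)    # read A and B, write C
intensity = flops / bytes_moved              # FLOPs per byte of HBM traffic

peak_flops = 9.2e14    # placeholder peak bf16 FLOP/s
hbm_bw = 1.2e12        # placeholder HBM bandwidth in bytes/s
ridge = peak_flops / hbm_bw                  # FLOPs/byte needed to stay compute-bound

print(f"intensity ~ {intensity:.0f} FLOPs/byte, ridge ~ {ridge:.0f}")
print("compute-bound" if intensity > ridge else "bandwidth-bound")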
Our online book on systems principles of LLM scaling is live. We hope that it helps you make the most of your computing resources. Enjoy!
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
0 replies · 9 reposts · 73 likes
We now have a guide to writing distributed communication on TPU using Pallas, written by @JustinFu769512! https://t.co/9TcqNizGV4 Overlapping comms + compute is a crucial performance optimization for large scale ML. Write your own custom overlapped kernels in Python!
4 replies · 48 reposts · 246 likes
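The guide itself is about writing that overlap explicitly as Pallas kernels. Purely for intuition, here is a minimal high-level JAX sketch (not the guide's approach) of the pattern being optimized: a ring collective matmul that multiplies against the B chunk currently in hand and ppermutes it onward each step, leaving overlap to the compiler. It assumes a single mesh axis named "x" and shapes that divide evenly across devices.

import functools
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

num_devices = jax.device_count()
mesh = Mesh(np.array(jax.devices()), ("x",))

@jax.jit
@functools.partial(shard_map, mesh=mesh,
                   in_specs=(P("x", None), P(None, "x")),  # A row-sharded, B column-sharded
                   out_specs=P("x", None))                 # C row-sharded
def collective_matmul(a_blk, b_blk):
    # Ring all-gather matmul: multiply the local rows of A by the B chunk
    # currently in hand, then pass that chunk to the neighbor. The compiler may
    # overlap the ppermute with the next step's matmul; a custom Pallas kernel
    # can make that overlap explicit.
    n_local = b_blk.shape[1]
    me = jax.lax.axis_index("x")
    out = jnp.zeros((a_blk.shape[0], n_local * num_devices), a_blk.dtype)

    def step(i, carry):
        out, b = carry
        src = (me + i) % num_devices  # which column block of B this chunk holds
        out = jax.lax.dynamic_update_slice(out, a_blk @ b, (0, src * n_local))
        b = jax.lax.ppermute(
            b, "x", [(j, (j - 1) % num_devices) for j in range(num_devices)])
        return out, b

    out, _ = jax.lax.fori_loop(0, num_devices, step, (out, b_blk))
    return out

a = jnp.ones((256, 512), jnp.float32)
b = jnp.ones((512, 256), jnp.float32)
print(jnp.allclose(collective_matmul(a, b), a @ b))  # True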
Modula x JAX = Modulax. @gallabytes is cracked and ported Modula into JAX in a few days. I haven't had a chance to test yet, but I'm really excited about this project. Tagging @froystig and @SingularMattrix
https://t.co/wFpjUH4LCC (1/3)
github.com · GallagherCommaJack/modulax
1 reply · 4 reposts · 31 likes
I've finally landed my first proper JAX feature since joining the team: a supported "foreign function interface", which makes it easier to call into external libraries from within JAX code. Check it out:
2 replies · 14 reposts · 98 likes
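For flavor, a rough sketch of the Python side of an FFI call (current JAX releases expose it under jax.ffi). The compiled handler and its registration are omitted, and the target name "my_rms_norm" is hypothetical, so this is illustrative rather than runnable as-is.

import numpy as np
import jax

# Assumes a compiled XLA FFI handler has been registered under the hypothetical
# name "my_rms_norm", e.g. via jax.ffi.register_ffi_target(...) (not shown).
def rms_norm(x, eps=1e-5):
    out_type = jax.ShapeDtypeStruct(x.shape, x.dtype)  # output matches input
    call = jax.ffi.ffi_call("my_rms_norm", out_type)   # build the callable
    return call(x, eps=np.float32(eps))                # attributes pass as keywords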
Finally got around to writing a guide for matrix multiplication on TPUs using Pallas. Check it out!
1 reply · 26 reposts · 157 likes
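A minimal blocked Pallas matmul in the spirit of that guide (not the tuned kernel it builds up to): each grid cell multiplies a row panel of x by a column panel of y to produce one output tile. It assumes the block sizes divide the matrix dimensions.

import functools
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def matmul_kernel(x_ref, y_ref, o_ref):
    # One program instance: multiply a (bm, K) panel by a (K, bn) panel.
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...])

@functools.partial(jax.jit, static_argnames=("bm", "bn"))
def matmul(x, y, *, bm=128, bn=128):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        grid=(m // bm, n // bn),
        in_specs=[
            pl.BlockSpec(block_shape=(bm, k), index_map=lambda i, j: (i, 0)),
            pl.BlockSpec(block_shape=(k, bn), index_map=lambda i, j: (0, j)),
        ],
        out_specs=pl.BlockSpec(block_shape=(bm, bn), index_map=lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((m, n), x.dtype),
    )(x, y)

x = jnp.ones((512, 512), jnp.float32)
y = jnp.ones((512, 512), jnp.float32)
print(matmul(x, y).shape)  # (512, 512)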
Many of you are excited about H100 attention, so it’s a good time to show you Mosaic GPU: a Python DSL for H100s. The attention example matches FA3 performance, while being only ~200 lines of Python: https://t.co/12ecz3LftV It's easy to install too! Latest JAX packages have it.
github.com · jax-ml/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
14 replies · 109 reposts · 665 likes
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: https://t.co/mas2uiMqj9
39 replies · 396 reposts · 2K likes
Built with JAX!
Introducing Gemini 1.0, our most capable and general AI model yet. Built natively to be multimodal, it’s the first step in our Gemini-era of models. Gemini is optimized in three sizes - Ultra, Pro, and Nano Gemini Ultra’s performance exceeds current state-of-the-art results on…
5 replies · 25 reposts · 284 likes
Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on @StanfordCRFM. Levanter is designed to be legible, scalable and reproducible.
6 replies · 85 reposts · 398 likes
There's a longer history in PL of thinking about linearity in programs, what it means, and what we can do with it (cf. linear types/logic). Hopefully distilling a big piece of AD down to linearizing stuff makes it easier to think about what programs we can differentiate, and how.
1 reply · 1 repost · 12 likes
To get there, we need to identify linearity *in programs*. For functions expressed in code, we want a notion of linearity that implies mathematical linearity, but that also allows a compiler to delineate and transpose things automatically.
1 reply · 0 reposts · 6 likes
Composing the two, we get reverse-mode AD as you know it today. The implementation is simpler, disentangling perturbation from reversal.
1 reply · 0 reposts · 7 likes
We turn these algebraic facts into algorithms: "linearization" amounts to extracting the linear computation from forward-mode. "Transposition" roughly means reversing that extracted program. These steps can be written separately, like compiler passes.
1 reply · 0 reposts · 5 likes
A key concept is *linearity*. Differentiation forms a linear map—the Jacobian. Forward-mode AD computes that map (aka the "JVP", for Jacobian-vector product). Reverse-mode computes its transpose ("VJP").
1 reply · 0 reposts · 7 likes
We've always done AD this way, and we wrote about it briefly before (https://t.co/2tBlsmmsiq). The new paper tries to go into more detail by working over a minimal programming language.
1 reply · 0 reposts · 11 likes
In JAX and Dex, we do automatic differentiation (AD) in a distinctive way: by "linearizing" and then "transposing" programs. We wrote up what this looks like in a model language: https://t.co/F2IESHAHHH with Alexey Radul, @apaszke, @SingularMattrix, @DougalMaclaurin
arxiv.org · Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented)...
1 reply · 64 reposts · 333 likes