Roy Frostig

@froystig

Followers: 2K · Following: 448 · Media: 3 · Statuses: 124

research scientist at @googledeepmind. co-author of JAX (https://t.co/sS9COjJPsx)

sfba
Joined April 2008
@apaszke
Adam Paszke
3 months
Curious how to write SOTA performance Blackwell matmul kernels using MGPU? We just published a short step-by-step tutorial: https://t.co/XRVX34juEz At each step, we show exactly what (small) changes are necessary to refine the kernel and the final kernel is just under 150 lines.
4
67
418
@rdyro128523
rdyro
9 months
Llama 4 inference in pure JAX! Expert/tensor parallelism with int8 quantization. Contributions welcome!
2
15
135
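Not code from the repo above, but for concreteness, a minimal JAX sketch of per-output-channel int8 weight quantization, one common way to implement the int8 path a port like this describes:

import jax
import jax.numpy as jnp

# Minimal sketch of per-output-channel int8 weight quantization (illustrative only).
def quantize_int8(w):
    # w: float weights of shape (in_features, out_features)
    scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127.0
    q = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)
    return q, scale

def int8_matmul(x, q, scale):
    # Dequantize on the fly: accumulate in float, then rescale per output channel.
    return (x @ q.astype(x.dtype)) * scale

w = jax.random.normal(jax.random.key(0), (512, 1024), dtype=jnp.float32)
x = jax.random.normal(jax.random.key(1), (8, 512), dtype=jnp.float32)
q, scale = quantize_int8(w)
print(jnp.max(jnp.abs(x @ w - int8_matmul(x, q, scale))))  # small quantization error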
@froystig
Roy Frostig
10 months
A nice and concise R1 inference jax:tpu port by @rdyro128523. Good for both reading and running. Watch the repo for more.
@rdyro128523
rdyro
10 months
Deepseek R1 inference in pure JAX! Currently on TPU, with GPU and distilled models in-progress. Features MLA-style attention, expert/tensor parallelism & int8 quantization. Contributions welcome!
0
7
37
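Again not the repo's code, just a minimal sketch of the tensor-parallelism idea in plain JAX: shard a weight matrix column-wise over a "tp" mesh axis with shard_map so each device computes its own slice of the output, with no collective needed for this layer.

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), ("tp",))

def layer(x, w):
    return x @ w  # runs per device on local shards

tp_layer = shard_map(
    layer, mesh=mesh,
    in_specs=(P(), P(None, "tp")),  # x replicated, w split over output columns
    out_specs=P(None, "tp"),        # output stays split over columns
)

x = jnp.ones((4, 256))
w = jnp.ones((256, 8 * len(jax.devices())))
print(tp_layer(x, w).shape)  # (4, 8 * num_devices)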
@JeffDean
Jeff Dean
11 months
Training our most capable Gemini models relies heavily on our JAX software stack + Google's TPU hardware platforms. If you want to learn more, see this awesome book "How to Scale Your Model": https://t.co/fddEg1OkHN It was put together by my @GoogleDeepMind colleagues
@jacobaustin132
Jacob Austin
11 months
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
23
165
991
@froystig
Roy Frostig
11 months
Our online book on systems principles of LLM scaling is live. We hope that it helps you make the most of your computing resources. Enjoy!
@jacobaustin132
Jacob Austin
11 months
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
0
9
73
@sharadvikram
Sharad Vikram
1 year
We now have a guide to writing distributed communication on TPU using Pallas, written by @JustinFu769512! https://t.co/9TcqNizGV4 Overlapping comms + compute is a crucial performance optimization for large scale ML. Write your own custom overlapped kernels in Python!
4
48
246
@jxbz
Jeremy Bernstein
1 year
Modula x JAX = Modulax. @gallabytes is cracked and ported Modula into JAX in a few days. I haven't had a chance to test yet, but I'm really excited about this project. Tagging @froystig and @SingularMattrix https://t.co/wFpjUH4LCC (1/3)
github.com
Contribute to GallagherCommaJack/modulax development by creating an account on GitHub.
1
4
31
@exoplaneteer
Dan F-M
1 year
I've finally landed my first proper JAX feature since joining the team: a supported "foreign function interface", which makes it easier to call into external libraries from within JAX code. Check it out:
2
14
98
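The FFI itself needs a compiled extension handler, so the runnable stand-in below uses jax.pure_callback to call host-side NumPy from inside jitted code; it illustrates the "call external code from JAX" idea, not the new FFI API.

import jax
import jax.numpy as jnp
import numpy as np

# Stand-in for the FFI: jax.pure_callback hands the operands to a host-side
# Python/NumPy function and returns the declared output back into JAX.
def host_cumsum(x):
    out_spec = jax.ShapeDtypeStruct(x.shape, x.dtype)
    return jax.pure_callback(lambda a: np.cumsum(a, axis=-1), out_spec, x)

x = jnp.arange(6.0)
print(jax.jit(host_cumsum)(x))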
@sharadvikram
Sharad Vikram
1 year
Finally got around to writing a guide for matrix multiplication on TPUs using Pallas. Check it out!
1
26
157
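For flavor, a minimal Pallas matmul sketch in the spirit of such a guide (not its code), assuming the current BlockSpec(block_shape, index_map) signature; it is meant for a TPU host (pallas_call also accepts interpret=True for trying it elsewhere).

import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

# Tile the output into (bm, bn) blocks; each grid step multiplies one
# row block of x with one column block of y.
def matmul_kernel(x_ref, y_ref, o_ref):
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...],
                         preferred_element_type=jnp.float32)

def matmul(x, y, bm=256, bn=256):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        grid=(m // bm, n // bn),
        in_specs=[
            pl.BlockSpec((bm, k), lambda i, j: (i, 0)),  # row block of x
            pl.BlockSpec((k, bn), lambda i, j: (0, j)),  # column block of y
        ],
        out_specs=pl.BlockSpec((bm, bn), lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((m, n), jnp.float32),
    )(x, y)

x = jnp.ones((512, 512), dtype=jnp.bfloat16)
y = jnp.ones((512, 512), dtype=jnp.bfloat16)
print(matmul(x, y).shape)  # (512, 512)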
@apaszke
Adam Paszke
1 year
Many of you are excited about H100 attention, so it’s a good time to show you Mosaic GPU: a Python DSL for H100s. The attention example matches FA3 performance, while being only ~200 lines of Python: https://t.co/12ecz3LftV It's easy to install too! Latest JAX packages have it.
github.com
Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more - jax-ml/jax
14
109
665
@_ddjohnson
Daniel Johnson
2 years
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: https://t.co/mas2uiMqj9
39
396
2K
@sharadvikram
Sharad Vikram
2 years
Built with JAX!
@sundarpichai
Sundar Pichai
2 years
Introducing Gemini 1.0, our most capable and general AI model yet. Built natively to be multimodal, it’s the first step in our Gemini-era of models. Gemini is optimized in three sizes - Ultra, Pro, and Nano Gemini Ultra’s performance exceeds current state-of-the-art results on
5
25
284
@dlwh
David Hall
3 years
Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on @StanfordCRFM. Levanter is designed to be legible, scalable and reproducible.
6
85
398
@froystig
Roy Frostig
4 years
There's a longer history in PL of thinking about linearity in programs, what it means, and what we can do with it (cf. linear types/logic). Hopefully distilling a big piece of AD down to linearizing stuff makes it easier to think about what programs we can differentiate, and how.
1
1
12
@froystig
Roy Frostig
4 years
To get there, we need to identify linearity *in programs*. For functions expressed in code, we want a notion of linearity that implies mathematical linearity, but that also allows a compiler to delineate and transpose things automatically.
1
0
6
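A small JAX illustration of that notion of linearity: the function returned by jax.linearize is linear in its tangent argument, which can be checked numerically.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x

x = jnp.arange(1.0, 4.0)
_, f_lin = jax.linearize(f, x)          # f_lin is linear in its tangent argument
v, w = jnp.ones_like(x), jnp.arange(3.0)
print(jnp.allclose(f_lin(2.0 * v), 2.0 * f_lin(v)))     # homogeneity
print(jnp.allclose(f_lin(v + w), f_lin(v) + f_lin(w)))  # additivity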
@froystig
Roy Frostig
4 years
Composing the two, we get reverse-mode AD as you know it today. The implementation is simpler, disentangling perturbation from reversal.
1
0
7
@froystig
Roy Frostig
4 years
We turn these algebraic facts into algorithms: "linearization" amounts to extracting the linear computation from forward-mode. "Transposition" roughly means reversing that extracted program. These steps can be written separately, like compiler passes.
1
0
5
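In JAX's public API, the two steps look roughly like this sketch: jax.linearize extracts the linear (JVP) computation at a point, jax.linear_transpose reverses it, and composing them reproduces the familiar VJP.

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x

x = jnp.arange(1.0, 4.0)
y, f_lin = jax.linearize(f, x)            # step 1: linearize at x
f_trans = jax.linear_transpose(f_lin, x)  # step 2: transpose the linear part

ct = jnp.ones_like(y)                     # an output cotangent
(vjp_via_transpose,) = f_trans(ct)
_, vjp_fn = jax.vjp(f, x)
(vjp_direct,) = vjp_fn(ct)
print(jnp.allclose(vjp_via_transpose, vjp_direct))  # True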
@froystig
Roy Frostig
4 years
A key concept is *linearity*. Differentiation forms a linear map—the Jacobian. Forward-mode AD computes that map (aka the "JVP", for Jacobian-vector product). Reverse-mode computes its transpose ("VJP").
1
0
7
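Those two maps, as exposed in JAX:

import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x

x = jnp.arange(1.0, 4.0)
v = jnp.ones_like(x)

y, jvp_out = jax.jvp(f, (x,), (v,))    # Jacobian-vector product: J(x) @ v
y2, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(jnp.ones_like(y))  # vector-Jacobian product: u^T @ J(x)

print(jvp_out, vjp_out)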
@froystig
Roy Frostig
4 years
We've always done AD this way, and we wrote about it briefly before ( https://t.co/2tBlsmmsiq). The new paper tries to go into more detail by working over a minimal programming language.
1
0
11
@froystig
Roy Frostig
4 years
In JAX and Dex, we do automatic differentiation (AD) in a distinctive way: by "linearizing" and then "transposing" programs. We wrote up what this looks like in a model language: https://t.co/F2IESHAHHH with Alexey Radul, @apaszke, @SingularMattrix, @DougalMaclaurin
arxiv.org
Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented)...
1
64
333