Roy Frostig
@froystig
2K Followers · 448 Following · 3 Media · 124 Statuses
research scientist at @googledeepmind. co-author of JAX (https://t.co/sS9COjJPsx)
sfba
Joined April 2008
Curious how to write SOTA performance Blackwell matmul kernels using MGPU? We just published a short step-by-step tutorial: https://t.co/XRVX34juEz At each step, we show exactly what (small) changes are necessary to refine the kernel and the final kernel is just under 150 lines.
4 replies · 67 reposts · 418 likes
Llama 4 inference in pure JAX! Expert/tensor parallelism with int8 quantization. Contributions welcome!
2 replies · 15 reposts · 135 likes
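For a rough sense of what tensor parallelism with int8 quantization looks like at the JAX level, here is a minimal sketch (not code from the repo, and expert parallelism is omitted): it shards an int8 weight plus its scales over a hypothetical "model" mesh axis and dequantizes on the fly inside a jitted matmul. Shapes are illustrative and assume the feature dimension divides evenly across devices.

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative setup: one "model" axis spanning all local devices.
mesh = Mesh(np.array(jax.devices()), ("model",))

# Quantize a weight to int8 with a per-output-channel scale.
w = jax.random.normal(jax.random.PRNGKey(0), (1024, 4096), jnp.bfloat16)
scale = jnp.max(jnp.abs(w), axis=0, keepdims=True) / 127
w_int8 = jnp.clip(jnp.round(w / scale), -127, 127).astype(jnp.int8)

# Tensor parallelism: shard the weight (and its scales) over output features.
w_int8 = jax.device_put(w_int8, NamedSharding(mesh, P(None, "model")))
scale = jax.device_put(scale, NamedSharding(mesh, P(None, "model")))

@jax.jit
def project(x):
    # Dequantize on the fly; the result typically comes out sharded over "model".
    return x @ (w_int8.astype(jnp.bfloat16) * scale)

x = jnp.ones((8, 1024), jnp.bfloat16)
print(project(x).shape)  # (8, 4096)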
A nice and concise R1 inference jax:tpu port by @rdyro128523. Good for both reading and running. Watch the repo for more.
Deepseek R1 inference in pure JAX! Currently on TPU, with GPU and distilled models in-progress. Features MLA-style attention, expert/tensor parallelism & int8 quantization. Contributions welcome!
0 replies · 7 reposts · 37 likes
Training our most capable Gemini models relies heavily on our JAX software stack + Google's TPU hardware platforms. If you want to learn more, see this awesome book "How to Scale Your Model": https://t.co/fddEg1OkHN It was put together by my @GoogleDeepMind colleagues
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
23 replies · 165 reposts · 991 likes
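In the spirit of the book's "scaling isn't magic, it's math" pitch, here is a toy back-of-the-envelope check of whether a bf16 matmul is compute- or bandwidth-bound; the hardware numbers are placeholders rather than any particular TPU's specs.

# Arithmetic intensity of C = A @ B in bf16 (2 bytes per element).
M, K, N = 4096, 8192, 28672
flops = 2 * M * K * N                        # one multiply-add per (m, k, n)
bytes_moved = 2 * (M * K + K * N + M * N)    # read A and B, write C
intensity = flops / bytes_moved              # FLOPs per byte of HBM traffic

peak_flops = 9.2e14    # placeholder peak bf16 FLOP/s
hbm_bw = 1.2e12        # placeholder HBM bandwidth in bytes/s
ridge = peak_flops / hbm_bw                  # FLOPs/byte needed to stay compute-bound

print(f"intensity ~ {intensity:.0f} FLOPs/byte, ridge ~ {ridge:.0f}")
print("compute-bound" if intensity > ridge else "bandwidth-bound")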
Our online book on systems principles of LLM scaling is live. We hope that it helps you make the most of your computing resources. Enjoy!
Making LLMs run efficiently can feel scary, but scaling isn’t magic, it’s math! We wanted to demystify the “systems view” of LLMs and wrote a little textbook called “How To Scale Your Model” which we’re releasing today. 1/n
0 replies · 9 reposts · 73 likes
We now have a guide to writing distributed communication on TPU using Pallas, written by @JustinFu769512! https://t.co/9TcqNizGV4 Overlapping comms + compute is a crucial performance optimization for large scale ML. Write your own custom overlapped kernels in Python!
4 replies · 48 reposts · 246 likes
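The guide itself is about writing that overlap explicitly as Pallas kernels. Purely for intuition, here is a minimal high-level JAX sketch (not the guide's approach) of the pattern being optimized: a ring collective matmul that multiplies against the B chunk currently in hand and ppermutes it onward each step, leaving overlap to the compiler. It assumes a single mesh axis named "x" and shapes that divide evenly across devices.

import functools
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

num_devices = jax.device_count()
mesh = Mesh(np.array(jax.devices()), ("x",))

@jax.jit
@functools.partial(shard_map, mesh=mesh,
                   in_specs=(P("x", None), P(None, "x")),  # A row-sharded, B column-sharded
                   out_specs=P("x", None))                 # C row-sharded
def collective_matmul(a_blk, b_blk):
    # Ring all-gather matmul: multiply the local rows of A by the B chunk
    # currently in hand, then pass that chunk to the neighbor. The compiler may
    # overlap the ppermute with the next step's matmul; a custom Pallas kernel
    # can make that overlap explicit.
    n_local = b_blk.shape[1]
    me = jax.lax.axis_index("x")
    out = jnp.zeros((a_blk.shape[0], n_local * num_devices), a_blk.dtype)

    def step(i, carry):
        out, b = carry
        src = (me + i) % num_devices  # which column block of B this chunk holds
        out = jax.lax.dynamic_update_slice(out, a_blk @ b, (0, src * n_local))
        b = jax.lax.ppermute(
            b, "x", [(j, (j - 1) % num_devices) for j in range(num_devices)])
        return out, b

    out, _ = jax.lax.fori_loop(0, num_devices, step, (out, b_blk))
    return out

a = jnp.ones((256, 512), jnp.float32)
b = jnp.ones((512, 256), jnp.float32)
print(jnp.allclose(collective_matmul(a, b), a @ b))  # True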
Modula x JAX = Modulax. @gallabytes is cracked and ported Modula into JAX in a few days. I haven't had a chance to test yet, but I'm really excited about this project. Tagging @froystig and @SingularMattrix
https://t.co/wFpjUH4LCC (1/3)
github.com · GallagherCommaJack/modulax
1 reply · 4 reposts · 31 likes
I've finally landed my first proper JAX feature since joining the team: a supported "foreign function interface", which makes it easier to call into external libraries from within JAX code. Check it out:
2 replies · 14 reposts · 98 likes
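For flavor, a rough sketch of the Python side of an FFI call (current JAX releases expose it under jax.ffi). The compiled handler and its registration are omitted, and the target name "my_rms_norm" is hypothetical, so this is illustrative rather than runnable as-is.

import numpy as np
import jax

# Assumes a compiled XLA FFI handler has been registered under the hypothetical
# name "my_rms_norm", e.g. via jax.ffi.register_ffi_target(...) (not shown).
def rms_norm(x, eps=1e-5):
    out_type = jax.ShapeDtypeStruct(x.shape, x.dtype)  # output matches input
    call = jax.ffi.ffi_call("my_rms_norm", out_type)   # build the callable
    return call(x, eps=np.float32(eps))                # attributes pass as keywords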
Finally got around to writing a guide for matrix multiplication on TPUs using Pallas. Check it out!
1 reply · 26 reposts · 157 likes
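A minimal blocked Pallas matmul in the spirit of that guide (not the tuned kernel it builds up to): each grid cell multiplies a row panel of x by a column panel of y to produce one output tile. It assumes the block sizes divide the matrix dimensions.

import functools
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def matmul_kernel(x_ref, y_ref, o_ref):
    # One program instance: multiply a (bm, K) panel by a (K, bn) panel.
    o_ref[...] = jnp.dot(x_ref[...], y_ref[...])

@functools.partial(jax.jit, static_argnames=("bm", "bn"))
def matmul(x, y, *, bm=128, bn=128):
    m, k = x.shape
    _, n = y.shape
    return pl.pallas_call(
        matmul_kernel,
        grid=(m // bm, n // bn),
        in_specs=[
            pl.BlockSpec(block_shape=(bm, k), index_map=lambda i, j: (i, 0)),
            pl.BlockSpec(block_shape=(k, bn), index_map=lambda i, j: (0, j)),
        ],
        out_specs=pl.BlockSpec(block_shape=(bm, bn), index_map=lambda i, j: (i, j)),
        out_shape=jax.ShapeDtypeStruct((m, n), x.dtype),
    )(x, y)

x = jnp.ones((512, 512), jnp.float32)
y = jnp.ones((512, 512), jnp.float32)
print(matmul(x, y).shape)  # (512, 512)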
Many of you are excited about H100 attention, so it’s a good time to show you Mosaic GPU: a Python DSL for H100s. The attention example matches FA3 performance, while being only ~200 lines of Python: https://t.co/12ecz3LftV It's easy to install too! Latest JAX packages have it.
github.com · jax-ml/jax: Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
14 replies · 109 reposts · 665 likes
Excited to share Penzai, a JAX research toolkit from @GoogleDeepMind for building, editing, and visualizing neural networks! Penzai makes it easy to see model internals and lets you inject custom logic anywhere. Check it out on GitHub: https://t.co/mas2uiMqj9
39 replies · 396 reposts · 2K likes
Built with JAX!
Introducing Gemini 1.0, our most capable and general AI model yet. Built natively to be multimodal, it’s the first step in our Gemini-era of models. Gemini is optimized in three sizes - Ultra, Pro, and Nano Gemini Ultra’s performance exceeds current state-of-the-art results on…
5 replies · 25 reposts · 284 likes
Today, I’m excited to announce the release of Levanter 1.0, our new JAX-based framework for training foundation models, which we’ve been working on @StanfordCRFM. Levanter is designed to be legible, scalable and reproducible.
6 replies · 85 reposts · 398 likes
There's a longer history in PL of thinking about linearity in programs, what it means, and what we can do with it (cf. linear types/logic). Hopefully distilling a big piece of AD down to linearizing stuff makes it easier to think about what programs we can differentiate, and how.
1 reply · 1 repost · 12 likes
To get there, we need to identify linearity *in programs*. For functions expressed in code, we want a notion of linearity that implies mathematical linearity, but that also allows a compiler to delineate and transpose things automatically.
1 reply · 0 reposts · 6 likes
Composing the two, we get reverse-mode AD as you know it today. The implementation is simpler, disentangling perturbation from reversal.
1 reply · 0 reposts · 7 likes
We turn these algebraic facts into algorithms: "linearization" amounts to extracting the linear computation from forward-mode. "Transposition" roughly means reversing that extracted program. These steps can be written separately, like compiler passes.
1 reply · 0 reposts · 5 likes
A key concept is *linearity*. Differentiation forms a linear map—the Jacobian. Forward-mode AD computes that map (aka the "JVP", for Jacobian-vector product). Reverse-mode computes its transpose ("VJP").
1 reply · 0 reposts · 7 likes
We've always done AD this way, and we wrote about it briefly before (https://t.co/2tBlsmmsiq). The new paper tries to go into more detail by working over a minimal programming language.
1 reply · 0 reposts · 11 likes
In JAX and Dex, we do automatic differentiation (AD) in a distinctive way: by "linearizing" and then "transposing" programs. We wrote up what this looks like in a model language: https://t.co/F2IESHAHHH with Alexey Radul, @apaszke, @SingularMattrix, @DougalMaclaurin
arxiv.org · Automatic differentiation (AD) is conventionally understood as a family of distinct algorithms, rooted in two "modes" -- forward and reverse -- which are typically presented (and implemented)...
1 reply · 64 reposts · 333 likes