Edward Z. Yang
@ezyang
14K Followers · 1K Following · 128 Media · 9K Statuses

I work on PyTorch at Meta. Chatty alt at @difficultyang.

Edison, NJ
Joined May 2008
@ezyang
Edward Z. Yang
3 months
I finally sat down and wrote up a post-mortem for vibe coding ScubaDuck. It's aimed at those of you who have never tried vibe coding (in its original sense: AI coding without reviewing the code the AI generated).
[image]
7 replies · 12 reposts · 165 likes
@ezyang
Edward Z. Yang
19 hours
On vacation, vibe coding a program that uses XLA to generate a redistribute plan from one shard placement to another, and then reinterprets it with jax.lax:
github.com · ezyang/xla-redist-ref: Extract redistribution plans from XLA.
2 replies · 2 reposts · 62 likes
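[For flavor, a minimal sketch of one way to ask XLA for such a plan — an illustration under assumptions, not necessarily what ezyang/xla-redist-ref actually does: lower a jitted resharding function and read the collectives out of the compiled HLO.]

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical setup: assumes >= 4 devices; axis name and placements invented.
mesh = Mesh(np.asarray(jax.devices()[:4]), ('i',))
src = NamedSharding(mesh, P('i', None))   # row-sharded
dst = NamedSharding(mesh, P(None, 'i'))   # column-sharded

x = jax.device_put(jnp.arange(16.0).reshape(4, 4), src)

@jax.jit
def reshard(a):
    # Pin the output placement; XLA then plans the redistribution itself.
    return jax.lax.with_sharding_constraint(a, dst)

# The compiled HLO text contains whatever collectives XLA chose for the move.
print(reshard.lower(x).compile().as_text())
```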
@ezyang
Edward Z. Yang
3 days
Opus nails it. It understands under what circumstances blocks can be adjacent (they must have been split from a larger block) and assembles a correct argument for why it's not possible.
[image]
0 replies · 2 reposts · 26 likes
@ezyang
Edward Z. Yang
3 days
Sonnet does better. It gives the right answer, but the reasoning is incoherent: you can see it realize midway through that it doesn't know why the merge couldn't combine blocks from different allocation streams, and it makes up a reason that is close to the truth.
[image]
1 reply · 0 reposts · 8 likes
@ezyang
Edward Z. Yang
3 days
Codex finds the relevant code but engages only superficially with the code site and gives the wrong answer.
[image]
1 reply · 0 reposts · 6 likes
@ezyang
Edward Z. Yang
3 days
Here is a cool code-understanding prompt ngimel shared with me: "Look at CUDACachingAllocator.cpp. Sometimes, after a block is freed, `try_merge_blocks` is called. Can `try_merge_blocks` merge blocks that were allocated on different streams?"
1 reply · 2 reposts · 89 likes
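[For readers without the PyTorch source at hand, a toy Python model of the invariant the correct answer hinges on — a sketch of the argument, not the actual CUDACachingAllocator code:]

```python
class Block:
    """Toy free-list block; `stream` is fixed at allocation time."""
    def __init__(self, addr, size, stream, prev=None, nxt=None):
        self.addr, self.size, self.stream = addr, size, stream
        self.prev, self.next = prev, nxt

def split(block, size):
    # The ONLY way two blocks become adjacent is by splitting a larger
    # one, and both halves inherit the original block's stream.
    rest = Block(block.addr + size, block.size - size, block.stream,
                 prev=block, nxt=block.next)
    block.next, block.size = rest, size
    return block, rest

def try_merge(dst, src):
    # Adjacent blocks always came from the same split chain, so their
    # streams always match: the cross-stream merge case is unreachable.
    assert dst.next is src and dst.stream == src.stream
    dst.size += src.size
    dst.next = src.next
    return dst

whole = Block(addr=0, size=1024, stream=7)
a, b = split(whole, 256)   # adjacent, and necessarily both on stream 7
try_merge(a, b)            # fine; no cross-stream pair can ever be adjacent
```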
@ezyang
Edward Z. Yang
7 days
The JAX all_to_all docs told me that all_to_all is just transposing a local axis and a device axis and this symbolism has taken over my brain lol.
0 replies · 0 reposts · 22 likes
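[A concrete version of that picture, sketched in shard_map — assumes 4 devices, with shapes chosen so the transpose is visible:]

```python
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.asarray(jax.devices()[:4]), ('i',))

@partial(jax.shard_map, mesh=mesh, in_specs=P('i', None), out_specs=P(None, 'i'))
def rows_to_cols(x):
    # Local shard is (1, 4); all_to_all "transposes" device axis 'i' with
    # local axis 1, leaving a (4, 1) shard on each device.
    return jax.lax.all_to_all(x, 'i', split_axis=1, concat_axis=0, tiled=True)

x = jax.device_put(jnp.arange(16.0).reshape(4, 4),
                   NamedSharding(mesh, P('i', None)))
y = rows_to_cols(x)   # same global array, now sharded along columns
```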
@ezyang
Edward Z. Yang
7 days
RT @SingularMattrix: @ezyang Slightly streamlined version (did you have some reason for doing explicit device_gets?): …
gist.github.com
0 replies · 1 repost · 0 likes
@ezyang
Edward Z. Yang
8 days
None of the LLMs can solve it. Can you?
gist.github.com
4 replies · 1 repost · 14 likes
@ezyang
Edward Z. Yang
8 days
BTW, both Opus and GPT-5 get it wrong.
0 replies · 1 repost · 3 likes
@ezyang
Edward Z. Yang
8 days
I think this puzzle encapsulates my current all_to_all confusion. Everyone tells me this is possible with only one comm, and sorry, I just don't see it.
1 reply · 0 reposts · 4 likes
@ezyang
Edward Z. Yang
8 days
Given m = Mesh(devices, ('i', 'j')) and x = jax.device_put(jnp.arange(4 * 4).reshape(4, 4), NamedSharding(m, P('i', 'j'))), write f s.t. jax.shard_map(f, mesh=m, in_specs=P('i', 'j'), out_specs=P(None, ('i', 'j'))) is the identity function.
3 replies · 1 repost · 22 likes
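[Spelled out as runnable code — assumes 4 devices arranged 2x2, with f left unimplemented so as not to spoil it:]

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

m = Mesh(np.asarray(jax.devices()[:4]).reshape(2, 2), ('i', 'j'))
x = jax.device_put(jnp.arange(4 * 4).reshape(4, 4),
                   NamedSharding(m, P('i', 'j')))

def f(shard):
    # shard is the (2, 2) local block; the output must be a (4, 1) local
    # block, since out_specs shards axis 1 over the flattened ('i', 'j')
    # axis of all 4 devices. The claim is that one collective suffices.
    raise NotImplementedError

g = jax.shard_map(f, mesh=m, in_specs=P('i', 'j'),
                  out_specs=P(None, ('i', 'j')))
# Goal: jnp.array_equal(g(x), x) for the f you write.
```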
@ezyang
Edward Z. Yang
8 days
I've retitled it State of torch.compile *for training*, because we specifically focus on this aspect of using torch.compile in the post.
1 reply · 0 reposts · 31 likes
@ezyang
Edward Z. Yang
8 days
2 replies · 8 reposts · 60 likes
@ezyang
Edward Z. Yang
8 days
State of torch.compile, August 2025.
[image]
13 replies · 68 reposts · 759 likes
@ezyang
Edward Z. Yang
9 days
Peter Hawkins points out to me that this is controlled by xla_allow_excess_precision, which JAX enables by default. So indeed, torch.compile and JAX/XLA both do the same thing by default.
[Quoted tweet]
@ezyang
Edward Z. Yang
11 days
Hey TL, hope you can answer this for me: torch.compile does a thing where if it fuses several bfloat16 operations together, it will do the internal compute in float32 without wasting cycles clamping the intermediates to bfloat16. Does JAX do this by default?
1 reply · 0 reposts · 22 likes
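[A hedged sketch of how you could check this yourself: xla_allow_excess_precision (flag spelling per the thread) is an XLA debug option, so flipping it via XLA_FLAGS before JAX initializes is the usual route. Compare the HLO with and without it to see whether the intermediate stays in f32.]

```python
import os
# Must be set before JAX initializes its backends. The flag defaults to
# true; set it to false to see intermediates clamped back to bfloat16.
os.environ["XLA_FLAGS"] = "--xla_allow_excess_precision=false"

import jax
import jax.numpy as jnp

def f(x):
    return jnp.exp(jnp.sin(x))  # two fusable bfloat16 ops

x = jnp.ones((1024,), dtype=jnp.bfloat16)
# Inspect the fused HLO for the dtype of the intermediate between sin and exp.
print(jax.jit(f).lower(x).compile().as_text())
```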
@ezyang
Edward Z. Yang
9 days
We are writing an improved printer for device mesh and we need a way to summarize device ranges. Please vote.
1 reply · 0 reposts · 4 likes
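[In the spirit of the poll, a hypothetical sketch of the kind of summarization being voted on — the name and output format here are invented, not the actual PyTorch printer:]

```python
def summarize_ranges(ids):
    """Collapse device ids into contiguous ranges, e.g. [0,1,2,3,8,9] -> '0-3, 8-9'."""
    ids = sorted(ids)
    out, start, prev = [], ids[0], ids[0]
    for d in ids[1:]:
        if d == prev + 1:       # extend the current contiguous run
            prev = d
            continue
        out.append(f"{start}-{prev}" if start != prev else f"{start}")
        start = prev = d        # begin a new run
    out.append(f"{start}-{prev}" if start != prev else f"{start}")
    return ", ".join(out)

assert summarize_ranges([0, 1, 2, 3, 8, 9]) == "0-3, 8-9"
assert summarize_ranges([5]) == "5"
```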
@ezyang
Edward Z. Yang
10 days
RT @giffmana: @ezyang @tenderizzation @gaunernst Yeah back to the point about big_vision, on TPUs matmuls have always been done on the MXU …
0 replies · 1 repost · 0 likes
@ezyang
Edward Z. Yang
11 days
Hey TL, hope you can answer this for me: torch.compile does a thing where if it fuses several bfloat16 operations together, it will do the internal compute in float32 without wasting cycles clamping the intermediates to bfloat16. Does JAX do this by default?
2 replies · 0 reposts · 28 likes
@ezyang
Edward Z. Yang
11 days
RT @GrantSlatton: now that i can use codex with my chatgpt sub, i'm gonna cancel claude max for a month and see if i can get by with gpt5 …
0 replies · 1 repost · 0 likes
@ezyang
Edward Z. Yang
11 days
RT @OpenAIDevs: We’re also releasing v0.16 of the Codex CLI today.
- GPT-5 is now the default model.
- Use with your ChatGPT plan.
- A new, …
0 replies · 365 reposts · 0 likes