Edward Z. Yang
@ezyang
14K Followers · 1K Following · 128 Media · 9K Statuses

I work on PyTorch at Meta. Chatty alt at @difficultyang.

Edison, NJ
Joined May 2008
@ezyang
Edward Z. Yang
3 months
I finally sat down and wrote up a post-mortem for vibe coding ScubaDuck. It's aimed at those of you who have never tried vibe coding (in its original sense: AI coding without reviewing the code the AI generated).
[image]
7 replies · 12 reposts · 165 likes
@ezyang
Edward Z. Yang
19 hours
On vacation, vibe coding a program that uses XLA to generate a redistribute plan from one shard placement to another, and then reinterprets it with jax.lax:
github.com · ezyang/xla-redist-ref: Extract redistribution plans from XLA.
2 replies · 2 reposts · 62 likes
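[For flavor, a minimal sketch of one way to ask XLA for such a plan — an illustration under assumptions, not necessarily what ezyang/xla-redist-ref actually does: lower a jitted resharding function and read the collectives out of the compiled HLO.]

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical setup: assumes >= 4 devices; axis name and placements invented.
mesh = Mesh(np.asarray(jax.devices()[:4]), ('i',))
src = NamedSharding(mesh, P('i', None))   # row-sharded
dst = NamedSharding(mesh, P(None, 'i'))   # column-sharded

x = jax.device_put(jnp.arange(16.0).reshape(4, 4), src)

@jax.jit
def reshard(a):
    # Pin the output placement; XLA then plans the redistribution itself.
    return jax.lax.with_sharding_constraint(a, dst)

# The compiled HLO text contains whatever collectives XLA chose for the move.
print(reshard.lower(x).compile().as_text())
```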
@ezyang
Edward Z. Yang
3 days
Opus nails it. It understands under what circumstances blocks can be adjacent (they must have been split from a larger block) and assembles a correct argument for why it's not possible.
[image]
0 replies · 2 reposts · 26 likes
@ezyang
Edward Z. Yang
3 days
Sonnet does better. It gives the right answer, but the reasoning is incoherent: you can see it realize midway through that it doesn't know why the merge couldn't combine blocks from different allocation streams, and it makes up a reason that is close to the truth.
[image]
1 reply · 0 reposts · 8 likes
@ezyang
Edward Z. Yang
3 days
Codex finds the relevant code but engages only superficially with the code site and gives the wrong answer.
[image]
1 reply · 0 reposts · 6 likes
@ezyang
Edward Z. Yang
3 days
Here is a cool code-understanding prompt ngimel shared with me: "Look at CUDACachingAllocator.cpp. Sometimes, after a block is freed, `try_merge_blocks` is called. Can `try_merge_blocks` merge blocks that were allocated on different streams?"
1 reply · 2 reposts · 89 likes
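[For readers without the PyTorch source at hand, a toy Python model of the invariant the correct answer hinges on — a sketch of the argument, not the actual CUDACachingAllocator code:]

```python
class Block:
    """Toy free-list block; `stream` is fixed at allocation time."""
    def __init__(self, addr, size, stream, prev=None, nxt=None):
        self.addr, self.size, self.stream = addr, size, stream
        self.prev, self.next = prev, nxt

def split(block, size):
    # The ONLY way two blocks become adjacent is by splitting a larger
    # one, and both halves inherit the original block's stream.
    rest = Block(block.addr + size, block.size - size, block.stream,
                 prev=block, nxt=block.next)
    block.next, block.size = rest, size
    return block, rest

def try_merge(dst, src):
    # Adjacent blocks always came from the same split chain, so their
    # streams always match: the cross-stream merge case is unreachable.
    assert dst.next is src and dst.stream == src.stream
    dst.size += src.size
    dst.next = src.next
    return dst

whole = Block(addr=0, size=1024, stream=7)
a, b = split(whole, 256)   # adjacent, and necessarily both on stream 7
try_merge(a, b)            # fine; no cross-stream pair can ever be adjacent
```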
@ezyang
Edward Z. Yang
7 days
The JAX all_to_all docs told me that all_to_all is just transposing a local axis and a device axis and this symbolism has taken over my brain lol.
0 replies · 0 reposts · 22 likes
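[A concrete version of that picture, sketched in shard_map — assumes 4 devices, with shapes chosen so the transpose is visible:]

```python
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

mesh = Mesh(np.asarray(jax.devices()[:4]), ('i',))

@partial(jax.shard_map, mesh=mesh, in_specs=P('i', None), out_specs=P(None, 'i'))
def rows_to_cols(x):
    # Local shard is (1, 4); all_to_all "transposes" device axis 'i' with
    # local axis 1, leaving a (4, 1) shard on each device.
    return jax.lax.all_to_all(x, 'i', split_axis=1, concat_axis=0, tiled=True)

x = jax.device_put(jnp.arange(16.0).reshape(4, 4),
                   NamedSharding(mesh, P('i', None)))
y = rows_to_cols(x)   # same global array, now sharded along columns
```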
@ezyang
Edward Z. Yang
7 days
RT @SingularMattrix: @ezyang Slightly streamlined version (did you have some reason for doing explicit device_gets?): …
gist.github.com
0 replies · 1 repost · 0 likes
@ezyang
Edward Z. Yang
8 days
None of the LLMs can solve it. Can you?
gist.github.com
4 replies · 1 repost · 14 likes
@ezyang
Edward Z. Yang
8 days
BTW, both Opus and GPT-5 get it wrong.
0 replies · 1 repost · 3 likes
@ezyang
Edward Z. Yang
8 days
I think this puzzle encapsulates my current all_to_all confusion. Everyone tells me this is possible with only one comm, and sorry, I just don't see it.
1 reply · 0 reposts · 4 likes
@ezyang
Edward Z. Yang
8 days
Given m = Mesh(devices, ('i', 'j')) and x = jax.device_put(jnp.arange(4 * 4).reshape(4, 4), NamedSharding(m, P('i', 'j'))), write f s.t. jax.shard_map(f, mesh=m, in_specs=P('i', 'j'), out_specs=P(None, ('i', 'j'))) is the identity function.
3 replies · 1 repost · 22 likes
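[Spelled out as runnable code — assumes 4 devices arranged 2x2, with f left unimplemented so as not to spoil it:]

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

m = Mesh(np.asarray(jax.devices()[:4]).reshape(2, 2), ('i', 'j'))
x = jax.device_put(jnp.arange(4 * 4).reshape(4, 4),
                   NamedSharding(m, P('i', 'j')))

def f(shard):
    # shard is the (2, 2) local block; the output must be a (4, 1) local
    # block, since out_specs shards axis 1 over the flattened ('i', 'j')
    # axis of all 4 devices. The claim is that one collective suffices.
    raise NotImplementedError

g = jax.shard_map(f, mesh=m, in_specs=P('i', 'j'),
                  out_specs=P(None, ('i', 'j')))
# Goal: jnp.array_equal(g(x), x) for the f you write.
```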
@ezyang
Edward Z. Yang
8 days
I've retitled it State of torch.compile *for training*, because we specifically focus on this aspect of using torch.compile in the post.
1 reply · 0 reposts · 31 likes
@ezyang
Edward Z. Yang
8 days
2 replies · 8 reposts · 60 likes
@ezyang
Edward Z. Yang
8 days
State of torch.compile, August 2025.
[image]
13 replies · 68 reposts · 759 likes
@ezyang
Edward Z. Yang
9 days
Peter Hawkins points out to me that this is controlled by xla_allow_excess_precision, which JAX enables by default. So indeed, torch.compile and JAX/XLA both do the same thing by default.
[Quoted tweet]
@ezyang
Edward Z. Yang
11 days
Hey TL, hope you can answer this for me: torch.compile does a thing where if it fuses several bfloat16 operations together, it will do the internal compute in float32 without wasting cycles clamping the intermediates to bfloat16. Does JAX do this by default?
1 reply · 0 reposts · 22 likes
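[A hedged sketch of how you could check this yourself: xla_allow_excess_precision (flag spelling per the thread) is an XLA debug option, so flipping it via XLA_FLAGS before JAX initializes is the usual route. Compare the HLO with and without it to see whether the intermediate stays in f32.]

```python
import os
# Must be set before JAX initializes its backends. The flag defaults to
# true; set it to false to see intermediates clamped back to bfloat16.
os.environ["XLA_FLAGS"] = "--xla_allow_excess_precision=false"

import jax
import jax.numpy as jnp

def f(x):
    return jnp.exp(jnp.sin(x))  # two fusable bfloat16 ops

x = jnp.ones((1024,), dtype=jnp.bfloat16)
# Inspect the fused HLO for the dtype of the intermediate between sin and exp.
print(jax.jit(f).lower(x).compile().as_text())
```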
@ezyang
Edward Z. Yang
9 days
We are writing an improved printer for device mesh and we need a way to summarize device ranges. Please vote.
1 reply · 0 reposts · 4 likes
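[In the spirit of the poll, a hypothetical sketch of the kind of summarization being voted on — the name and output format here are invented, not the actual PyTorch printer:]

```python
def summarize_ranges(ids):
    """Collapse device ids into contiguous ranges, e.g. [0,1,2,3,8,9] -> '0-3, 8-9'."""
    ids = sorted(ids)
    out, start, prev = [], ids[0], ids[0]
    for d in ids[1:]:
        if d == prev + 1:       # extend the current contiguous run
            prev = d
            continue
        out.append(f"{start}-{prev}" if start != prev else f"{start}")
        start = prev = d        # begin a new run
    out.append(f"{start}-{prev}" if start != prev else f"{start}")
    return ", ".join(out)

assert summarize_ranges([0, 1, 2, 3, 8, 9]) == "0-3, 8-9"
assert summarize_ranges([5]) == "5"
```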
@ezyang
Edward Z. Yang
10 days
RT @giffmana: @ezyang @tenderizzation @gaunernst Yeah back to the point about big_vision, on TPUs matmuls have always been done on the MXU …
0 replies · 1 repost · 0 likes
@ezyang
Edward Z. Yang
11 days
Hey TL, hope you can answer this for me: torch.compile does a thing where if it fuses several bfloat16 operations together, it will do the internal compute in float32 without wasting cycles clamping the intermediates to bfloat16. Does JAX do this by default?
2 replies · 0 reposts · 28 likes
@ezyang
Edward Z. Yang
11 days
RT @GrantSlatton: now that i can use codex with my chatgpt sub, i'm gonna cancel claude max for a month and see if i can get by with gpt5 …
0 replies · 1 repost · 0 likes
@ezyang
Edward Z. Yang
11 days
RT @OpenAIDevs: We’re also releasing v0.16 of the Codex CLI today.
- GPT-5 is now the default model.
- Use with your ChatGPT plan.
- A new, …
0 replies · 365 reposts · 0 likes