@abhi_venigalla
Abhi Venigalla
1 year
And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July! 🎄

Replies

@abhi_venigalla
Abhi Venigalla
1 year
Ready for GPU independence weekend? PyTorch 2.0 and LLM Foundry now work out of the box on AMD GPUs! We profiled MPT 1B-13B models on the AMD MI250 and saw throughput at about 80% of A100-40GB, which could rise to ~94% with better software. It. Just. Works.
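The "out of the box" claim rests on PyTorch's device-agnostic API: the same training-step code selects whichever accelerator is visible, because ROCm builds of PyTorch expose AMD GPUs through the usual `torch.cuda` interface. A minimal sketch in plain PyTorch (generic code, not LLM Foundry itself):

```python
import torch

# Device-agnostic setup: this one line picks up an NVIDIA *or* AMD GPU,
# since ROCm builds of PyTorch report AMD devices via torch.cuda.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A toy model and one optimizer step; the code is identical on both vendors.
model = torch.nn.Linear(16, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 16, device=device)
y = torch.randn(8, 1, device=device)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
print(f"one step on {device}, loss={loss.item():.4f}")
```

The same script falls back to CPU when no accelerator is present, which is what makes vendor switching mid-run possible without code changes.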
@abhi_venigalla
Abhi Venigalla
1 year
Here's MPT-1B training for 1B tokens on NVIDIA A100 (green) vs. AMD MI250 (red). Can you spot a difference? Both runs use the exact same LLM Foundry code:
@abhi_venigalla
Abhi Venigalla
1 year
If we zoom in on the first 100 batches, we get nearly overlapping loss curves. This is crazy given that the runs are on two totally different hardware stacks! StreamingDataset and Composer do a lot of heavy lifting for determinism in the dataloader and train loop.
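The determinism behind those overlapping curves starts with pinning every RNG the dataloader and train loop touch. A toy sketch of that idea in plain PyTorch (the `seed_everything` helper here is hypothetical, not the Composer API):

```python
import random
import torch

def seed_everything(seed: int) -> None:
    # Hypothetical helper (not the Composer API): pin every RNG that the
    # dataloader and train loop might touch.
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when no GPU is visible

# Re-seeding reproduces the exact same draws -- the first ingredient of
# near-identical loss curves across runs, even on different hardware.
seed_everything(42)
a = torch.randn(3)
seed_everything(42)
b = torch.randn(3)
print(torch.equal(a, b))  # True: identical samples after re-seeding
```

Composer and StreamingDataset go further (deterministic shuffling, resumable dataloader state), but seeding is the foundation.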
@abhi_venigalla
Abhi Venigalla
1 year
What about perf? We only had 1 node of 4xMI250, so to compare with our 8xA100 systems we measured per-GPU metrics. With no code changes, perf on MI250 looks really strong! About 80% of A100-40GB. Better FlashAttention for AMD may close the gap (we predict ~94% of A100-40GB).
@abhi_venigalla
Abhi Venigalla
1 year
This is all made possible by a software and hardware stack that AMD has been building for years, and that is now bearing fruit. Seeing MI250 work so well today brings hope that the MI300x will too when it arrives!
@abhi_venigalla
Abhi Venigalla
1 year
One fun tidbit -- yes, with PyTorch you still call `torch.cuda` on AMD systems, and yes, it does work 😆
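A minimal illustration of the tidbit: ROCm builds of PyTorch reuse the `torch.cuda` namespace, so `device="cuda"` allocates on an AMD GPU, and `torch.version.hip` (which is `None` on NVIDIA builds) reveals which backend you're actually on:

```python
import torch

# On a ROCm build of PyTorch, AMD GPUs are driven through the *same*
# torch.cuda namespace: device="cuda" lands tensors on an MI250, and
# torch.version.hip reports the ROCm/HIP version (None on NVIDIA builds).
if torch.cuda.is_available():
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    x = torch.ones(2, 2, device="cuda")
    print(f"{torch.cuda.get_device_name(0)} via {backend}: sum={x.sum().item()}")
else:
    print("no GPU visible; the same code path covers both vendors")
```

This naming quirk is exactly why existing CUDA-targeted PyTorch code tends to run unmodified on AMD hardware.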
@abhi_venigalla
Abhi Venigalla
1 year
For more projections on (MI250, MI300x) vs. (A100, H100), check out @dylan522p's companion blog here:
@abhi_venigalla
Abhi Venigalla
1 year
Overall, I'm super optimistic about the future for AI hardware. More options means more compute supply, more market pressure on prices, and lower costs for users. If your hardware supports PyTorch 2.0 too (@HabanaLabs ???), reach out to us and we would love to showcase it!
@abhi_venigalla
Abhi Venigalla
1 year
Keep an eye on LLM Foundry, where we will add pre-built Docker images with ROCm FlashAttention to make the AMD setup process even faster. We'll also be profiling MPT on larger MI250 clusters soon! Lastly, @LisaSu any chance we can get early access to MI300x? 🙏
@lookfirst
1️⃣ 👍👍
11 months
@abhi_venigalla Hey Abhi, great blog post on the Mosaic site. Have you run the numbers with ROCm 5.6 yet?
@abhi_venigalla
Abhi Venigalla
11 months
@lookfirst Not yet but working on it!
@vitaliychiley
Vitaliy Chiley
1 year
@abhi_venigalla Christmas???
@abhi_venigalla
Abhi Venigalla
1 year
@vitaliychiley LOL you madman
@MasterScrat
Florian Laurent
1 year
@abhi_venigalla Can we see the same plot with total runtime as X axis?
@abhi_venigalla
Abhi Venigalla
1 year
@MasterScrat Sure, here are two plots, easy to do with Composer :) Left: loss vs. hours. Right: hours vs. step. This is 4xMI250 vs. 8xA100-40GB, so the red segments run at ~0.4x the speed of the green segments. The segments overlap a bit because I was manually killing and resuming runs.
@AlexanderDerve
Alexander Derve
1 year
@abhi_venigalla that's wild
@spatialneuron
@spatial
1 year
@abhi_venigalla We're so fucking back
@abhi_venigalla Please delete this tweet and let gamers buy cheap AMD cards from cash strapped scalpers and miners for the summer at least
@6___0
catsNstuff
8 months
@abhi_venigalla how about the RDNA3 RETAIL models 7xxx lineup?