Arjun Devraj (@arjun_devraj_)
PhD student @cornell_cs. Previously: SWE @meta, undergrad @princeton
Joined September 2023 · 26 Followers · 221 Following · 8 Statuses
⭐️ On an 8-GPU NVSwitched server, StragglAR speeds up AllReduce for larger buffers by 22% over Ring, a result that should only improve with more GPUs. StragglAR also reliably reduces end-to-end iteration time during data-parallel finetuning of Llama-3.2-3B (see our paper)!
📈 StragglAR’s performance advantage *increases* as the GPU cluster size scales, and it asymptotically achieves the lowest known bandwidth cost among all algorithms under straggler conditions! We do this by reformulating AllReduce as an efficient broadcast with n-2+log n rounds.
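A quick illustration of why the gap widens with cluster size, using the round counts the thread states (an assumption on my part: log base 2, n a power of two, and the textbook 2*(n-1) rounds for a ring AllReduce as the baseline):

```python
import math

# Round-count comparison implied by the thread (illustrative only):
# a ring AllReduce takes 2*(n-1) communication rounds, while the
# broadcast reformulation claimed above takes n - 2 + log2(n),
# so the gap grows roughly linearly with cluster size n.
def ring_rounds(n):
    return 2 * (n - 1)

def stragglar_rounds(n):
    return n - 2 + int(math.log2(n))

for n in (8, 16, 64, 256):
    print(n, ring_rounds(n), stragglar_rounds(n),
          ring_rounds(n) - stragglar_rounds(n))
```

For n = 8 this is 14 vs. 9 rounds; by n = 256 it is 510 vs. 262, consistent with the advantage increasing as the cluster scales.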
💡Once the straggler reaches the synchronization barrier, StragglAR implements a fast, novel collective algorithm to complete the AllReduce. When the initial ReduceScatter is fully overlapped with the straggler delay, this results in (provably) 2x lower communication cost!
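Back-of-the-envelope bandwidth accounting behind the 2x claim (a sketch; the paper's exact cost model may differ). A ring AllReduce on n GPUs with an M-byte buffer moves about 2*(n-1)/n * M bytes per GPU, split evenly between a ReduceScatter phase and an AllGather phase:

```python
# Standard ring-AllReduce bandwidth accounting (illustrative sketch,
# not the paper's derivation): per-GPU traffic is 2*(n-1)/n * M,
# half for ReduceScatter and half for AllGather.
def ring_allreduce_cost(n, M):
    return 2 * (n - 1) / n * M

def cost_after_delay(n, M):
    # If the ReduceScatter half is fully hidden behind the straggler
    # delay, only the completion phase's (n-1)/n * M bytes remain
    # on the critical path.
    return (n - 1) / n * M

n, M = 8, 1.0
print(ring_allreduce_cost(n, M) / cost_after_delay(n, M))  # -> 2.0
```

Hiding one of the two equal-sized phases behind the delay is exactly where the factor of 2 comes from.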
🏁 Instead of allowing other GPUs to idle (while waiting for the straggler) before starting the AllReduce, our algorithm—StragglAR—uses the delay to perform useful communication. With StragglAR, non-straggler GPUs complete a ReduceScatter while waiting for the straggler.
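A toy correctness sketch of this idea (my own simplified schedule, not the paper's actual algorithm): the on-time ranks reduce-scatter among themselves during the delay, then fold the straggler's data in shard-by-shard once it arrives.

```python
import random

# Toy sketch (hypothetical sizes; not the paper's schedule): n GPUs
# each hold a gradient vector, and the last rank is a straggler.
n, shard = 4, 2
dim = (n - 1) * shard              # vector length, divisible by n-1
random.seed(0)
grads = [[random.random() for _ in range(dim)] for _ in range(n)]
straggler = n - 1

# Phase 1 (hidden behind the straggler delay): reduce-scatter over the
# n-1 on-time ranks; rank r ends up owning shard r of their partial sum.
partial = [sum(g[i] for g in grads[:straggler]) for i in range(dim)]
shards = [partial[r * shard:(r + 1) * shard] for r in range(n - 1)]

# Phase 2 (after the straggler arrives): add its contribution to each
# shard, then allgather so every rank holds the fully reduced vector.
late = grads[straggler]
full = [v + late[r * shard + i]
        for r, s in enumerate(shards) for i, v in enumerate(s)]

expected = [sum(g[i] for g in grads) for i in range(dim)]
assert all(abs(a - b) < 1e-9 for a, b in zip(full, expected))
print("matches a plain AllReduce sum")
```

The point of the sketch: the ReduceScatter work done during the delay is not wasted, since only a cheap completion phase is needed once the straggler shows up.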
🐌 Persistent straggler GPUs delay AllReduce. Distributed ML jobs that use data or tensor parallelism rely on AllReduce to communicate gradients/activations in training and inference. In our multi-GPU training experiments, we find that a single *persistent* straggler bottlenecks this AllReduce.
Excited to share our preprint: Accelerating AllReduce with a Persistent Straggler 🚀 w/ Eric Ding, @nth_abhishek, Robert Kleinberg, @rachee_singh We design a new algorithm to speed up AllReduce in distributed ML jobs with a persistent straggler GPU 🧵⬇️ https://t.co/VaGaLDCDvY
@nth_abhishek presents our work on server-scale photonic interconnects at @ACMSIGCOMM HotNets! Thanks Sujata for chairing the session!
I am hiring a postdoc at Cornell for systems research on next-generation multi-GPU interconnects. If you are about to graduate with a PhD in CS or a related field, email me at rachee@cs.cornell.edu with your CV and a representative publication.