Arjun Devraj Profile
Arjun Devraj

@arjun_devraj_

Followers: 26 · Following: 221 · Media: 0 · Statuses: 8

PhD student @cornell_cs. Previously: SWE @meta, undergrad @princeton

Joined September 2023
@arjun_devraj_
Arjun Devraj
7 months
⭐️ On an 8-GPU NVSwitched server, StragglAR speeds up AllReduce for larger buffers by 22% over Ring, a result that should only improve with more GPUs. StragglAR also reliably reduces end-to-end iteration time during data-parallel finetuning of Llama-3.2-3B (see our paper)!
0
0
2
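The 22% speedup above is reported in the paper's experiments. For context only, below is a minimal sketch of how one might time NCCL's default Ring-based AllReduce for a large buffer on a single multi-GPU node with torch.distributed. This is not the StragglAR implementation; the buffer size, iteration counts, and launch command are illustrative assumptions.

```python
# Minimal sketch (not StragglAR): timing NCCL AllReduce on one multi-GPU node.
# Launch with, e.g.: torchrun --nproc_per_node=8 bench_allreduce.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # "Larger buffer" regime is bandwidth-bound; 256 MB of fp32 is an assumed size.
    buf = torch.randn(64 * 1024 * 1024, device="cuda")

    # Warm up, then time the default (Ring-based) NCCL AllReduce.
    for _ in range(5):
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(20):
        dist.all_reduce(buf)
    end.record()
    torch.cuda.synchronize()
    if rank == 0:
        print(f"mean AllReduce time: {start.elapsed_time(end) / 20:.2f} ms")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```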
@arjun_devraj_
Arjun Devraj
7 months
📈 StragglAR’s performance advantage *increases* as the GPU cluster size scales, and it asymptotically achieves the lowest known bandwidth cost among all algorithms under straggler conditions! We do this by reformulating AllReduce as an efficient broadcast with n-2+log n rounds.
1
0
2
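A back-of-the-envelope comparison of the round count quoted above against Ring AllReduce's standard 2(n-1) communication rounds. The n-2+log n figure is taken from the tweet; reading log as log2 is an assumption on my part.

```python
# Round-count comparison: Ring AllReduce (2*(n-1)) vs the n - 2 + log n
# figure quoted in the tweet above (log2 assumed).
import math

for n in [8, 16, 32, 64, 128]:
    ring = 2 * (n - 1)
    stragglar = n - 2 + math.ceil(math.log2(n))
    print(f"n={n:4d}  Ring={ring:4d}  StragglAR={stragglar:4d}  "
          f"ratio={ring / stragglar:.2f}x")
```

Under this reading, the ratio of round counts grows toward 2x as n increases, which is consistent with the claim that the advantage increases with cluster size.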
@arjun_devraj_
Arjun Devraj
7 months
💡Once the straggler reaches the synchronization barrier, StragglAR implements a fast, novel collective algorithm to complete the AllReduce. When the initial ReduceScatter is fully overlapped with the straggler delay, this results in (provably) 2x lower communication cost!
1
0
2
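The 2x claim above can be seen with simple accounting under the standard bandwidth-optimal decomposition AllReduce = ReduceScatter + AllGather, each moving (n-1)/n of the buffer per GPU. This models the generic decomposition, not StragglAR's specific collective, and the buffer size is an assumed example.

```python
# Rough per-GPU communicated-bytes accounting, assuming the standard
# bandwidth-optimal decomposition AllReduce = ReduceScatter + AllGather,
# each moving (n-1)/n * M bytes per GPU. Numbers are illustrative.
n = 8                    # GPUs
M = 256 * 2**20          # buffer size in bytes (assumed)

reduce_scatter = (n - 1) / n * M
all_gather = (n - 1) / n * M

full_allreduce = reduce_scatter + all_gather   # nothing overlapped
after_straggler = all_gather                   # ReduceScatter hidden by the delay

print(f"critical-path bytes, no overlap:       {full_allreduce / 2**20:.0f} MiB")
print(f"critical-path bytes, RS overlapped:    {after_straggler / 2**20:.0f} MiB")
print(f"reduction: {full_allreduce / after_straggler:.1f}x")
```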
@arjun_devraj_
Arjun Devraj
7 months
🏁 Instead of allowing other GPUs to idle (while waiting for the straggler) before starting the AllReduce, our algorithm—StragglAR—uses the delay to perform useful communication. With StragglAR, non-straggler GPUs complete a ReduceScatter while waiting for the straggler.
1
0
2
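A minimal sketch of the overlap idea described in the tweet above: while a known straggler rank is still busy, the remaining ranks run a ReduceScatter among themselves so the wait is not idle. The fixed straggler rank, buffer sizes, and the simple sleep standing in for the straggler's delayed compute are all assumptions; the actual StragglAR collective that finishes the AllReduce afterward is not shown.

```python
# Illustrative overlap sketch (not StragglAR's actual collective).
# Launch with, e.g.: torchrun --nproc_per_node=8 overlap_sketch.py
import time
import torch
import torch.distributed as dist

STRAGGLER_RANK = 0          # assumed known, persistent straggler

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    non_stragglers = [r for r in range(world) if r != STRAGGLER_RANK]
    # Every rank must call new_group, even ranks not in the group.
    fast_group = dist.new_group(ranks=non_stragglers)

    grad = torch.randn(len(non_stragglers) * 1024 * 1024, device="cuda")

    if rank == STRAGGLER_RANK:
        time.sleep(1.0)     # stand-in for the straggler's delayed compute
    else:
        # Useful work during the wait: partial reduction among the fast ranks.
        shard = torch.empty(grad.numel() // len(non_stragglers), device="cuda")
        dist.reduce_scatter_tensor(shard, grad, group=fast_group)

    # In StragglAR, a custom collective would now complete the AllReduce once
    # the straggler arrives; here we only synchronize everyone.
    dist.barrier()
    if rank == 0:
        print("straggler arrived; fast ranks already hold reduced shards")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```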
@arjun_devraj_
Arjun Devraj
7 months
🐌 Persistent straggler GPUs delay AllReduce. Distributed ML jobs that use data or tensor parallelism are bottlenecked by AllReduce, which communicates gradients/activations in training and inference. In our multi-GPU training experiments, we find that a *persistent* straggler delays AllReduce.
1
0
2
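For context on why AllReduce sits on the critical path in the data-parallel case described above, here is a minimal hand-rolled gradient synchronization. Real jobs typically use DistributedDataParallel; the model, batch size, and launch command are assumptions, not the paper's setup.

```python
# Data parallelism in miniature: each rank computes gradients on its own
# batch, then an AllReduce averages them, so one straggler delays all ranks.
# Launch with, e.g.: torchrun --nproc_per_node=8 dp_allreduce.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(32, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()

    # The AllReduce on the critical path: every rank blocks here, so a single
    # persistent straggler delays the whole iteration.
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```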
@arjun_devraj_
Arjun Devraj
7 months
Excited to share our preprint: Accelerating AllReduce with a Persistent Straggler 🚀 w/ Eric Ding, @nth_abhishek, Robert Kleinberg, @rachee_singh We design a new algorithm to speed up AllReduce in distributed ML jobs with a persistent straggler GPU 🧵⬇️ https://t.co/VaGaLDCDvY
1
3
12
@rachee_singh
Rachee Singh
1 year
@nth_abhishek presents our work on server-scale photonic interconnects at @ACMSIGCOMM HotNets! Thanks Sujata for chairing the session!
0
4
16
@rachee_singh
Rachee Singh
1 year
I am hiring a postdoc at Cornell for systems research on next-generation multi-GPU interconnects. If you are about to graduate with a PhD in CS or a related field, email me at rachee@cs.cornell.edu with your CV and a representative publication.
1
35
113