uccl_project
@uccl_proj
Followers: 76 · Following: 7 · Media: 9 · Statuses: 34
Building next-generation AI/ML networking
Berkeley, California
Joined June 2025
1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀 Code: https://t.co/4kyjuItIBy Blog: https://t.co/pbQcTjc7Vc Results: AllReduce on 6 HGX nodes across 2 racks over RoCE RDMA
11/ 🤝 Collaborators: @ziming_mao, @yangzhouy, Yihan Zhang, Chihan Cui, Zhongjie Chen, Zhiying Xu, @KaichaoYou, Zhen Zhang, Zhenyu Gu, Costin Raiciu, Scott Shenker, @istoica05 from @Berkeley_EECS, @ucdavis, @UWMadison, @Tsinghua_Uni, @awscloud, @AMD, @Broadcom, UPB
10/ 🌍 Vision: UCCL-EP democratizes expert-parallel training — GPU-driven performance without vendor lock-in.
9/ 🛠️ Roadmap: • Optimize EFA path • Port to AMD GPUs + Broadcom NICs (PR #457) • Advanced CPU flow control • Integration with vLLM & SGLang
8/ 📊 Performance: • AWS H200 + EFA (400 Gb/s): Dispatch > 50 GB/s, Combine ≈ 40 GB/s • GH200 + CX7 (200 Gb/s): UCCL-EP even beats DeepEP!
7/ 🌐 Vendor flexibility: UCCL-EP can run on EFA, Broadcom, Pensando, etc. It implements software-level atomics & reordering for EFA’s out-of-order SRD transport. Removes the NVSHMEM dependency → faster + portable.
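A minimal sketch of the reordering side of that idea: SRD can complete receives out of order, so arrivals are buffered by sequence number and released only when contiguous. The class, names, and packet representation below are illustrative assumptions, and the real implementation also layers software atomics on top of this.

```python
# Minimal reorder buffer: the transport may deliver messages out of order,
# so hold arrivals by sequence number and release them only when contiguous.
# Names and the packet representation are illustrative assumptions.
class ReorderBuffer:
    def __init__(self) -> None:
        self.expected = 0      # next in-order sequence number
        self.pending = {}      # seq -> payload, held until deliverable

    def on_receive(self, seq: int, payload: bytes) -> list:
        """Accept an arrival and return whatever is now deliverable in order."""
        self.pending[seq] = payload
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

rb = ReorderBuffer()
for seq in (2, 0, 1, 3):                       # out-of-order arrival pattern
    ready = rb.on_receive(seq, f"msg{seq}".encode())
    print(f"arrived seq {seq}, deliver {[m.decode() for m in ready]}")
# seq 2 delivers nothing; seq 0 releases msg0; seq 1 releases msg1 and msg2;
# seq 3 releases msg3.
```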
6/ 🔁 Design: A lock-free GPU-CPU FIFO channel allows >50 M RDMA ops/s. Multi-threaded CPU proxies handle flow control, completions, and congestion management — restoring visibility while keeping performance.
5/ 🧩 UCCL-EP’s insight: Keep GPU-initiated logic, but move NIC control back to CPU proxies. GPUs enqueue lightweight commands → CPUs issue RDMA verbs via libibverbs.
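A host-only Python sketch of that proxy pattern, meant only to show the control flow: the producer thread stands in for GPU kernels, and the RdmaCmd layout and post_rdma_write stub are assumptions. The real channel is a lock-free FIFO in pinned, GPU-visible memory drained by CPU proxies that post verbs via libibverbs.

```python
# Conceptual sketch of the GPU -> CPU proxy pattern: producers enqueue small
# command descriptors and a CPU proxy drains them, issuing RDMA operations.
# The real UCCL-EP channel is a lock-free FIFO written by CUDA kernels;
# this Python analogue (threads + queue) only illustrates the flow.
import queue
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class RdmaCmd:          # hypothetical descriptor layout
    peer: int           # destination rank
    raddr: int          # remote buffer offset
    nbytes: int         # payload size

def post_rdma_write(cmd: RdmaCmd) -> None:
    # Stub: the real proxy would post an RDMA WRITE via libibverbs here,
    # and also handle flow control, completions, and congestion management.
    print(f"RDMA WRITE {cmd.nbytes} B -> rank {cmd.peer} @ {cmd.raddr:#x}")

def cpu_proxy(chan: "queue.Queue[Optional[RdmaCmd]]") -> None:
    while (cmd := chan.get()) is not None:
        post_rdma_write(cmd)

chan: "queue.Queue[Optional[RdmaCmd]]" = queue.Queue(maxsize=4096)
proxy = threading.Thread(target=cpu_proxy, args=(chan,))
proxy.start()

# Stand-in for GPU kernels enqueueing lightweight commands per routed batch.
for peer in range(4):
    chan.put(RdmaCmd(peer=peer, raddr=peer * 0x1000, nbytes=7 * 1024))
chan.put(None)          # shutdown sentinel
proxy.join()
```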
4/ ⚙️ DeepEP’s idea: NVIDIA’s DeepEP runs GPU-driven RDMA (IBGDA/NVSHMEM) — GPUs directly ring NIC doorbells for ultra-low latency. 📷 But: it only works on NVIDIA GPUs + NICs. Not on AWS EFA, AMD, Broadcom, etc.
3/ 📉 The challenge: NCCL and other collectives are optimized for large tensors. Tiny EP messages hit latency limits — GPUs stall waiting on communication.
2/ 🧠 What’s Expert Parallelism? EP splits a Mixture-of-Experts model across GPUs — each token is routed to the right “expert.” This means lots of small, frequent GPU-to-GPU transfers (7 KB – 256 KB).
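To make those numbers concrete, here is a rough back-of-envelope sketch in Python. The hidden size (7168) and the FP8 dispatch / BF16 combine payload widths are illustrative assumptions (DeepSeek-V3-like), not values stated in this thread; the 400 Gb/s link rate matches the EFA setup mentioned in 8/.

```python
# Back-of-envelope sketch of expert-parallel message sizes and wire time.
# Hidden size and payload dtypes are assumptions for illustration.
HIDDEN = 7168                   # model hidden dimension (assumed)
DISPATCH_BYTES = HIDDEN * 1     # FP8 activations routed to experts (assumed)
COMBINE_BYTES = HIDDEN * 2      # BF16 results sent back (assumed)

for tokens in (1, 4, 16):
    print(f"{tokens:>2} token(s): dispatch ~{tokens * DISPATCH_BYTES / 1024:.0f} KB, "
          f"combine ~{tokens * COMBINE_BYTES / 1024:.0f} KB")

# At 400 Gb/s, a single ~7 KB message spends well under a microsecond on the
# wire, so per-message software and NIC overheads dominate; that is why these
# transfers are latency-bound rather than bandwidth-bound.
wire_us = DISPATCH_BYTES * 8 / 400e9 * 1e6
print(f"wire time for one ~7 KB message at 400 Gb/s: ~{wire_us:.2f} us")
```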
1/🎯 Performance: On EFA, we observe that UCCL-EP significantly outperforms the baselines as we increase the number of tokens in dispatch and combine.
🚀 Introducing UCCL-EP: A portable, efficient Expert Parallelism framework that brings DeepEP-level GPU-driven communication with the same APIs to any cloud or hardware — AWS EFA, AMD GPUs, Broadcom NICs and beyond. Blog: https://t.co/d3oBVlWezZ Code: https://t.co/0UbCUYz9N9
9/N🤝Moreover, we provide free consulting on performance-related NCCL/RCCL issues, so reach out to us if you hit any.
8/N🌟UCCL is fully open-source at https://t.co/4kyjuItIBy, with many developers and maintainers from UC Berkeley Sky Computing Lab, the lab that created Spark, Ray, and vLLM. We enthusiastically invite open-source developers to join and contribute to the UCCL project.
7/N🌐UCCL also works with RCCL on AMD GPUs and Broadcom NICs.
6/N 🧵Now with UCCL, we avoid NCCL's performance drop at large messages and further improve the maximum throughput.
5/N🧵Our final guess is RoCE network congestion. Our solution is to apply the UCCL plugin, which uses multiple network paths in a load-aware manner, simply by specifying NCCL_NET_PLUGIN= https://t.co/k6lCxU1zaU.
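A minimal sketch of picking the plugin up from a PyTorch job, assuming it is launched with torchrun. The plugin value below is a placeholder, since the exact setting sits behind the shortened link above; take the real value from the UCCL docs.

```python
# Sketch: point NCCL at the UCCL net plugin before the first communicator is
# created. The value below is a placeholder; use the name/path from UCCL docs.
import os
os.environ["NCCL_NET_PLUGIN"] = "/path/to/uccl-nccl-net-plugin.so"  # placeholder

import torch
import torch.distributed as dist

dist.init_process_group("nccl")                 # e.g. launched via torchrun
dev = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
x = torch.ones(256 * 1024 * 1024, device=dev)   # 1 GiB fp32 all-reduce
dist.all_reduce(x)                              # multi-path, load-aware transport via the plugin
dist.destroy_process_group()
```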
4/N🧵Our next guess is that NCCL needs more SMs for data copy and reduce operations, so we set NCCL_MAX_NCHANNELS=8 and NCCL_MIN_NCHANNELS=8. However, even with this, we only reach 39 GB/s, and NCCL performance still drops severely at large messages.
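For reference, the same setting expressed in Python (it can equally be exported in the launch environment); it must be in place before the first NCCL communicator is created.

```python
# Sketch: pin NCCL to exactly 8 channels (channels map to SMs that do the
# copy/reduce work). Set before NCCL initialization.
import os
os.environ["NCCL_MIN_NCHANNELS"] = "8"
os.environ["NCCL_MAX_NCHANNELS"] = "8"
```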
3/N🧵We see that all-reduce network bandwidth only reaches 23 GB/s out of a theoretical 50 GB/s. We then set NCCL_IB_QPS_PER_CONNECTION to 4 to use multiple RDMA connections and fully utilize the NIC's two ports. That brings performance to >25 GB/s, but still far behind 50 GB/s.
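The same knob in Python form; as with the other settings, it has to be set before NCCL initializes.

```python
# Sketch: open 4 RDMA queue pairs per connection so traffic can spread over
# both NIC ports. Set before NCCL initialization.
import os
os.environ["NCCL_IB_QPS_PER_CONNECTION"] = "4"
```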
2/N🧵We observed low all-reduce performance in NCCL, which we suspected was a networking issue, since NVLink itself was running well. So we set NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 to isolate the network performance.
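A minimal sketch of that isolation step in Python (the variables can also be exported in the job's environment): with P2P and shared memory disabled, intra-node traffic is forced through the network stack, so the RoCE path can be measured on its own.

```python
# Sketch: disable direct GPU-GPU (P2P) and host shared-memory transports so
# every NCCL transfer exercises the network path. Set before NCCL init.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
# ...then rerun the all-reduce benchmark (e.g. nccl-tests) to see network-only numbers.
```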