uccl_project

@uccl_proj

Followers 44 · Following 7 · Media 7 · Statuses 22

Building the next-generation AI/ML networking

Berkeley, California
Joined June 2025
uccl_project @uccl_proj · 25 days
1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀. Code: Blog: Results: AllReduce on 6 HGX nodes across 2 racks over RoCE RDMA.
uccl_project @uccl_proj · 7 days
9/N 🤝 Moreover, we provide free consulting on performance-related NCCL/RCCL issues, so reach out to us if you hit any.
uccl_project @uccl_proj · 7 days
8/N🌟UCCL is fully open-source at with many developers and maintainers from UC Berkeley Sky Computing Lab, the lab that created Spark, Ray, and vLLM. We enthusiastically invite open-source developers to join and contribute to the UCCL project.
uccl_project @uccl_proj · 7 days
7/N 🌐 UCCL also works with RCCL on AMD GPUs and Broadcom NICs.
uccl_project @uccl_proj · 7 days
6/N 🧵 Now, with UCCL, we avoid the NCCL performance drop at large messages and further improve the maximum throughput.
uccl_project @uccl_proj · 7 days
5/N 🧵 We finally suspect RoCE network congestion. Our solution is to apply the UCCL plugin, which uses multiple network paths in a load-aware manner, simply by specifying NCCL_NET_PLUGIN=
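A minimal sketch of that step, assuming the variable is set before the first NCCL communicator is created; the plugin name below is only an assumed example, since the tweet's actual value is truncated:

    # Sketch: select an external NCCL network plugin by name. With
    # NCCL_NET_PLUGIN="uccl", NCCL looks for libnccl-net-uccl.so on the
    # dynamic-library search path; "uccl" is an assumed example name.
    import os
    os.environ["NCCL_NET_PLUGIN"] = "uccl"

    import torch.distributed as dist
    dist.init_process_group(backend="nccl")  # the plugin is loaded at NCCL init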
uccl_project @uccl_proj · 7 days
4/N 🧵 Our next guess is that NCCL needs more SMs for its data copy and reduce operations, so we set NCCL_MAX_NCHANNELS=8 and NCCL_MIN_NCHANNELS=8. However, even with this, we only reach 39 GB/s, and NCCL performance drops severely at large messages.
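As a sketch, assuming the variables are set in the training process before NCCL initializes (they could equally be exported in the launch script):

    # Sketch: pin NCCL to exactly 8 channels so more SMs run the copy/reduce
    # kernels; must be set before the first NCCL communicator is created.
    import os
    os.environ["NCCL_MIN_NCHANNELS"] = "8"
    os.environ["NCCL_MAX_NCHANNELS"] = "8"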
uccl_project @uccl_proj · 7 days
3/N 🧵 We see that all-reduce network bandwidth only reaches 23 GB/s out of the theoretical 50 GB/s. We then set NCCL_IB_QPS_PER_CONNECTION to 4 to use multiple RDMA queue pairs per connection and fully utilize the NIC's two ports. That brings performance to >25 GB/s, but still far below 50 GB/s.
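The same experiment as a minimal sketch, assuming the variable is set before NCCL initialization:

    # Sketch: give each NCCL connection 4 RDMA queue pairs so traffic can be
    # striped across the NIC's two ports; set before NCCL initializes.
    import os
    os.environ["NCCL_IB_QPS_PER_CONNECTION"] = "4"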
uccl_project @uccl_proj · 7 days
2/N 🧵 We observed low all-reduce performance in NCCL, which we suspect is a networking issue since NVLink runs steadily. So we set NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 to isolate the network performance.
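A minimal sketch of this isolation step, assuming the flags are set at the top of the training script before the first NCCL communicator is created:

    # Sketch: disable NVLink/PCIe peer-to-peer and the shared-memory transport so
    # all collective traffic is forced through the network path being debugged.
    import os
    os.environ["NCCL_P2P_DISABLE"] = "1"
    os.environ["NCCL_SHM_DISABLE"] = "1"

    import torch.distributed as dist
    dist.init_process_group(backend="nccl")  # only the network transport is used from here on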
uccl_project @uccl_proj · 7 days
1/N 📢 Debugging NCCL performance problems for LLM workloads is always challenging. In this blog post, we explore various perf-critical parameters in NCCL and tackle datacenter network congestion with the UCCL plugin.
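For readers following along, a minimal sketch (not from the blog post) of how all-reduce bus bandwidth like the numbers in this thread can be measured with PyTorch + NCCL; it assumes a torchrun launch with one process per GPU, and the message size and iteration counts are illustrative:

    # Sketch: measure all-reduce bus bandwidth, mirroring the nccl-tests convention.
    import os, time, torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()

    x = torch.zeros(256 * 1024 * 1024, device="cuda")  # 1 GiB fp32 message
    for _ in range(5):                                  # warm-up iterations
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    dt = (time.time() - t0) / iters

    # Ring all-reduce moves 2*(world-1)/world of the buffer per rank, which is
    # the "bus bandwidth" convention used by nccl-tests.
    busbw = x.numel() * 4 * 2 * (world - 1) / world / dt / 1e9
    if dist.get_rank() == 0:
        print(f"all-reduce bus bandwidth: {busbw:.1f} GB/s")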
uccl_project @uccl_proj · 25 days
We add an updated NCCL vs. UCCL performance figure with an explanation of why there is a sudden performance drop for NCCL.
uccl_project @uccl_proj · 25 days
RT @ziming_mao: Excited to share UCCL! RDMA networks is slow to evolve, causing performance bottlenecks for ML workloads. UCCL tackles this….
uccl_project @uccl_proj · 25 days
RT @yangzhouy: Excited to release UCCL—come build the next-gen AI/ML networking solution with us! Also, if you hit any networking problems….
uccl_project @uccl_proj · 25 days
10/N 🙏 We also thank @BerkeleySky, @AMD, @IBM, @LambdaAPI, @Broadcom, @googlecloud, and CloudLab for their generous support.
uccl_project @uccl_proj · 25 days
9/N 🤝 UCCL is a team effort: @yangzhouy, Zhongjie Chen, @ziming_mao, @laochonlam, Shuo Yang, @praveingk, Jiaqi Gao, Yilong Zhao, Yongji Wu, @KaichaoYou, Fengyuan Ren, Zhiying Xu, Costin Raiciu, and @istoica05.
uccl_project @uccl_proj · 25 days
8/N ➡️ What's next: We will share our performance optimization experience and surprising findings regarding RoCE vs. IB when running UCCL on an ML testbed from IBM. Stay tuned 👀.
uccl_project @uccl_proj · 25 days
7/N 🌟 All the code and scripts are open-sourced, and we are actively developing UCCL in the GitHub repo: We'd love to invite the community to try it out and come build with us!
uccl_project @uccl_proj · 25 days
6/N 🔍 UCCL will also focus on P2P communication for MoE models and KV cache transfers. Our current goal is to make GPU-initiated P2P communication more efficient and generic (e.g., bringing DeepEP/IBGDA to AMD GPUs and Broadcom NICs).
uccl_project @uccl_proj · 25 days
5/N 🌐 UCCL provides a drop-in replacement for any NCCL application without code modification or recompilation. Further, UCCL is not limited to Nvidia GPUs and RDMA NICs; it supports AMD GPUs, AWS RDMA, and non-RDMA NICs (like IBM VirtIO).
uccl_project @uccl_proj · 25 days
4/N ⚡ Instead, UCCL enables packet spraying in software to leverage the abundant network paths (e.g., 256) and avoid the "single path of congestion" problem. The earlier NCCL vs. UCCL figure shows exactly this: UCCL successfully avoids network congestion for large messages.
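To make the idea concrete, a toy sketch of load-aware spraying across many paths; the 256-path count comes from the tweet, but the power-of-two-choices policy and the chunk size are assumptions for illustration, not UCCL's actual data path:

    # Toy illustration of load-aware packet spraying: each chunk picks the less
    # loaded of two random paths out of 256, instead of pinning a whole flow to
    # a single path that may be congested.
    import random

    NUM_PATHS = 256                    # e.g., distinct ECMP paths in the fabric
    inflight = [0] * NUM_PATHS         # bytes currently outstanding on each path

    def pick_path(chunk_bytes: int) -> int:
        a, b = random.randrange(NUM_PATHS), random.randrange(NUM_PATHS)
        path = a if inflight[a] <= inflight[b] else b
        inflight[path] += chunk_bytes  # decremented when the chunk is acknowledged
        return path

    # Spray a 1 MiB message as 4 KiB chunks over the paths.
    chosen = [pick_path(4096) for _ in range(256)]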
uccl_project @uccl_proj · 25 days
3/N 💡 UCCL builds a fast and extensible transport layer in software, bringing huge benefits. Specifically, the existing network transports in NCCL (i.e., TCP and RDMA) use only one or a few network paths to transfer huge amounts of data, and are thus prone to congestion in datacenter networks.