uccl_project
@uccl_proj
Followers: 76 · Following: 7 · Media: 9 · Statuses: 34
Building next-generation AI/ML networking
Berkeley, California
Joined June 2025
1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀 Code: https://t.co/4kyjuItIBy Blog: https://t.co/pbQcTjc7Vc Results: AllReduce on 6 HGX nodes across 2 racks over RoCE RDMA
11/ 🤝 Collaborators: @ziming_mao, @yangzhouy, Yihan Zhang, Chihan Cui, Zhongjie Chen, Zhiying Xu, @KaichaoYou, Zhen Zhang, Zhenyu Gu, Costin Raiciu, Scott Shenker, @istoica05 from @Berkeley_EECS, @ucdavis, @UWMadison, @Tsinghua_Uni, @awscloud, @AMD, @Broadcom, UPB
10/ 🌍 Vision: UCCL-EP democratizes expert-parallel training — GPU-driven performance without vendor lock-in.
9/ 🛠️ Roadmap: • Optimize EFA path • Port to AMD GPUs + Broadcom NICs (PR #457) • Advanced CPU flow control • Integration with vLLM & SGLang
8/ 📊 Performance: • AWS H200 + EFA (400 Gb/s): Dispatch > 50 GB/s, Combine ≈ 40 GB/s • GH200 + CX7 (200 Gb/s): UCCL-EP even beats DeepEP!
7/ 🌐 Vendor flexibility: UCCL-EP can run on EFA, Broadcom, Pensando, etc. It implements software-level atomics & reordering for EFA’s out-of-order SRD transport. Removes the NVSHMEM dependency → faster + portable.
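A minimal sketch of the reordering side of that idea: SRD can complete receives out of order, so arrivals are buffered by sequence number and released only when contiguous. The class, names, and packet representation below are illustrative assumptions, and the real implementation also layers software atomics on top of this.

```python
# Minimal reorder buffer: the transport may deliver messages out of order,
# so hold arrivals by sequence number and release them only when contiguous.
# Names and the packet representation are illustrative assumptions.
class ReorderBuffer:
    def __init__(self) -> None:
        self.expected = 0      # next in-order sequence number
        self.pending = {}      # seq -> payload, held until deliverable

    def on_receive(self, seq: int, payload: bytes) -> list:
        """Accept an arrival and return whatever is now deliverable in order."""
        self.pending[seq] = payload
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered

rb = ReorderBuffer()
for seq in (2, 0, 1, 3):                       # out-of-order arrival pattern
    ready = rb.on_receive(seq, f"msg{seq}".encode())
    print(f"arrived seq {seq}, deliver {[m.decode() for m in ready]}")
# seq 2 delivers nothing; seq 0 releases msg0; seq 1 releases msg1 and msg2;
# seq 3 releases msg3.
```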
6/ 🔁 Design: A lock-free GPU-CPU FIFO channel allows >50 M RDMA ops/s. Multi-threaded CPU proxies handle flow control, completions, and congestion management — restoring visibility while keeping performance.
5/ 🧩 UCCL-EP’s insight: Keep GPU-initiated logic, but move NIC control back to CPU proxies. GPUs enqueue lightweight commands → CPUs issue RDMA verbs via libibverbs.
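A host-only Python sketch of that proxy pattern, meant only to show the control flow: the producer thread stands in for GPU kernels, and the RdmaCmd layout and post_rdma_write stub are assumptions. The real channel is a lock-free FIFO in pinned, GPU-visible memory drained by CPU proxies that post verbs via libibverbs.

```python
# Conceptual sketch of the GPU -> CPU proxy pattern: producers enqueue small
# command descriptors and a CPU proxy drains them, issuing RDMA operations.
# The real UCCL-EP channel is a lock-free FIFO written by CUDA kernels;
# this Python analogue (threads + queue) only illustrates the flow.
import queue
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class RdmaCmd:          # hypothetical descriptor layout
    peer: int           # destination rank
    raddr: int          # remote buffer offset
    nbytes: int         # payload size

def post_rdma_write(cmd: RdmaCmd) -> None:
    # Stub: the real proxy would post an RDMA WRITE via libibverbs here,
    # and also handle flow control, completions, and congestion management.
    print(f"RDMA WRITE {cmd.nbytes} B -> rank {cmd.peer} @ {cmd.raddr:#x}")

def cpu_proxy(chan: "queue.Queue[Optional[RdmaCmd]]") -> None:
    while (cmd := chan.get()) is not None:
        post_rdma_write(cmd)

chan: "queue.Queue[Optional[RdmaCmd]]" = queue.Queue(maxsize=4096)
proxy = threading.Thread(target=cpu_proxy, args=(chan,))
proxy.start()

# Stand-in for GPU kernels enqueueing lightweight commands per routed batch.
for peer in range(4):
    chan.put(RdmaCmd(peer=peer, raddr=peer * 0x1000, nbytes=7 * 1024))
chan.put(None)          # shutdown sentinel
proxy.join()
```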
4/ ⚙️ DeepEP’s idea: NVIDIA’s DeepEP runs GPU-driven RDMA (IBGDA/NVSHMEM) — GPUs directly ring NIC doorbells for ultra-low latency. 📷 But: it only works on NVIDIA GPUs + NICs. Not on AWS EFA, AMD, Broadcom, etc.
3/ 📉 The challenge: NCCL and other collectives are optimized for large tensors. Tiny EP messages hit latency limits — GPUs stall waiting on communication.
2/ 🧠 What’s Expert Parallelism? EP splits a Mixture-of-Experts model across GPUs — each token is routed to the right “expert.” This means lots of small, frequent GPU-to-GPU transfers (7 KB – 256 KB).
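To make those numbers concrete, here is a rough back-of-envelope sketch in Python. The hidden size (7168) and the FP8 dispatch / BF16 combine payload widths are illustrative assumptions (DeepSeek-V3-like), not values stated in this thread; the 400 Gb/s link rate matches the EFA setup mentioned in 8/.

```python
# Back-of-envelope sketch of expert-parallel message sizes and wire time.
# Hidden size and payload dtypes are assumptions for illustration.
HIDDEN = 7168                   # model hidden dimension (assumed)
DISPATCH_BYTES = HIDDEN * 1     # FP8 activations routed to experts (assumed)
COMBINE_BYTES = HIDDEN * 2      # BF16 results sent back (assumed)

for tokens in (1, 4, 16):
    print(f"{tokens:>2} token(s): dispatch ~{tokens * DISPATCH_BYTES / 1024:.0f} KB, "
          f"combine ~{tokens * COMBINE_BYTES / 1024:.0f} KB")

# At 400 Gb/s, a single ~7 KB message spends well under a microsecond on the
# wire, so per-message software and NIC overheads dominate; that is why these
# transfers are latency-bound rather than bandwidth-bound.
wire_us = DISPATCH_BYTES * 8 / 400e9 * 1e6
print(f"wire time for one ~7 KB message at 400 Gb/s: ~{wire_us:.2f} us")
```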
1/🎯 Performance: On EFA, we observe that UCCL-EP significantly outperforms the baselines as we increase the number of tokens in dispatch and combine.
🚀 Introducing UCCL-EP: A portable, efficient Expert Parallelism framework that brings DeepEP-level GPU-driven communication with the same APIs to any cloud or hardware — AWS EFA, AMD GPUs, Broadcom NICs and beyond. Blog: https://t.co/d3oBVlWezZ Code: https://t.co/0UbCUYz9N9
9/N🤝Moreover, we provide free consulting on performance-related NCCL/RCCL issues, so reach out to us if you hit any.
8/N🌟UCCL is fully open-source at https://t.co/4kyjuItIBy, with many developers and maintainers from UC Berkeley Sky Computing Lab, the lab that created Spark, Ray, and vLLM. We enthusiastically invite open-source developers to join and contribute to the UCCL project.
7/N🌐UCCL also works with RCCL on AMD GPUs and Broadcom NICs.
6/N 🧵Now with UCCL, we avoid NCCL's performance drop at large messages and further improve the maximum throughput.
5/N🧵Our final guess is RoCE network congestion. Our solution is to apply the UCCL plugin, which uses multiple network paths in a load-aware manner, simply by specifying NCCL_NET_PLUGIN= https://t.co/k6lCxU1zaU.
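A minimal sketch of picking the plugin up from a PyTorch job, assuming it is launched with torchrun. The plugin value below is a placeholder, since the exact setting sits behind the shortened link above; take the real value from the UCCL docs.

```python
# Sketch: point NCCL at the UCCL net plugin before the first communicator is
# created. The value below is a placeholder; use the name/path from UCCL docs.
import os
os.environ["NCCL_NET_PLUGIN"] = "/path/to/uccl-nccl-net-plugin.so"  # placeholder

import torch
import torch.distributed as dist

dist.init_process_group("nccl")                 # e.g. launched via torchrun
dev = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
x = torch.ones(256 * 1024 * 1024, device=dev)   # 1 GiB fp32 all-reduce
dist.all_reduce(x)                              # multi-path, load-aware transport via the plugin
dist.destroy_process_group()
```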
4/N🧵Our next guess is that NCCL needs more SMs for data copy and reduce operations, so we set NCCL_MAX_NCHANNELS=8 and NCCL_MIN_NCHANNELS=8. However, even with this, we only reach 39 GB/s, and NCCL performance still drops severely at large messages.
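For reference, the same setting expressed in Python (it can equally be exported in the launch environment); it must be in place before the first NCCL communicator is created.

```python
# Sketch: pin NCCL to exactly 8 channels (channels map to SMs that do the
# copy/reduce work). Set before NCCL initialization.
import os
os.environ["NCCL_MIN_NCHANNELS"] = "8"
os.environ["NCCL_MAX_NCHANNELS"] = "8"
```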
3/N🧵We see that all-reduce network bandwidth only reaches 23 GB/s out of a theoretical 50 GB/s. We then set NCCL_IB_QPS_PER_CONNECTION to 4 to use multiple RDMA connections and fully utilize the NIC's two ports. That brings performance to >25 GB/s, but still far behind 50 GB/s.
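The same knob in Python form; as with the other settings, it has to be set before NCCL initializes.

```python
# Sketch: open 4 RDMA queue pairs per connection so traffic can spread over
# both NIC ports. Set before NCCL initialization.
import os
os.environ["NCCL_IB_QPS_PER_CONNECTION"] = "4"
```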
2/N🧵We observed low all-reduce performance in NCCL, which we suspected was a networking issue, since NVLink itself was running well. So we set NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 to isolate the network performance.
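A minimal sketch of that isolation step in Python (the variables can also be exported in the job's environment): with P2P and shared memory disabled, intra-node traffic is forced through the network stack, so the RoCE path can be measured on its own.

```python
# Sketch: disable direct GPU-GPU (P2P) and host shared-memory transports so
# every NCCL transfer exercises the network path. Set before NCCL init.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_SHM_DISABLE"] = "1"
# ...then rerun the all-reduce benchmark (e.g. nccl-tests) to see network-only numbers.
```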