uccl_project

@uccl_proj

Followers 76 · Following 7 · Media 9 · Statuses 34

Building the next-generation AI/ML networking

Berkeley, California
Joined June 2025
@uccl_proj
uccl_project
5 months
1/N 📢 Introducing UCCL (Ultra & Unified CCL), an efficient collective communication library for ML training and inference, outperforming NCCL by up to 2.5x 🚀 Code: https://t.co/4kyjuItIBy Blog: https://t.co/pbQcTjc7Vc Results: AllReduce on 6 HGX across 2 racks over RoCE RDMA
1
20
40
@uccl_proj
uccl_project
11 days
11/ 🤝 Collaborators: @ziming_mao, @yangzhouy, Yihan Zhang, Chihan Cui, Zhongjie Chen, Zhiying Xu, @KaichaoYou, Zhen Zhang, Zhenyu Gu, Costin Raiciu, Scott Shenker, @istoica05 from @Berkeley_EECS, @ucdavis, @UWMadison, @Tsinghua_Uni, @awscloud, @AMD, @Broadcom, UPB
0
0
0
@uccl_proj
uccl_project
11 days
10/ 🌍 Vision: UCCL-EP democratizes expert-parallel training — GPU-driven performance without vendor lock-in.
1
0
0
@uccl_proj
uccl_project
11 days
9/ 🛠️ Roadmap: • Optimize EFA path • Port to AMD GPUs + Broadcom NICs (PR #457) • Advanced CPU flow control • Integration with vLLM & SGLang
1
0
0
@uccl_proj
uccl_project
11 days
8/ 📊 Performance: • AWS H200 + EFA (400 Gb/s): Dispatch > 50 GB/s, Combine ≈ 40 GB/s • GH200 + CX7 (200 Gb/s): UCCL-EP even beats DeepEP!
1
0
0
@uccl_proj
uccl_project
11 days
7/ 🌐 Vendor flexibility: UCCL-EP could run on EFA, Broadcom, Pensando, etc. Implements software-level atomics & reordering for EFA’s out-of-order SRD transport. Removes NVSHMEM dependency → faster + portable.
1
0
0
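
For context, here is a minimal sketch of the reordering idea from 7/: the receiver parks out-of-order arrivals by sequence number and only releases the contiguous prefix to the upper layer. Names and structure are hypothetical, not UCCL's actual code.

```cuda
// Hedged sketch: software reordering on top of an out-of-order transport
// such as EFA's SRD. Each message carries a sender-assigned sequence number;
// the receiver delivers only the contiguous prefix to the upper layer.
#include <cstdint>
#include <map>
#include <vector>

struct Message {
  uint64_t seq;                       // sender-assigned sequence number
  std::vector<uint8_t> payload;
};

class ReorderQueue {
 public:
  // Called for every message the NIC delivers, in whatever order SRD chooses.
  template <typename Deliver>
  void on_receive(Message msg, Deliver deliver) {
    pending_.emplace(msg.seq, std::move(msg));
    // Drain the longest contiguous run starting at next_expected_.
    for (auto it = pending_.find(next_expected_); it != pending_.end();
         it = pending_.find(next_expected_)) {
      deliver(it->second);            // hand off to the upper layer in order
      pending_.erase(it);
      ++next_expected_;
    }
  }

 private:
  uint64_t next_expected_ = 0;               // next in-order sequence number
  std::map<uint64_t, Message> pending_;      // out-of-order arrivals wait here
};
```

Software-level atomics can then be applied at this single in-order delivery point rather than relying on NIC-side atomic support.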
@uccl_proj
uccl_project
11 days
6/ 🔁 Design: A lock-free GPU-CPU FIFO channel allows >50 M RDMA ops/s. Multi-threaded CPU proxies handle flow control, completions, and congestion management — restoring visibility while keeping performance.
1
0
0
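
Roughly what the lock-free channel from 6/ could look like: a single-producer/single-consumer ring in host-mapped pinned memory, where the GPU publishes commands and a tail index, and a CPU proxy thread polls and drains them. A sketch with hypothetical types, not UCCL's implementation.

```cuda
// Hedged sketch of a lock-free SPSC GPU->CPU command FIFO. The Fifo struct is
// assumed to live in memory allocated with cudaHostAlloc(cudaHostAllocMapped)
// so both the GPU and the CPU proxy can access it.
#include <cstdint>

struct Command {                      // lightweight RDMA request descriptor
  uint64_t laddr, raddr;              // local / remote buffer addresses
  uint32_t bytes;
  uint32_t peer;
};

struct Fifo {
  static constexpr uint32_t kSlots = 4096;   // power of two
  Command slots[kSlots];
  volatile uint32_t head;             // advanced by the CPU consumer
  volatile uint32_t tail;             // advanced by the GPU producer
};

// GPU side: a thread enqueues a command instead of touching the NIC itself.
__device__ void gpu_enqueue(Fifo *f, const Command &c) {
  uint32_t t = f->tail;
  while (t - f->head >= Fifo::kSlots) { /* spin: ring is full */ }
  f->slots[t % Fifo::kSlots] = c;
  __threadfence_system();             // make the payload visible to the CPU first
  f->tail = t + 1;                    // then publish the new tail
}

// CPU side: the proxy thread polls the ring and issues the real RDMA verbs.
// (A full implementation also needs an acquire fence between reading tail and
// reading the slot; omitted here for brevity.)
inline bool cpu_dequeue(Fifo *f, Command *out) {
  uint32_t h = f->head;
  if (h == f->tail) return false;     // empty
  *out = f->slots[h % Fifo::kSlots];
  f->head = h + 1;                    // free the slot for the producer
  return true;
}
```

Because each command is only a few dozen bytes, proxy threads polling rings like this can sustain very high command rates, which is how the >50 M RDMA ops/s figure above becomes plausible.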
@uccl_proj
uccl_project
11 days
5/ 🧩 UCCL-EP’s insight: Keep GPU-initiated logic, but move NIC control back to CPU proxies. GPUs enqueue lightweight commands → CPUs issue RDMA verbs via libibverbs.
1
0
0
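
And a sketch of the CPU-proxy side described in 5/: turning one dequeued command into a one-sided RDMA WRITE through libibverbs. The verbs calls (ibv_post_send and friends) are the standard API; the wrapper function and the queue-pair and memory-key setup are assumed.

```cuda
// Hedged sketch: a CPU proxy posting an RDMA WRITE for a command dequeued
// from the GPU FIFO. QP creation, memory registration (ibv_reg_mr on the GPU
// buffer), and key exchange are assumed to have happened at connection setup.
#include <infiniband/verbs.h>
#include <cstdint>

int post_rdma_write(ibv_qp *qp, uint32_t lkey, uint32_t rkey,
                    uint64_t laddr, uint64_t raddr, uint32_t bytes) {
  ibv_sge sge = {};
  sge.addr   = laddr;                 // registered (GPU) buffer address
  sge.length = bytes;
  sge.lkey   = lkey;

  ibv_send_wr wr = {};
  ibv_send_wr *bad = nullptr;
  wr.opcode              = IBV_WR_RDMA_WRITE;
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;   // ask for a completion entry
  wr.wr.rdma.remote_addr = raddr;
  wr.wr.rdma.rkey        = rkey;

  return ibv_post_send(qp, &wr, &bad);          // 0 on success
}
```

A proxy would then reap completions with ibv_poll_cq and signal the GPU, for example through a flag in the same mapped memory.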
@uccl_proj
uccl_project
11 days
4/ ⚙️ DeepEP’s idea: NVIDIA’s DeepEP runs GPU-driven RDMA (IBGDA/NVSHMEM) — GPUs directly ring NIC doorbells for ultra-low latency. 📷 But: it only works on NVIDIA GPUs + NICs. Not on AWS EFA, AMD, Broadcom, etc.
1
0
0
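
For contrast with the CPU-proxy sketches above, this is roughly what the GPU-driven style described in 4/ looks like when written against NVSHMEM's device-initiated puts. This is generic NVSHMEM usage, not DeepEP's code; remote_buf is assumed to be an address in the NVSHMEM symmetric heap.

```cuda
// Hedged sketch of GPU-initiated communication: the GPU thread issues the
// transfer itself and, with IBGDA, the NIC doorbell is rung from device code.
#include <nvshmem.h>

__global__ void push_tokens(float *remote_buf, const float *local_buf,
                            size_t n, int peer) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    nvshmem_float_put_nbi(remote_buf, local_buf, n, peer);  // non-blocking put
    nvshmem_quiet();    // wait on-device until the transfer has completed
  }
}
```

The catch the tweet points out: this path needs NVIDIA GPUs plus NICs that support it, which is exactly the portability gap UCCL-EP targets.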
@uccl_proj
uccl_project
11 days
3/ 📉 The challenge: NCCL and other collectives are optimized for large tensors. Tiny EP messages hit latency limits — GPUs stall waiting on communication.
1
0
0
@uccl_proj
uccl_project
11 days
2/ 🧠 What’s Expert Parallelism? EP splits a Mixture-of-Experts model across GPUs — each token is routed to the right “expert.” This means lots of small, frequent GPU-to-GPU transfers (7 KB – 256 KB).
1
0
1
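
A back-of-the-envelope illustration of the message sizes in 2/. The hidden size, dtype, and token count below are assumptions for illustration, not numbers from the thread.

```cuda
// Hedged arithmetic sketch: why expert-parallel dispatch traffic is many
// small messages rather than a few large tensors.
#include <cstdio>

int main() {
  const int hidden_dim     = 4096;  // assumed activation width per token
  const int bytes_per_elem = 2;     // assumed bf16 activations
  const int tokens_to_peer = 32;    // assumed tokens routed to one remote GPU per step

  int per_token = hidden_dim * bytes_per_elem;     // ~8 KB per token
  int per_peer  = per_token * tokens_to_peer;      // ~256 KB per dispatch to a peer

  printf("per-token payload: %d KB, per-peer dispatch: %d KB\n",
         per_token / 1024, per_peer / 1024);
  return 0;
}
```

At these sizes per-message latency dominates over bandwidth, which is the stall problem 3/ above describes.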
@uccl_proj
uccl_project
11 days
1/ 🎯 Performance: On EFA, we observe UCCL-EP significantly outperforms other baselines as we increase the number of tokens in dispatch and combine.
1
0
1
@uccl_proj
uccl_project
11 days
🚀 Introducing UCCL-EP: A portable, efficient Expert Parallelism framework that brings DeepEP-level GPU-driven communication with the same APIs to any cloud or hardware — AWS EFA, AMD GPUs, Broadcom NICs and beyond. Blog: https://t.co/d3oBVlWezZ Code: https://t.co/0UbCUYz9N9
1
5
6
@uccl_proj
uccl_project
4 months
9/N 🤝 Moreover, we provide free consulting on performance-related NCCL/RCCL issues, so reach out to us if you hit any such issues.
0
0
1
@uccl_proj
uccl_project
4 months
8/N 🌟 UCCL is fully open-source at https://t.co/4kyjuItIBy, with many developers and maintainers from UC Berkeley Sky Computing Lab, the lab that created Spark, Ray, and vLLM. We enthusiastically invite open-source developers to join and contribute to the UCCL project.
github.com
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven) - uccl-project/uccl
1
0
1
@uccl_proj
uccl_project
4 months
7/N 🌐 UCCL also works with RCCL on AMD GPUs and Broadcom NICs.
1
0
1
@uccl_proj
uccl_project
4 months
6/N 🧵 Now with UCCL, we avoid NCCL's performance drop at large messages and further improve the maximum throughput.
1
0
1
@uccl_proj
uccl_project
4 months
5/N 🧵 Our final guess is RoCE network congestion. Our solution is to apply the UCCL plugin, which uses multiple network paths in a load-aware manner, simply by specifying NCCL_NET_PLUGIN= https://t.co/k6lCxU1zaU.
1
0
1
@uccl_proj
uccl_project
4 months
4/N 🧵 Our next guess is that NCCL needs more SMs for data copy and reduce operations, so we set NCCL_MAX_NCHANNELS=8 and NCCL_MIN_NCHANNELS=8. However, even with this we only reach 39 GB/s, and NCCL performance drops severely at large messages.
1
0
1
@uccl_proj
uccl_project
4 months
3/N 🧵 We see that all-reduce network bandwidth reaches only 23 GB/s out of a theoretical 50 GB/s. We then set NCCL_IB_QPS_PER_CONNECTION=4 to use multiple RDMA connections and fully utilize the NIC's two ports. That brings performance to >25 GB/s, but still far behind 50 GB/s.
1
0
1
@uccl_proj
uccl_project
4 months
2/N 🧵 We observed low all-reduce performance with NCCL, which we suspect is a networking issue, since NVLink runs steadily. So we set NCCL_P2P_DISABLE=1 and NCCL_SHM_DISABLE=1 to isolate the network performance.
1
0
1