@nathanbarrydev
Nathan Barry
11 days
Research Log Day 2: HALoS. Communication between regions is far slower than communication within a region (lower bandwidth, higher latency). Most previous DiLoCo variants didn't take this into account. HALoS (Hierarchical Async Local SGD) addresses this by adding local and global parameter servers. (1/n)
For standard DiLoCo (and Streaming DiLoCo), we suffer from the straggler effect due to having to wait on workers with low-bandwidth network links. For variants that mask communication with computation, we suffer worse convergence due to either... (2/n)
requiring a higher number of inner steps to fully mask the communication (one-step-delay DiLoCo and variants) or taking more steps on stale local parameters while waiting for the updated parameters (Overlap Local-SGD). (3/n)
To understand HALoS, we first need to understand Async Local-SGD. Instead of all workers synchronizing every H steps, each worker pushes its update to a parameter server as soon as it finishes its H steps, without waiting on the others. (4/n)
While this means we don’t pay synchronization costs, we now suffer from staleness: the server applies each worker's outer update individually as it arrives, instead of taking one step using the average, so an update may have been computed from parameters that are already out of date. This staleness can lead to convergence issues. (5/n)
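To make the staleness point concrete, here is a minimal toy sketch of Async Local-SGD on a single scalar parameter. It is not taken from the HALoS simulator: the function names, the quadratic toy loss, and the `durations` argument (a made-up per-worker round time) are all assumptions for illustration. Staleness is counted as the number of server updates applied between a worker pulling parameters and pushing its update.

```python
import heapq


def inner_steps(x, target, lr, h):
    """Run h plain SGD steps on the toy loss 0.5 * (x - target) ** 2."""
    for _ in range(h):
        x -= lr * (x - target)
    return x


def async_local_sgd(durations, rounds, target=3.0, inner_lr=0.1,
                    outer_lr=0.5, h=10):
    """Simulate Async Local-SGD with one parameter server (toy, scalar model).

    durations[i] is a hypothetical wall-clock time for worker i to finish
    one round of h inner steps. Returns (final server parameter, staleness
    of every applied update).
    """
    server_x, version = 0.0, 0
    # Each event: (finish_time, worker_id, pulled_params, pulled_version)
    events = [(d, i, server_x, version) for i, d in enumerate(durations)]
    heapq.heapify(events)
    staleness = []
    for _ in range(rounds):
        t, i, snap_x, snap_v = heapq.heappop(events)
        local_x = inner_steps(snap_x, target, inner_lr, h)
        delta = snap_x - local_x           # pseudo-gradient from stale snapshot
        staleness.append(version - snap_v)  # server updates missed by worker i
        server_x -= outer_lr * delta        # applied immediately, no averaging
        version += 1
        # Worker i pulls the fresh parameters and starts its next round.
        heapq.heappush(events, (t + durations[i], i, server_x, version))
    return server_x, staleness


if __name__ == "__main__":
    x, stale = async_local_sgd(durations=[1.0, 4.0], rounds=10)
    print("final parameter:", x)
    print("staleness per update:", stale)
```

With one fast and one slow worker, the slow worker's updates are the stale ones: by the time it pushes, the server has already moved several versions ahead.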
HALoS doesn’t directly address the staleness issue. Instead, it focuses on minimizing compute idle time. HALoS introduces local parameter servers (LPS) within each region and a global parameter server (GPS) that merges updates across regions. (6/n)
One way to think about HALoS is that we are running multiple instances of Async Local-SGD, each having multiple workers within the same region. We treat each LPS as a normal Async Local-SGD worker and have another parameter server (the GPS) which they send updates to. (7/n)
When an LPS communicates with the GPS, the lower bandwidth and higher latency mean the exchange takes a while, so the LPS keeps applying updates from its workers in the meantime. When it receives the updated parameters from the GPS, it merges them with its updated local parameters instead of replacing them. (8/n)
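A rough sketch of the "merge instead of replace" step, under stated assumptions: the class and method names below are hypothetical, and the merge rule is a plain convex combination with a made-up weight `alpha`. The exact merge rule HALoS uses is not described in this thread, so treat this as an illustration of the idea only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalParameterServer:
    """Toy LPS: holds regional parameters and keeps training while the GPS
    round trip is in flight. Names and the merge rule are illustrative."""
    params: List[float]
    outer_lr: float = 0.5

    def apply_worker_update(self, delta: List[float]) -> None:
        # Apply one worker's pseudo-gradient as soon as it arrives.
        self.params = [p - self.outer_lr * d
                       for p, d in zip(self.params, delta)]

    def merge_global(self, global_params: List[float],
                     alpha: float = 0.5) -> None:
        # Assumed rule: blend instead of overwrite, so progress made while
        # waiting on the GPS is not thrown away.
        # alpha = 1.0 would be a plain replace; alpha = 0.0 ignores the GPS.
        self.params = [(1.0 - alpha) * p + alpha * g
                       for p, g in zip(self.params, global_params)]


if __name__ == "__main__":
    lps = LocalParameterServer(params=[1.0, 2.0])
    # The LPS has pushed to the GPS; a worker update lands while it waits.
    lps.apply_worker_update([0.5, -1.0])
    # The GPS reply arrives and is merged with the newer local parameters.
    lps.merge_global([0.25, 1.5], alpha=0.5)
    print(lps.params)
```

The design point is that the LPS never blocks on the slow cross-region link: worker updates keep flowing in, and the GPS reply is folded into whatever the local parameters have become by then.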
Their open-source simulator currently supports DiLoCo, Async Local-SGD, and HALoS. It would not be hard to add Overlap Local-SGD or One-step-delay/Eager Update DiLoCo (although Streaming DiLoCo would require a major rewrite). I'll be building off it to run my experiments. (9/n)