Research Log Day 2: HALoS
Communication between regions is far slower than communication within a region (lower bandwidth, higher latency), and most previous DiLoCo variants don't take this into account. HALoS (Hierarchical Async Local SGD) addresses this by adding local and global parameter servers. (1/n)
For standard DiLoCo (and Streaming DiLoCo), we suffer from the straggler effect due to having to wait on workers with low-bandwidth network links. For variants that mask communication with computation, we suffer worse convergence due to either... (2/n)
requiring a higher number of inner steps to fully mask the communication (one-step-delay DiLoCo and variants) or taking more steps on stale local parameters while waiting for the updated parameters (Overlap Local-SGD). (3/n)
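To make the straggler effect in the synchronous case concrete, here is a minimal sketch. The per-worker timings are made-up numbers for illustration, not measurements, and the helper names are hypothetical.

```python
# Hypothetical per-round times in seconds for three workers.
compute_s = [10.0, 10.0, 10.0]
comm_s = [1.0, 1.0, 8.0]  # assumption: the third worker sits behind a slow link


def sync_round_time(compute, comm):
    """A synchronous round only ends when the slowest worker has finished."""
    return max(c + m for c, m in zip(compute, comm))


def idle_time_per_worker(compute, comm):
    """Time each worker spends waiting for the slowest worker."""
    round_time = sync_round_time(compute, comm)
    return [round_time - (c + m) for c, m in zip(compute, comm)]


round_time = sync_round_time(compute_s, comm_s)
idle = idle_time_per_worker(compute_s, comm_s)
```

With these numbers the two well-connected workers sit idle for a large share of every round, which is the cost the asynchronous variants try to avoid.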
To understand HALoS, we first need to understand Async Local-SGD. Instead of all workers synchronizing every H steps, each worker independently pushes its update to a parameter server as soon as it finishes its H local steps, without waiting on the others. (4/n)
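A minimal sketch of this push-when-done pattern, under several assumptions: a toy quadratic loss, plain SGD as the outer optimizer (DiLoCo-style methods typically use something fancier), and a hand-written event order in place of real concurrency. `ParameterServer` and `local_steps` are hypothetical names, not from any library.

```python
class ParameterServer:
    """Toy parameter server that applies each update as soon as it arrives."""

    def __init__(self, params, outer_lr=0.5):
        self.params = list(params)
        self.outer_lr = outer_lr
        self.version = 0  # number of updates applied so far

    def pull(self):
        return list(self.params)

    def push(self, pseudo_grad):
        # No barrier: one worker's update is applied without waiting for the rest.
        self.params = [p - self.outer_lr * g for p, g in zip(self.params, pseudo_grad)]
        self.version += 1


def local_steps(start, target, inner_lr=0.1, h=5):
    """Run h SGD steps on the toy loss 0.5 * ||w - target||^2.

    Returns the pseudo-gradient: start parameters minus final parameters.
    """
    w = list(start)
    for _ in range(h):
        w = [wi - inner_lr * (wi - ti) for wi, ti in zip(w, target)]
    return [s - wi for s, wi in zip(start, w)]


def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


target = [1.0, -2.0]
server = ParameterServer([0.0, 0.0])
initial_distance = distance(server.params, target)

slow_start = server.pull()  # slow worker pulls, then takes a long time
for _ in range(2):          # fast worker completes two full rounds meanwhile
    server.push(local_steps(server.pull(), target))
server.push(local_steps(slow_start, target))  # slow worker finally pushes

final_distance = distance(server.params, target)
```

The fast worker never waits for the slow one. The price is that the slow worker's update was computed from parameters the server has since moved past.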
While this means we don’t pay synchronization costs, we now suffer from staleness: the server applies each worker's outer update individually, as it arrives, instead of taking one step using the average, so an update may have been computed from parameters that are already out of date by the time it is applied. This staleness can lead to convergence issues. (5/n)
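One common way to quantify staleness is to count how many server updates were applied between a worker's pull and its push. The sketch below assumes that definition (it may not match the paper's exact one) and replays a hypothetical event trace with one fast and one slow worker.

```python
def staleness_trace(events):
    """Replay (worker, action) events, where action is "pull" or "push".

    Returns a list of (worker, staleness) pairs, one per push, where staleness
    is the number of server updates applied since that worker last pulled.
    """
    version = 0      # server version: number of updates applied so far
    pulled_at = {}   # worker -> server version at its last pull
    trace = []
    for worker, action in events:
        if action == "pull":
            pulled_at[worker] = version
        else:
            trace.append((worker, version - pulled_at[worker]))
            version += 1
    return trace


# Hypothetical schedule: the fast worker finishes two rounds per slow round.
events = [
    ("slow", "pull"),
    ("fast", "pull"),
    ("fast", "push"),
    ("fast", "pull"),
    ("fast", "push"),
    ("fast", "pull"),
    ("slow", "push"),
]
trace = staleness_trace(events)
```

In this trace the fast worker's updates are always fresh, while the slow worker's update is applied two versions late.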
HALoS doesn’t directly address the staleness issue. Instead, it focuses on minimizing compute idle time. HALoS introduces a Local Parameter Server (LPS) within each region and a Global Parameter Server (GPS), which merges updates across regions. (6/n)
One way to think about HALoS is that we are running multiple instances of Async Local-SGD, each having multiple workers within the same region. We treat each LPS as a normal Async Local-SGD worker and have another parameter server (the GPS) which they send updates to. (7/n)
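A rough sketch of this two-level structure. The class names, the `sync_every` trigger, and the choice to send the net parameter change since the last sync are all illustrative assumptions, not the paper's exact rules. The sync with the GPS is also treated as instantaneous here, so the LPS can simply adopt the returned global parameters.

```python
class GlobalParameterServer:
    """Top level: receives updates from LPSs, one per region."""

    def __init__(self, params, lr=1.0):
        self.params = list(params)
        self.lr = lr

    def push(self, delta):
        self.params = [p - self.lr * d for p, d in zip(self.params, delta)]
        return list(self.params)


class LocalParameterServer:
    """Per-region server: a parameter server to its workers, a worker to the GPS."""

    def __init__(self, gps, lr=1.0, sync_every=2):
        self.gps = gps
        self.params = list(gps.params)
        self.anchor = list(gps.params)  # parameters at the last global sync
        self.lr = lr
        self.sync_every = sync_every    # hypothetical trigger: worker updates per sync
        self.pending = 0

    def push(self, pseudo_grad):
        # Apply a worker's update immediately, as in Async Local-SGD.
        self.params = [p - self.lr * g for p, g in zip(self.params, pseudo_grad)]
        self.pending += 1
        if self.pending == self.sync_every:
            self.sync_with_gps()

    def sync_with_gps(self):
        # Send the net change since the last sync, like a worker's pseudo-gradient.
        delta = [a - p for a, p in zip(self.anchor, self.params)]
        self.params = self.gps.push(delta)  # instantaneous in this sketch
        self.anchor = list(self.params)
        self.pending = 0


gps = GlobalParameterServer([0.0])
lps_a = LocalParameterServer(gps)
lps_b = LocalParameterServer(gps)

for g in ([-1.0], [-1.0]):   # two worker updates arrive in region A
    lps_a.push(g)
for g in ([-0.5], [-0.5]):   # two worker updates arrive in region B
    lps_b.push(g)
```

After these pushes the GPS holds the combined progress of both regions, while `lps_a` still holds the older global parameters until its next sync.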
When an LPS communicates with the GPS, the round trip is slow because of the lower bandwidth and higher latency between regions, so the LPS continues to apply updates from its workers in the meantime. When it receives the updated parameters from the GPS, it merges them with its updated local parameters instead of overwriting its local parameters. (8/n)
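The thread does not say how the merge is done. One simple possibility, assumed here purely for illustration, is linear interpolation with a hypothetical weight `alpha`. The parameter values below are made up.

```python
def merge(local_params, global_params, alpha=0.5):
    """Interpolate between local and global parameters.

    alpha=1.0 keeps only the local parameters; alpha=0.0 reduces to plain
    replacement by the global parameters. The interpolation rule itself is
    an assumption, not a confirmed detail of HALoS.
    """
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_params, global_params)]


# The LPS kept applying worker updates while the GPS round trip was in flight.
local_now = [1.5, 0.5]
global_received = [2.0, 0.0]

merged = merge(local_now, global_received, alpha=0.5)
replaced = merge(local_now, global_received, alpha=0.0)
```

With replacement, the local progress made during the round trip is thrown away. With merging, part of it survives in the new local parameters.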