This was a huge headache in the early days of @MosaicML, so we built our tooling to seamlessly handle GPU failures. Our platform will detect a faulty node, pause training, cordon the node, sub in a spare, and resume from the most recent checkpoint. All without any human intervention.
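Conceptually, that recovery path is a supervision loop around the training job. Below is a minimal sketch of the detect/cordon/swap/resume cycle, with every helper (`run_training`, `cordon`, `swap_in_spare`, `latest_checkpoint`, `NodeFailure`) a hypothetical placeholder stubbed out for illustration; this is a sketch of the idea, not MosaicML's actual platform code.

```python
# A minimal sketch of an automated GPU-failure recovery loop. All names here
# are hypothetical placeholders for illustration, not MosaicML's real API.
import time
from typing import Optional


class NodeFailure(Exception):
    """Raised by a (hypothetical) health monitor when a GPU node goes bad."""
    def __init__(self, node_id: str) -> None:
        super().__init__(f"node {node_id} failed")
        self.node_id = node_id


def run_training(checkpoint: Optional[str]) -> None:
    """Launch (or resume) the job; raises NodeFailure on a hardware fault."""


def cordon(node_id: str) -> None:
    """Mark the node unschedulable so no new work lands on it."""


def swap_in_spare(node_id: str) -> None:
    """Replace the cordoned node with a healthy spare from the pool."""


def latest_checkpoint() -> Optional[str]:
    """Return the path of the most recent checkpoint, or None if none exists."""


def train_with_auto_recovery() -> None:
    checkpoint = latest_checkpoint()
    while True:
        try:
            run_training(checkpoint)  # blocks until the job finishes
            return                    # completed without incident
        except NodeFailure as failure:
            # The raised exception has already halted training; quarantine
            # the bad node, bring in a spare, and resume from the last save.
            cordon(failure.node_id)
            swap_in_spare(failure.node_id)
            checkpoint = latest_checkpoint()
            time.sleep(5)             # give the replacement time to join
```

The key design point is that the loop never asks a human for help: every failure is converted into a cordon, a swap, and a resume from the most recent checkpoint, so the only cost of a bad node is the work done since the last save.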
Hardware failures are common when training the largest machine learning models across thousands of GPUs. It is reminiscent of the early days of computing, when a vacuum tube burning out partway through a batch computation was a real concern.