This was a huge headache in the early days of @MosaicML, so we built our tooling to seamlessly handle GPU failures. Our platform will detect a faulty node, pause training, cordon the node, sub in a spare, and resume from the most recent checkpoint. All without any human intervention.
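Conceptually, that recovery path is a supervision loop around the training job. Below is a minimal sketch of the detect/cordon/swap/resume cycle, with every helper (`run_training`, `cordon`, `swap_in_spare`, `latest_checkpoint`, `NodeFailure`) a hypothetical placeholder stubbed out for illustration; this is a sketch of the idea, not MosaicML's actual platform code.

```python
# A minimal sketch of an automated GPU-failure recovery loop. All names here
# are hypothetical placeholders for illustration, not MosaicML's real API.
import time
from typing import Optional


class NodeFailure(Exception):
    """Raised by a (hypothetical) health monitor when a GPU node goes bad."""
    def __init__(self, node_id: str) -> None:
        super().__init__(f"node {node_id} failed")
        self.node_id = node_id


def run_training(checkpoint: Optional[str]) -> None:
    """Launch (or resume) the job; raises NodeFailure on a hardware fault."""


def cordon(node_id: str) -> None:
    """Mark the node unschedulable so no new work lands on it."""


def swap_in_spare(node_id: str) -> None:
    """Replace the cordoned node with a healthy spare from the pool."""


def latest_checkpoint() -> Optional[str]:
    """Return the path of the most recent checkpoint, or None if none exists."""


def train_with_auto_recovery() -> None:
    checkpoint = latest_checkpoint()
    while True:
        try:
            run_training(checkpoint)  # blocks until the job finishes
            return                    # completed without incident
        except NodeFailure as failure:
            # The raised exception has already halted training; quarantine
            # the bad node, bring in a spare, and resume from the last save.
            cordon(failure.node_id)
            swap_in_spare(failure.node_id)
            checkpoint = latest_checkpoint()
            time.sleep(5)             # give the replacement time to join
```

The key design point is that the loop never asks a human for help: every failure is converted into a cordon, a swap, and a resume from the most recent checkpoint, so the only cost of a bad node is the work done since the last save.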
Hardware failures are common when training the largest machine learning models across thousands of GPUs. It is reminiscent of the early days of computing, when a vacuum tube burning out partway through a batch computation was a real concern.