@leavittron
Matthew Leavitt
1 year
This was a huge headache in the early days of @MosaicML, so we built our tooling to seamlessly handle GPU failures. Our platform will detect a faulty node, pause training, cordon the node, sub in a spare, and resume from the most recent checkpoint. All w/o any human intervention.
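A minimal sketch of the recovery loop described above, assuming hypothetical cluster and trainer helpers (detect_faulty_nodes, cordon, swap_in_spare, latest_checkpoint, resume_from) rather than MosaicML's actual platform APIs:

```python
# Illustrative sketch of the automated recovery loop described in the tweet.
# The cluster/trainer objects and their methods are hypothetical stand-ins,
# not MosaicML's actual platform APIs.
import time

def monitor_and_recover(cluster, trainer, poll_interval_s=30):
    """Detect faulty nodes, cordon them, swap in spares, and resume training."""
    while not trainer.finished:
        bad_nodes = cluster.detect_faulty_nodes()  # e.g. ECC errors, NCCL timeouts
        if not bad_nodes:
            time.sleep(poll_interval_s)
            continue

        trainer.pause()                        # stop issuing new training steps
        for node in bad_nodes:
            cluster.cordon(node)               # keep the scheduler off the bad node
            cluster.swap_in_spare(node)        # bring a hot spare into the job

        ckpt = trainer.latest_checkpoint()     # most recent saved state
        trainer.resume_from(ckpt)              # restart training, no human in the loop
```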
@ID_AA_Carmack
John Carmack
1 year
Hardware failures are common while training the largest machine learning models across thousands of GPUs. It is similar to the elder days of computers, when a vacuum tube burning out during your batch computation was a real issue.

Replies

@leavittron
Matthew Leavitt
1 year
Pix or it didn't happen, you say?
[image attached]
@leavittron
Matthew Leavitt
1 year
There's also a pile of smoldering A100s in the alley behind the office, but I don't think it's polite to post gore on twitter
@code_star
Cody Blakeney
1 year
@leavittron @MosaicML Handling node failures with @MosaicML be like
@code_star
Cody Blakeney
1 year
@leavittron @MosaicML Don't forget the deterministic resumption thanks to our streaming datasets!
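A rough sketch of how deterministic resumption from a streaming dataset can work in principle: checkpoint the shuffle seed and the number of samples consumed, then replay the same seeded order past that point on restart. The ResumableSampleStream class and its methods are illustrative, not the mosaicml/streaming API:

```python
# Illustrative sketch only (not the mosaicml/streaming API). Deterministic
# resumption checkpoints the shuffle seed and the number of samples already
# consumed, then fast-forwards the same order on restart.
import random

class ResumableSampleStream:
    def __init__(self, num_samples: int, seed: int = 42):
        self.seed = seed
        self.order = list(range(num_samples))
        random.Random(seed).shuffle(self.order)  # fixed sample order for a given seed
        self.position = 0                        # samples consumed so far

    def state_dict(self):
        return {"seed": self.seed, "position": self.position}

    def load_state_dict(self, state):
        assert state["seed"] == self.seed, "seed must match for a deterministic resume"
        self.position = state["position"]        # skip samples already seen

    def __iter__(self):
        for idx in self.order[self.position:]:
            self.position += 1
            yield idx                            # in practice: read sample idx from a shard
```

On resume, loading the saved state reproduces the exact remaining sample order, so the restarted run sees the same data it would have seen had the failure never happened.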