
Matej Sirovatka
@m_sirovatka
Followers
714
Following
550
Media
38
Statuses
398
I think hyper-params won't help me with this loss curve (I hate gradient accumulation).
What you can't build, you don't understand. Well, apparently I can't build a toy pre-training framework? I need a refresher on current pre-training trends, any good papers? (looking at you @eliebakouch)
1
1
9
What you can't build, you don't understand. Well, apparently I can't build a toy pre-training framework? I need a refresher on current pre-training trends, any good papers? (looking at you @eliebakouch)
1
0
15
learning CuTe DSL before losing your mind over complement in layout algebra.
4
0
27
A question to my hw-rich friends: I'm currently GPU rich-ish, how can I become TPU rich-ish (I hate ssh-ing into a Colab instance)? I just need a little, 4/8 TPUs to run funny stuff on.
7
0
42
You have to wait for the best. Incredibly honoured to be a part of this course, together with this bunch of cool people.
Day 14 of 14 Days of Distributed! We've still got a number of cool people to talk about since we started this list, so today we're going to rapid-fire them all (in no particular order)! Let's buckle up and go! @winglian @FerdinandMom @m_sirovatka @mervenoyann @charles_irl
1
0
7
Holidays going great, exactly 1 day without work. Btw, tune in to @GPU_MODE for a talk about PCCL from @PrimeIntellect in 30 min.
1
1
82
RT @AIatAMD: Calling all GPU & AI developers, it’s go time! Join the AMD Developer Challenge 2025! Optimize multi-GPU kernels, win prizes…
0
10
0
RT @_marcsun: Happy to participate in the online course by my mentor @TheZachMueller! The topic of my talk will be efficient distributed i…
0
5
0
something’s cooking.
Oct 17 at Toronto School of Foundation Modelling: @m_sirovatka will talk about model sharding, network topologies of large-scale clusters, and how these pieces connect.
0
2
15
The competition runs for 6 weeks, starting August 30th, after which AMD will fly the winners out for a celebration in the US! As per usual, the grand prize is $100k 💰, with smaller prizes for other top contestants 👀. Register here rn!
amdchallenge2025.datamonsters.com
In this challenge sponsored by Advanced Micro Devices, Inc. (“AMD”), participants are invited to form up to a 3-member team to develop and optimize low-level kernels and deliver significant perform...
0
1
4
After 2 weeks, we're taking a detour to tensor parallelism, optimising GEMM + Reduce Scatter, and finishing up with All-Gather + GEMM, covering the most common parallelisms in large model training and inference 📈. All on a full node of MI300s 🐳.
1
1
5
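The two kernel patterns named in the post above are standard tensor-parallel building blocks. Here is a minimal sketch of the communication structure using plain torch.distributed collectives, not the challenge's reference kernels; the function names, tensor shapes, and the file name in the launch comment are my own illustrative choices.

```python
# Minimal sketch of All-Gather + GEMM and GEMM + Reduce-Scatter with plain
# torch.distributed. Launch with: torchrun --nproc_per_node=8 tp_sketch.py
import os
import torch
import torch.distributed as dist

def all_gather_gemm(x_shard, w_col):
    """All-Gather + GEMM: gather the sequence shards from every rank, then run
    the column-parallel matmul on the full activation."""
    world = dist.get_world_size()
    gathered = [torch.empty_like(x_shard) for _ in range(world)]
    dist.all_gather(gathered, x_shard)
    x = torch.cat(gathered, dim=0)   # (world * seq_shard, hidden)
    return x @ w_col                 # each rank holds a column slice of the weight

def gemm_reduce_scatter(x, w_row):
    """GEMM + Reduce-Scatter: the row-parallel matmul produces a partial sum that
    is reduced across ranks and scattered back along the sequence dimension."""
    partial = x @ w_row
    out_rows = partial.shape[0] // dist.get_world_size()
    out = torch.empty(out_rows, partial.shape[1],
                      dtype=partial.dtype, device=partial.device)
    dist.reduce_scatter_tensor(out, partial)
    return out

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dev = torch.cuda.current_device()
    x_shard = torch.randn(128, 1024, device=dev)   # this rank's sequence shard
    w_col = torch.randn(1024, 512, device=dev)     # column slice of W1
    w_row = torch.randn(512, 1024, device=dev)     # row slice of W2
    y = gemm_reduce_scatter(all_gather_gemm(x_shard, w_col), w_row)
    dist.destroy_process_group()
```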
We are gonna give you a FULL 8xMI300 node, all for free, to write the fastest kernels! The competition is gonna last for 6 weeks, with a problem being released every 2 weeks. We're starting August 30th with All2All Dispatch + Combine across 8 GPUs, to make MoEs go brrr ⚡️.
1
0
3
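For a feel of the All2All Dispatch + Combine pattern mentioned above, here is a rough sketch with plain torch.distributed rather than the challenge's actual reference kernel. It assumes every rank sends an equal number of tokens to every expert rank and skips gating and top-k routing entirely; the `expert` callable is a placeholder for the local expert FFN.

```python
# Minimal sketch of MoE dispatch/combine across ranks with two all_to_all calls.
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens, expert):
    """tokens: (num_tokens, hidden) on this rank, pre-sorted so chunk i is
    destined for expert rank i (num_tokens must be divisible by world size)."""
    world = dist.get_world_size()
    hidden = tokens.shape[-1]

    # Dispatch: chunk i of every rank's buffer is sent to rank i.
    send = tokens.reshape(world, -1, hidden).contiguous()
    recv = torch.empty_like(send)
    dist.all_to_all_single(recv, send)

    # Every token now sits on the rank that hosts its expert.
    out = expert(recv.reshape(-1, hidden))

    # Combine: the reverse all_to_all routes expert outputs back to the owner ranks.
    back_send = out.reshape(world, -1, hidden).contiguous()
    back_recv = torch.empty_like(back_send)
    dist.all_to_all_single(back_recv, back_send)
    return back_recv.reshape_as(tokens)
```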
To learn more about context parallelism, you can also read our doc 📖.
github.com
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed...
0
0
5
We fully integrated N-D Parallelism into Trainer, supporting any configuration you might like, including FSDP, tensor parallel and so on 📈. You can find a full example of how to use this in the accelerate repository.
github.com
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed...
1
0
10
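To see what an N-D parallelism configuration composes, here is a minimal sketch using PyTorch's DeviceMesh directly. This is not accelerate's own configuration surface (the linked repository has the real Trainer example); the 2 x 4 x 2 mesh shape and the dimension names are illustrative only.

```python
# Minimal sketch of composing parallel dimensions on a device mesh.
# Launch under torchrun with 16 processes so the mesh shape below fits.
from torch.distributed.device_mesh import init_device_mesh

# 16 GPUs: 2-way FSDP sharding x 4-way tensor parallel x 2-way context parallel.
mesh = init_device_mesh("cuda", (2, 4, 2), mesh_dim_names=("dp_shard", "tp", "cp"))

fsdp_mesh = mesh["dp_shard"]   # hand to fully_shard / FSDP for parameter sharding
tp_mesh = mesh["tp"]           # hand to parallelize_module for tensor parallelism
cp_mesh = mesh["cp"]           # the ranks that will split the sequence dimension
```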
Context parallelism in 🤗 transformers Trainer? Training models at 100k+ sequence lengths has never been easier 🚀
3
17
129
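To make that concrete: context parallelism shards the sequence dimension across a group of ranks, so each GPU only materializes its slice of a very long example. Below is a minimal, illustrative sketch of the input-sharding step; the `shard_sequence` helper and the dict-of-tensors batch format are my own simplification, not the Trainer's internal code, and the attention exchange itself (ring attention or all-gathered KV) lives inside the framework and is not shown.

```python
# Minimal sketch of the input sharding behind context parallelism.
import torch
import torch.distributed as dist

def shard_sequence(batch: dict, cp_group) -> dict:
    """Keep only this rank's contiguous chunk of the sequence dimension (dim=1)."""
    cp_size = dist.get_world_size(cp_group)
    cp_rank = dist.get_rank(cp_group)
    sharded = {}
    for name, t in batch.items():
        chunk = t.shape[1] // cp_size   # assumes seq_len is divisible by cp_size
        sharded[name] = t[:, cp_rank * chunk:(cp_rank + 1) * chunk]
    return sharded

# e.g. with an 8-way CP group, input_ids of shape (1, 131072) becomes (1, 16384)
# per GPU, and full-context attention is reconstructed by the framework without
# any single rank ever holding the activations for the whole 131072-token sequence.
```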