Abi Aryan
@GoAbiAryan
Followers: 7K
Following: 15K
Media: 460
Statuses: 3K
Founder @AbideAI | ML Engineer | Book Author: LLMOps (2025), GPU Engg for AI Systems (2026) | Talk to me about LLMs, MLSys & GPU Training
Joined October 2013
No one teaches this, but this is what really happens when you hit `run` on an LLM. User → API → Engines → Multi-GPU → CUDA → Hardware. I mapped every layer (100+ components) of the LLM Inference Stack so you can finally see the full picture. Full blogpost coming soon!
3
7
33
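To make the top of that stack concrete, here is a minimal sketch of the User → API hop only, assuming a locally running OpenAI-compatible engine such as vLLM; the port, endpoint path, and model name are illustrative, not from the post. Everything below the HTTP boundary (engine batching, multi-GPU sharding, CUDA kernels, hardware) hides behind this one call.

```python
# Minimal sketch of the User -> API layer of the inference stack.
# Assumes a hypothetical OpenAI-compatible server (e.g. vLLM) on localhost:8000.
import json
import urllib.request

def complete(prompt: str, url: str = "http://localhost:8000/v1/completions") -> str:
    payload = json.dumps({
        "model": "my-model",   # placeholder model name
        "prompt": prompt,
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("What really happens when I hit run on an LLM?"))
```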
Finally, some moves were made on AMD kernels
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
0
0
2
For the first time in history, we can engineer political systems in which people can coordinate without centralised authorities or methods of control. A decentralised future is upon us. And Farewell to Westphalia will help you to navigate it. Out now:
0
107
1K
I recently mapped out PyTorch internals end to end. Some things this chart highlights, details that don't get discussed often or that people get wrong quite a lot: 1. fullgraph=True only works for prefill, because it breaks during decoding from control flow and dynamic shapes that
0
1
7
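A minimal repro of the fullgraph=True point above, assuming PyTorch 2.x; the two functions are toy stand-ins, not taken from the chart. Static tensor ops compile as one graph, while branching on a tensor's value (typical of a decode loop) forces a graph break that fullgraph=True refuses to tolerate.

```python
# Toy contrast: prefill-like static compute vs decode-like data-dependent branching.
import torch

def prefill_like(x: torch.Tensor) -> torch.Tensor:
    # Pure tensor ops, static control flow: capturable as a single graph.
    return torch.softmax(x @ x.transpose(-1, -2), dim=-1)

def decode_like(x: torch.Tensor) -> torch.Tensor:
    # Branching on a tensor's value introduces a graph break.
    if x.sum() > 0:
        return x * 2
    return x - 1

compiled_prefill = torch.compile(prefill_like, fullgraph=True)
compiled_decode = torch.compile(decode_like, fullgraph=True)

x = torch.randn(4, 8)
print(compiled_prefill(x).shape)   # works: one graph, no breaks

try:
    compiled_decode(x)             # raises: graph break disallowed under fullgraph=True
except Exception as e:
    print(type(e).__name__, "- fullgraph=True cannot survive this graph break")
```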
Added to the Art of Debugging book a new section on getting a program's CPU peak memory usage with the little-known /usr/bin/time (not to be confused with bash's built-in time) https://t.co/LKlQujjyXt While demonstrating when it works and when it doesn't, I discuss why we care for
3
22
273
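A quick sketch of the /usr/bin/time trick from the tweet above, assuming Linux with GNU time installed at /usr/bin/time; the child command and the ~200 MB allocation are illustrative, and the numbers will differ on your machine.

```python
# Run a child process under GNU time (-v) and report its peak RSS.
# Note: this is the external /usr/bin/time binary, not bash's built-in `time`.
import subprocess

cmd = ["python3", "-c", "x = bytearray(200 * 1024 * 1024)"]  # ~200 MB allocation
result = subprocess.run(
    ["/usr/bin/time", "-v", *cmd],
    capture_output=True, text=True, check=True,
)
# GNU time writes its report to stderr.
for line in result.stderr.splitlines():
    if "Maximum resident set size" in line:
        print(line.strip())   # e.g. "Maximum resident set size (kbytes): ..."
```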
In the blogpost, I talk about the causes of GPU idling and what to do about them. Full link:
0
0
1
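As a rough way to check for the idling the post describes, here is a minimal sampling sketch, assuming an NVIDIA driver plus the nvidia-ml-py (pynvml) package; the 10% threshold, 1 Hz rate, and one-minute window are arbitrary choices, not from the blogpost.

```python
# Sample SM utilization per GPU via NVML to estimate idle time.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples, idle = [0] * len(handles), [0] * len(handles)
for _ in range(60):                                          # ~1 minute at 1 Hz
    for i, h in enumerate(handles):
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu   # 0-100 %
        samples[i] += 1
        if util < 10:                                        # treat <10% SM util as idle
            idle[i] += 1
    time.sleep(1)

for i, (n, k) in enumerate(zip(samples, idle)):
    print(f"GPU {i}: idle for {100 * k / n:.0f}% of sampled time")

pynvml.nvmlShutdown()
```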
My next ModelCraft blogpost is out: The Hidden GPU Crisis in AI Infrastructure. Most AI companies waste HALF their GPUs and don't even know it. While everyone's scrambling for the next H100 drop, billions in compute sits idle - burning cash, power, and time. The AI gold
2
2
9
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training and infra. 200+ pages of what worked, what didn't, and how to make it run reliably https://t.co/iN2JtWhn23
119
899
6K
When LLMs slow down, the problem isn't always your compute, it's your GPUs. At ODSC AI West 2025, @GoAbiAryan, @abideai, will show how to engineer faster, more efficient AI systems in her talk. Register now: https://t.co/kr21bvpdcg
0
2
6
Current RL benchmarks test generalization across episodes by training on one world and testing on a slightly different one. AbideGym goes further by testing generalization within an episode, forcing policies to adapt on the fly. Read the full paper at
arxiv.org
Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks. AbideGym, a dynamic MiniGrid wrapper, introduces...
0
0
1
AbideGym introduces agent-aware perturbations, meaning changes triggered by the agent's own behavior. How It Works: 1. Timeout-based perturbations 2. Dynamic resizing
1
0
1
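A minimal sketch in the spirit of those perturbations, not the actual AbideGym implementation: a generic gymnasium wrapper that fires a user-supplied perturbation once the agent has taken too many steps in the current episode. For MiniGrid, perturb_fn could re-place the goal or resize the grid; the concrete perturbations AbideGym ships are described in the paper.

```python
# Generic timeout-triggered mid-episode perturbation wrapper (illustrative only).
import gymnasium as gym

class TimeoutPerturbation(gym.Wrapper):
    def __init__(self, env, trigger_step: int, perturb_fn):
        super().__init__(env)
        self.trigger_step = trigger_step
        self.perturb_fn = perturb_fn     # e.g. move the goal, resize the grid
        self._t = 0
        self._fired = False

    def reset(self, **kwargs):
        self._t, self._fired = 0, False
        return self.env.reset(**kwargs)

    def step(self, action):
        self._t += 1
        if not self._fired and self._t >= self.trigger_step:
            self.perturb_fn(self.env.unwrapped)   # change the world mid-episode
            self._fired = True
        return self.env.step(action)
```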
AbideGym was created to address exactly that. AbideGym is a dynamic wrapper for the popular MiniGrid environment that turns static RL benchmarks into adaptive and evolving challenges.
1
0
1
One of the biggest open challenges in RL is adaptation, or how well an agent adjusts when the world around it changes. Most RL agents today look brilliant in static environments but collapse when the rules shift even slightly, as they do in real-world settings.
1
0
4
Without system-level understanding, you make mistakes like:
- optimizing kernels when the bottleneck is the data pipeline
- fusing ops when the real bottleneck is NCCL comm
- switching to FP16 when you are compute-bound, not memory-bound
Tool choice should be the last step, not the
0
1
4
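A minimal sketch of that "profile first" step, using torch.profiler on a toy model (nothing here is from the thread): the point is to read which bucket dominates a step, host-side copies and data handling, GPU kernels, or communication, before reaching for any tool.

```python
# Profile a few toy training steps and print the top operators by self time.
import torch
from torch.profiler import profile, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(256, 1024) for _ in range(20)]   # toy stand-in for a dataloader

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for batch in data:
        batch = batch.to(device, non_blocking=True)
        loss = model(batch).square().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# If copies / host-side ops dominate, kernel-level work is the wrong lever.
sort_key = "self_cuda_time_total" if device == "cuda" else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```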
All done before:
- profiling
- understanding compute vs memory vs IO limits
- checking if the GPU is even starved
GPU Engineering is not "do CUDA." It is: model the system → profile → intervene with the smallest lever.
1
0
3
Here's how you spot people doing it: ask them why they did it. Some patterns:
1. We are writing a custom CUDA kernel because top labs do it.
2. We use FlashAttention because that's state of the art and XYZ paper said so.
3. Let's rewrite the dataloader in C++ for speed.
Wrong!
2
0
4
In WWE terms, it's called cargo-cult engineering, i.e. copying the surface actions of experts without the causal model that made those actions correct.
1
1
3
Most engineers think the path is: learn PyTorch → learn CUDA → then you're valuable. Wrong. The actual path is: understand the system → find the bottleneck → then choose the tool. Tools without a system model are just mindless mimicking.
1
0
6
Why models don't get hallucination right: there's a fundamental difference between catastrophic forgetting in humans and in models. For example, as of today, I am forgetting what I am forgetting, but my brain remembers that I am forgetting something. But when it comes
0
0
4
A 1-petaflop desktop workstation that's blurring the boundary between developer and datacenter. Yes, I am talking about the Nvidia DGX Spark. I'd love to benchmark it for real-world LLMOps, GPU scheduling, and distributed inference pipelines and all the stuff that actually
1
0
6