Abi Aryan

@GoAbiAryan

Followers 7K · Following 15K · Media 460 · Statuses 3K

🛠️ Founder @AbideAI · ML Engineer 👩‍💻☕ · 📚 Book Author: LLMOps (2025), ✍️ GPU Engg for AI Systems (2026) · 💬 Talk to me about LLMs, MLSys & GPU Training

🇵🇹🇺🇸
Joined October 2013
@GoAbiAryan
Abi Aryan
15 days
No one teaches this, but this is what really happens when you hit `run` on an LLM. User → API → Engines → Multi-GPU → CUDA → Hardware. I mapped every layer (100+ components) of the LLM Inference Stack so you can finally see the full picture. Full blogpost coming soon!
3
7
33
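A minimal sketch of the top of that stack (my illustration, not from the forthcoming post): the only layer most users ever touch is the API call, and everything below it is implied. The endpoint, port, and model name here are assumptions, e.g. a local OpenAI-compatible server such as one started with `vllm serve`.

```python
# Hypothetical client call: User -> API. The serving engine, multi-GPU
# scheduling, CUDA kernels, and the hardware all sit beneath this one request.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",   # assumed local OpenAI-compatible server
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 16},
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```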
@GoAbiAryan
Abi Aryan
3 days
Finally, some 🔥 moves were made on AMD kernels
@_williamhu
William Hu
4 days
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
0
0
2
@Logos_network
Logos
2 months
For the first time in history, we can engineer political systems in which people can coordinate without centralised authorities or methods of control. A decentralised future is upon us. And Farewell to Westphalia will help you to navigate it. Out now:
0
107
1K
@GoAbiAryan
Abi Aryan
4 days
I recently mapped out PyTorch internals end to end. Some things this chart highlights, i.e. details that don't get discussed often or that people often get wrong: 1️⃣ fullgraph=True only works for prefill, because it breaks during decoding from control flow and dynamic shapes that
0
1
7
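A toy sketch of that prefill/decode distinction (my own, under stated assumptions: the module and shapes are made up, and real decode paths hit graph breaks inside the model via KV-cache updates and data-dependent stopping, which this toy only stands in for):

```python
import torch

class TinyLM(torch.nn.Module):
    """Stand-in model: embedding + linear head, no KV cache."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):                 # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.emb(ids))

model = TinyLM()

# Prefill: one fixed-shape forward pass, so capturing a single full graph is viable.
prefill = torch.compile(model, fullgraph=True)
logits = prefill(torch.randint(0, 100, (1, 16)))

# Decode: the loop, the growing sequence length, and the data-dependent stop
# logic are exactly the kind of control flow / dynamic shapes that trip
# fullgraph=True in real models, so decode is usually compiled without it.
decode = torch.compile(model)
ids = torch.randint(0, 100, (1, 16))
for _ in range(8):
    next_id = decode(ids)[:, -1].argmax(-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)  # shape changes every step
```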
@StasBekman
Stas Bekman
7 days
Added to the Art of Debugging book a new section on getting a program's peak CPU memory usage with the little-known /usr/bin/time (not to be confused with bash's built-in time) https://t.co/LKlQujjyXt While demonstrating when it works and when it doesn't, I discuss why we care for
3
22
273
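For context, a small hedged sketch of the trick being described (my example, assuming GNU time is installed at /usr/bin/time; macOS ships a BSD time without -v):

```python
# /usr/bin/time -v (GNU time) prints "Maximum resident set size" to stderr,
# which bash's built-in `time` does not report.
import subprocess

cmd = ["/usr/bin/time", "-v", "python", "-c", "x = bytearray(200 * 1024 * 1024)"]
err = subprocess.run(cmd, capture_output=True, text=True).stderr
peak = next(line for line in err.splitlines() if "Maximum resident set size" in line)
print(peak.strip())  # e.g. "Maximum resident set size (kbytes): ..."
```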
@GoAbiAryan
Abi Aryan
11 days
In the blogpost, I talk about what causes GPU idling and what to do about it. Full link:
0
0
1
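A rough sketch of one way to spot the problem (mine, not from the blogpost; assumes an NVIDIA GPU, a working driver, and the pynvml bindings):

```python
# Sample SM utilization at 1 Hz during a run; long stretches near 0% while a
# job is "training" usually point to input-pipeline or communication stalls.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # GPU 0
samples = []
for _ in range(30):                                # ~30 s of sampling
    samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    time.sleep(1)
pynvml.nvmlShutdown()
print(f"mean SM utilization over 30 s: {sum(samples) / len(samples):.1f}%")
```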
@GoAbiAryan
Abi Aryan
11 days
My next ModelCraft blogpost is out - The Hidden GPU Crisis in AI Infrastructure 🚨 Most AI companies waste HALF their GPUs and don't even know it. While everyone's scrambling for the next H100 drop, billions in compute sits idle - burning cash, power, and time. The AI gold
2
2
9
@eliebakouch
elie
16 days
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training, and infra. 200+ pages of what worked, what didn't, and how to make it run reliably https://t.co/iN2JtWhn23
119
899
6K
@_odsc
ODSC (Open Data Science Conference) AI
23 days
When LLMs slow down, the problem isn't always your compute – it's your GPUs. At ODSC AI West 2025, @GoAbiAryan, @abideai, will show how to engineer faster, more efficient AI systems in her talk. 🔗 Register now → https://t.co/kr21bvpdcg
0
2
6
@GoAbiAryan
Abi Aryan
23 days
Current RL benchmarks test generalization across episodes by training on one world and testing on a slightly different one. AbideGym goes further by testing generalization within an episode, forcing policies to adapt on the fly. Read the full paper at
arxiv.org
Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks. AbideGym, a dynamic MiniGrid wrapper, introduces...
0
0
1
@GoAbiAryan
Abi Aryan
23 days
AbideGym introduces agent-aware perturbations, meaning changes triggered by the agent's own behavior. How it works: 1. Timeout-based perturbations 2. Dynamic resizing
1
0
1
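To make the idea concrete, here is an illustrative toy wrapper (my own sketch of the timeout-style mechanism only; it is not the AbideGym API, and the class name, threshold, and environment id are assumptions):

```python
# Toy agent-aware perturbation: if the agent stalls for `patience` steps, the
# world is re-randomized mid-episode, so a memorized policy has to adapt
# within the same episode rather than only across episodes.
import gymnasium as gym

class TimeoutPerturbation(gym.Wrapper):
    def __init__(self, env, patience=50):
        super().__init__(env)
        self.patience = patience            # steps without reward before perturbing
        self.stalled = 0

    def reset(self, **kwargs):
        self.stalled = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.stalled = 0 if reward > 0 else self.stalled + 1
        if self.stalled >= self.patience and not (terminated or truncated):
            obs, _ = self.env.reset()       # perturb: reshuffle the layout in-episode
            self.stalled = 0
            info["perturbed"] = True
        return obs, reward, terminated, truncated, info

# Usage (requires the `minigrid` package to register the env id):
# env = TimeoutPerturbation(gym.make("MiniGrid-Empty-8x8-v0"), patience=50)
```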
@GoAbiAryan
Abi Aryan
23 days
AbideGym was created to address exactly that: a dynamic wrapper for the popular MiniGrid environment that turns static RL benchmarks into adaptive, evolving challenges.
1
0
1
@GoAbiAryan
Abi Aryan
23 days
One of the biggest open challenges in RL is adaptation: how well an agent adjusts when the world around it changes. Most RL agents today look brilliant in static environments but collapse when the rules shift even slightly, as they do in real-world settings.
1
0
4
@GoAbiAryan
Abi Aryan
24 days
Without system-level understanding, you make mistakes like:
– optimizing kernels when the bottleneck is the data pipeline
– fusing ops when the real bottleneck is NCCL comm
– switching to FP16 when you are compute-bound, not memory-bound
Tool choice should be the last step, not the
0
1
4
@GoAbiAryan
Abi Aryan
24 days
All done before:
– profiling
– understanding compute vs memory vs IO limits
– checking if the GPU is even starved
GPU engineering is not "do CUDA." It is: model the system → profile → intervene with the smallest lever.
1
0
3
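A minimal sketch of what "profile first" can look like (my assumption of a starting point, not the thread's recipe; the toy model and sort key are placeholders):

```python
# Let the profiler say where the time goes before choosing a tool: if host-side
# data prep or launch overhead dominates, a hand-written kernel won't help.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(64, 1024)

with profile(activities=[ProfilerActivity.CPU],   # add ProfilerActivity.CUDA on GPU
             record_shapes=True) as prof:
    for _ in range(10):
        model(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```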
@GoAbiAryan
Abi Aryan
24 days
Here's how you spot people who don't get it - ask them why they did it. Some patterns: 1. We are writing a custom CUDA kernel because top labs do it. 2. We use FlashAttention because that's state of the art and XYZ paper said so. 3. Let's rewrite the dataloader in C++ for speed. Wrong!
2
0
4
@GoAbiAryan
Abi Aryan
24 days
In WWE terms, it's called cargo-cult engineering, i.e. copying the surface actions of experts without the causal model that made those actions correct.
1
1
3
@GoAbiAryan
Abi Aryan
24 days
Most engineers think the path is: learn PyTorch → learn CUDA → then you're valuable. Wrong. The actual path is: understand the system → find the bottleneck → then choose the tool. Using tools without a system model is just mindless mimicry.
1
0
6
@GoAbiAryan
Abi Aryan
30 days
why models don't get hallucination right.. there's a fundamental difference in catastrophic forgetting between humans and models.. e.g. as of today, I am forgetting what I am forgetting.. but my brain remembers that I am forgetting something.. but when it comes
0
0
4
@GoAbiAryan
Abi Aryan
1 month
A 1 petaflop desktop workstation that's blurring the boundary between developer and datacenter 💪. Yes, I am talking about the Nvidia DGX Spark 🙈 I'd love to benchmark it for real-world LLMOps, GPU scheduling, and distributed inference pipelines and all the stuff that actually
1
0
6