Abi Aryan

@GoAbiAryan

Followers 7K · Following 15K · Media 460 · Statuses 3K

🛠️ Founder @AbideAI · ML Engineer 👩‍💻☕ · 📚 Book Author: LLMOps (2025), ✍️ GPU Engg for AI Systems (2026) · 💬 Talk to me about LLMs, MLSys & GPU Training

🇵🇹🇺🇸
Joined October 2013
@GoAbiAryan
Abi Aryan
15 days
No one teaches this, but this is what really happens when you hit `run` on an LLM. User → API → Engines → Multi-GPU → CUDA → Hardware. I mapped every layer (100+ components) of the LLM Inference Stack so you can finally see the full picture. Full blogpost coming soon!
3
7
33
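A minimal sketch of the top of that stack (my illustration, not from the forthcoming post): the only layer most users ever touch is the API call, and everything below it is implied. The endpoint, port, and model name here are assumptions, e.g. a local OpenAI-compatible server such as one started with `vllm serve`.

```python
# Hypothetical client call: User -> API. The serving engine, multi-GPU
# scheduling, CUDA kernels, and the hardware all sit beneath this one request.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",   # assumed local OpenAI-compatible server
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 16},
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```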
@GoAbiAryan
Abi Aryan
3 days
Finally, some 🔥 moves were made on AMD kernels
@_williamhu
William Hu
4 days
AI is compute-hungry. While it has generally relied on a single hardware vendor in the past, AMD GPUs now offer competitive memory and compute throughput. Yet, the software stack is brittle. So we ask: can the same DSL principles that simplified NVIDIA kernel dev translate to
0
0
2
@Logos_network
Logos
2 months
For the first time in history, we can engineer political systems in which people can coordinate without centralised authorities or methods of control. A decentralised future is upon us. And Farewell to Westphalia will help you to navigate it. Out now:
0
107
1K
@GoAbiAryan
Abi Aryan
4 days
I recently mapped out PyTorch internals end to end. Some things this chart highlights, i.e. details that don't get discussed often or that people often get wrong: 1️⃣ fullgraph=True only works for prefill, because it breaks during decoding from control flow and dynamic shapes that
0
1
7
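A toy sketch of that prefill/decode distinction (my own, under stated assumptions: the module and shapes are made up, and real decode paths hit graph breaks inside the model via KV-cache updates and data-dependent stopping, which this toy only stands in for):

```python
import torch

class TinyLM(torch.nn.Module):
    """Stand-in model: embedding + linear head, no KV cache."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):                 # (batch, seq) -> (batch, seq, vocab)
        return self.head(self.emb(ids))

model = TinyLM()

# Prefill: one fixed-shape forward pass, so capturing a single full graph is viable.
prefill = torch.compile(model, fullgraph=True)
logits = prefill(torch.randint(0, 100, (1, 16)))

# Decode: the loop, the growing sequence length, and the data-dependent stop
# logic are exactly the kind of control flow / dynamic shapes that trip
# fullgraph=True in real models, so decode is usually compiled without it.
decode = torch.compile(model)
ids = torch.randint(0, 100, (1, 16))
for _ in range(8):
    next_id = decode(ids)[:, -1].argmax(-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)  # shape changes every step
```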
@StasBekman
Stas Bekman
7 days
Added to the Art of Debugging book a new section on getting a program's peak CPU memory usage with the little-known /usr/bin/time (not to be confused with bash's built-in time) https://t.co/LKlQujjyXt While demonstrating when it works and when it doesn't, I discuss why we care for
3
22
273
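For context, a small hedged sketch of the trick being described (my example, assuming GNU time is installed at /usr/bin/time; macOS ships a BSD time without -v):

```python
# /usr/bin/time -v (GNU time) prints "Maximum resident set size" to stderr,
# which bash's built-in `time` does not report.
import subprocess

cmd = ["/usr/bin/time", "-v", "python", "-c", "x = bytearray(200 * 1024 * 1024)"]
err = subprocess.run(cmd, capture_output=True, text=True).stderr
peak = next(line for line in err.splitlines() if "Maximum resident set size" in line)
print(peak.strip())  # e.g. "Maximum resident set size (kbytes): ..."
```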
@GoAbiAryan
Abi Aryan
11 days
In the blogpost, I talk about what causes GPU idling and what to do about it. Full link:
0
0
1
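A rough sketch of one way to spot the problem (mine, not from the blogpost; assumes an NVIDIA GPU, a working driver, and the pynvml bindings):

```python
# Sample SM utilization at 1 Hz during a run; long stretches near 0% while a
# job is "training" usually point to input-pipeline or communication stalls.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # GPU 0
samples = []
for _ in range(30):                                # ~30 s of sampling
    samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    time.sleep(1)
pynvml.nvmlShutdown()
print(f"mean SM utilization over 30 s: {sum(samples) / len(samples):.1f}%")
```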
@GoAbiAryan
Abi Aryan
11 days
My next ModelCraft blogpost is out - The Hidden GPU Crisis in AI Infrastructure 🚨 Most AI companies waste HALF their GPUs and don't even know it. While everyone's scrambling for the next H100 drop, billions in compute sits idle - burning cash, power, and time. The AI gold
2
2
9
@eliebakouch
elie
16 days
Training LLMs end to end is hard. Very excited to share our new blog (book?) that covers the full pipeline: pre-training, post-training, and infra. 200+ pages of what worked, what didn't, and how to make it run reliably https://t.co/iN2JtWhn23
119
899
6K
@_odsc
ODSC (Open Data Science Conference) AI
23 days
When LLMs slow down, the problem isn't always your compute – it's your GPUs. At ODSC AI West 2025, @GoAbiAryan, @abideai, will show how to engineer faster, more efficient AI systems in her talk. 🔗 Register now → https://t.co/kr21bvpdcg
0
2
6
@GoAbiAryan
Abi Aryan
23 days
Current RL benchmarks test generalization across episodes by training on one world and testing on a slightly different one. AbideGym goes further by testing generalization within an episode, forcing policies to adapt on the fly. Read the full paper at
arxiv.org
Agents trained with reinforcement learning often develop brittle policies that fail when dynamics shift, a problem amplified by static benchmarks. AbideGym, a dynamic MiniGrid wrapper, introduces...
0
0
1
@GoAbiAryan
Abi Aryan
23 days
AbideGym introduces agent-aware perturbations, meaning changes triggered by the agent's own behavior. How it works: 1. Timeout-based perturbations 2. Dynamic resizing
1
0
1
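To make the idea concrete, here is an illustrative toy wrapper (my own sketch of the timeout-style mechanism only; it is not the AbideGym API, and the class name, threshold, and environment id are assumptions):

```python
# Toy agent-aware perturbation: if the agent stalls for `patience` steps, the
# world is re-randomized mid-episode, so a memorized policy has to adapt
# within the same episode rather than only across episodes.
import gymnasium as gym

class TimeoutPerturbation(gym.Wrapper):
    def __init__(self, env, patience=50):
        super().__init__(env)
        self.patience = patience            # steps without reward before perturbing
        self.stalled = 0

    def reset(self, **kwargs):
        self.stalled = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.stalled = 0 if reward > 0 else self.stalled + 1
        if self.stalled >= self.patience and not (terminated or truncated):
            obs, _ = self.env.reset()       # perturb: reshuffle the layout in-episode
            self.stalled = 0
            info["perturbed"] = True
        return obs, reward, terminated, truncated, info

# Usage (requires the `minigrid` package to register the env id):
# env = TimeoutPerturbation(gym.make("MiniGrid-Empty-8x8-v0"), patience=50)
```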
@GoAbiAryan
Abi Aryan
23 days
AbideGym was created to address exactly that: a dynamic wrapper for the popular MiniGrid environment that turns static RL benchmarks into adaptive, evolving challenges.
1
0
1
@GoAbiAryan
Abi Aryan
23 days
One of the biggest open challenges in RL is adaptation: how well an agent adjusts when the world around it changes. Most RL agents today look brilliant in static environments but collapse when the rules shift even slightly, as they do in real-world settings.
1
0
4
@GoAbiAryan
Abi Aryan
24 days
Without system-level understanding, you make mistakes like:
– optimizing kernels when the bottleneck is the data pipeline
– fusing ops when the real bottleneck is NCCL comm
– switching to FP16 when you are compute-bound, not memory-bound
Tool choice should be the last step, not the
0
1
4
@GoAbiAryan
Abi Aryan
24 days
All done before:
– profiling
– understanding compute vs memory vs IO limits
– checking if the GPU is even starved
GPU engineering is not "do CUDA." It is: model the system → profile → intervene with the smallest lever.
1
0
3
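A minimal sketch of what "profile first" can look like (my assumption of a starting point, not the thread's recipe; the toy model and sort key are placeholders):

```python
# Let the profiler say where the time goes before choosing a tool: if host-side
# data prep or launch overhead dominates, a hand-written kernel won't help.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(64, 1024)

with profile(activities=[ProfilerActivity.CPU],   # add ProfilerActivity.CUDA on GPU
             record_shapes=True) as prof:
    for _ in range(10):
        model(x)

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```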
@GoAbiAryan
Abi Aryan
24 days
Here's how you spot people who don't get it - ask them why they did it. Some patterns: 1. We are writing a custom CUDA kernel because top labs do it. 2. We use FlashAttention because that's state of the art and XYZ paper said so. 3. Let's rewrite the dataloader in C++ for speed. Wrong!
2
0
4
@GoAbiAryan
Abi Aryan
24 days
In WWE terms, it's called cargo-cult engineering, i.e. copying the surface actions of experts without the causal model that made those actions correct.
1
1
3
@GoAbiAryan
Abi Aryan
24 days
Most engineers think the path is: learn PyTorch → learn CUDA → then you're valuable. Wrong. The actual path is: understand the system → find the bottleneck → then choose the tool. Using tools without a system model is just mindless mimicry.
1
0
6
@GoAbiAryan
Abi Aryan
30 days
why models don't get hallucination right.. there's a fundamental difference in catastrophic forgetting between humans and models.. e.g. as of today, I am forgetting what I am forgetting.. but my brain remembers that I am forgetting something.. but when it comes
0
0
4
@GoAbiAryan
Abi Aryan
1 month
A 1 petaflop desktop workstation that's blurring the boundary between developer and datacenter 💪. Yes, I am talking about the Nvidia DGX Spark 🙈 I'd love to benchmark it for real-world LLMOps, GPU scheduling, and distributed inference pipelines and all the stuff that actually
1
0
6