Venkat Raman — inference/acc @venkat_systems X Profile

Venkat Raman — inference/acc

@venkat_systems

Followers

347

Following

3K

Media

89

Statuses

1K

distributed systems, low latency, inference, cuda | 🦀 | hobbies: ⛷️ 🏊🏽‍♂️ 📷

https://t.co/XsbWyD28dN

µs, ns, 80% speed-of-light

Joined January 2013

Venkat Raman — inference/acc

@venkat_systems

16 hours

inspired by @uccl_proj n thunderkittens by @HazyResearch 🙏, my attempt at successfully challenging nvidia nccl perf:
- starting with single process p2p all_reduce sum on 2xA100.
- mpi p2p, h100, b200 n b300 is wip.
- will oss soon once apis n abstractions are stable

0

2

kache

@yacineMTB

2 days

the gap between opus 4.5 and every other model is insane

160

51

2K

Venkat Raman — inference/acc

@venkat_systems

21 hours

huggingface tgi & mistral dot rs engines are written in rust. however, if we look at intranode tp inference, they use uds. they also use protobuf over pickle. but vllm & sglang (uses same parts of vllm) use cpu shmem. the former are leaving so much perf gains on the table for

Venkat Raman — inference/acc

@venkat_systems

2 days

vllm n sglang cpu-side engine overhead can be ns instead of µs, if written in c++ / rust.. this in turn will improve gpu util. 30-60% gain in sustained goodput / tco based on my local experiments, but it comes at a cost of research to production speed n researcher friendly

0

1

Venkat Raman — inference/acc

@venkat_systems

22 hours

or maybe i lack abundance mindset 😅

0

Venkat Raman — inference/acc

@venkat_systems

22 hours

i might be late to this.... in last 15 years, there are different generations of support, lead gen, crm, marketing, sales tools... let's say 2-3 players per market segment and per geo region. still there are so many startups in this space... in ai era, this is on speed feels

2

0

Venkat Raman — inference/acc

@venkat_systems

2 days

is anyone from openai still using codex cli internally ? i miss using it.. i want to, but every time it just shits the bed 😭😭

0

Venkat Raman — inference/acc

@venkat_systems

2 days

https://t.co/MM5uNqG7rO

Venkat Raman — inference/acc

@venkat_systems

2 days

@thorstenball i was thinking along these lines too. talk is cheap -> code is cheap is so surreal.. enabled by VC & labs subsidizing vibe coding. all the oss agent frameworks are fungible.. infra on the other hand is not. basic app layer infra is starting to get fungible too… next 2-3 years

0

Venkat Raman — inference/acc

@venkat_systems

2 days

aged like wine ! (just a day old though 😜)

Anthropic

@AnthropicAI

2 days

Anthropic is acquiring @bunjavascript to further accelerate Claude Code’s growth. We're delighted that Bun—which has dramatically improved the JavaScript and TypeScript developer experience—is joining us to make Claude Code even better. Read more:

1

0

Matej Sirovatka

@m_sirovatka

4 days

After 3 weeks, we have concluded our first problem of the @GPU_MODE x @nvidia competition, NVFP4 GEMV. Thanks to everyone who has participated, we have collected over 40k submissions from >200 users. Congrats to the winners and good luck with the next problem, NVFP4 GEMM 🔥

7

15

181

Venkat Raman — inference/acc

@venkat_systems

5 days

I agree @claudeai ASCII charts are bangers !

Ash Vardanian

@ashvardanian

6 days

With some Claude-generated ASCII charts, StringZilla now looks quite competitive for:
1. non-cryptographic hashing (vs xxHash, aHash?)
2. exact substring and byte-set search (vs memchr?)
3. UTF-8 tokenization (vs standard libs, regex, ICU?)
Will update StringWars benchmark

0

2

Venkat Raman — inference/acc

@venkat_systems

7 days

update: found out that this optimization is not really valid as it doesn’t adhere to challenge spirit.. there is a benchmarking bug… reverted it with @m_sirovatka's help. @GPU_MODE discord is amazing 🙏🏽 wish i started there sooner than just their youtube channel

Venkat Raman — inference/acc

@venkat_systems

8 days

finally cursor <> gemini 3 pro worked for me... helped to beat my personal best 55 µs --> 24.4 µs gpu mode's Blackwell NVFP4 Kernel Hackathon finally in top 12 now 😅🎉

0

3

Venkat Raman — inference/acc

@venkat_systems

7 days

gemini 3 pro is also suffering from infinite loops similar to 2.5 pro

0

1

Venkat Raman — inference/acc

@venkat_systems

8 days

CUDA moat also comes from well designed abstractions and APIs that are backwards compatible.. let's take NCCL for example.. it supports different execution modes, topologies, diverse hw interconnects (nvlink, pcie, etc.). one can get started quickly and take things to

Venkat Raman — inference/acc

@venkat_systems

19 days

@zephyr_z9 CUDA moat is same as python moat.. ecosystem n researchers n gpu engineers love it.. actual programmability.. nvidia provides low level apis so that u can write better n more performant versions of their high order frameworks.. i agree with u automated, manual or ai assisted

0

Venkat Raman — inference/acc

@venkat_systems

9 days

5 hours with cursor <> 4.5 opus Kernel #1 - NVFP4 Batched GEMV beat my personal best 106 µs --> 55 µs gpu mode's Blackwell NVFP4 Kernel Hackathon

0

Venkat Raman — inference/acc

@venkat_systems

9 days

HFT, system software, gpu kernels dev will teach you that 10µs is actually a really long time https://t.co/fSZikc6218

Dinesh

@isDineshHere

10 days

Database System development will make you realise that 1 millisecond is actually a really really really long time ~300 tx/ms TigerBeetle vs 1 tx/ms PostgreSQL https://t.co/m4WbFfEPzO

0

1

Venkat Raman — inference/acc

@venkat_systems

9 days

gemini-3-pro is cool, but it is unusable in cursor, i'm already on cli waitlist any better way to use it productively ? i guess google is still scaling up infra for this model

0