Stew Tong

@stewtong

Followers: 309 · Following: 2K · Media: 73 · Statuses: 943

Big Papa. Music and muscle-ups. Tech sales and Tequila. AI @ AWS. Building self-hosted LLMs and apps.

Los Angeles, CA
Joined April 2009
@stewtong
Stew Tong
18 days
First time I've pushed an LLM deploy this far with a full benchmark and writeup. Sharing a gotcha so nobody else burns a Saturday: SM120 (RTX PRO 6000 Blackwell Server Edition) trips SGLang's default FP8 GEMM backends (DeepGEMM + CUTLASS) on MiniMax-M2.5, since those kernels are gated to SM90/SM100.
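Not from the thread itself, but a quick pre-flight check along these lines can save that Saturday. A minimal sketch, assuming PyTorch on the serving host; SM120 reports compute capability (12, 0), and the SM90/SM100 gating is taken from the tweet above, not from an authoritative SGLang support matrix:

```python
import torch

def fp8_gemm_kernels_gated() -> bool:
    """True if this GPU falls outside the SM90/SM100 window the default
    FP8 GEMM backends (DeepGEMM/CUTLASS) are described as gated to."""
    major, minor = torch.cuda.get_device_capability(0)
    return (major, minor) not in {(9, 0), (10, 0)}

if fp8_gemm_kernels_gated():
    print("Compute capability", torch.cuda.get_device_capability(0),
          "- expect the default FP8 GEMM backends to assert; "
          "plan on a different FP8 GEMM backend before serving.")
```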
@stewtong
Stew Tong
6 days
Head-to-head vs M2.5 on the same hardware:
Qwen3.5-122B: 1,985 burst tok/s, 310 online tok/s @ 4 rps, 6.99/10 Arena-Hard
M2.5: 1,818 burst tok/s, 404 online tok/s @ 4 rps, 4.94/10 Arena-Hard
Qwen3.5-122B wins on burst throughput and quality. M2.5 still wins on sustained serving, largely
github.com
Benchmarks of MiniMax-M2.5 (228B MoE) running FP8 inference on SM120 (RTX PRO 6000 Blackwell) via SGLang. I haven't seen prior published SM120 numbers for this model. Includes: SM120 ba...
@stewtong
Stew Tong
6 days
FP8 KV cache on Qwen3.5-122B doesn't crash, but it silently produces corrupt output. No error, no warning. Just exclamation marks and repetition instead of answers. bf16 KV fixes it. I benchmarked this on 8x RTX PRO 6000 Blackwell Server Edition (SM120, AWS g7e.48xlarge) with
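For anyone who wants to poke at the KV-cache comparison, here is a rough offline-engine sketch. It assumes sglang is installed, that Engine accepts the same kv_cache_dtype values as the server CLI ("auto", "fp8_e5m2", "fp8_e4m3"), and the model path is illustrative:

```python
import sglang as sgl

# FP8 KV cache: loads and runs, but per the test above it silently produced
# exclamation marks and repetition instead of answers.
# llm = sgl.Engine(model_path="Qwen/Qwen3.5-122B", tp_size=8,
#                  kv_cache_dtype="fp8_e5m2")

# bf16 KV cache ("auto" follows the model dtype): correct output.
llm = sgl.Engine(model_path="Qwen/Qwen3.5-122B", tp_size=8,
                 kv_cache_dtype="auto")
print(llm.generate("Name three FP8 pitfalls.", {"max_new_tokens": 64}))
```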
@stewtong
Stew Tong
17 days
SW stocks beaten down 🤬 Agents exploding into the hands of non-tech humans means more open backdoors and more front-door attacks. Today it's OpenClaw and we're just getting started. This pattern only sharpens the need for cyber solutions like $CRWD $PANW $ZS $NET $S to
@stewtong
Stew Tong
18 days
M2.5 had a lot of us in the lab this week! Dope work as always on H200 + vLLM. Just posted a complementary datapoint from the SM120 (Blackwell Server Edition) lane: in SGLang, the default FP8 GEMM backends (DeepGEMM/CUTLASS) can assert/crash due to SM90/SM100 gating. Switching FP8 GEMM
@SemiAnalysis_
SemiAnalysis
19 days
How efficient is MiniMax M2.5? We benchmarked on 8xH200 TP8 with @vllm_project. At a reasonable 10-25s TTFT, M2.5 is able to sustain ~2,500 tok/s/GPU throughput. For decode, it's still possible to reach ~20 tok/s/GPU throughput at a strict 20 tok/s/user interactivity with 10K+
@stewtong
Stew Tong
20 days
@AnthropicAI @OpenAI @cursor_ai Someone told me @bcherny might find this interesting re skills
@stewtong
Stew Tong
21 days
@AnthropicAI @OpenAI @cursor_ai Full implementation guide here:
@stewtong
Stew Tong
21 days
@AnthropicAI @OpenAI @cursor_ai This builds on my AI Native Knowledge Work article https://t.co/fRFmbjnNb3
@stewtong
Stew Tong
21 days
Honest question: how long until you make this whole thread irrelevant lol @AnthropicAI @OpenAI @cursor_ai
@stewtong
Stew Tong
21 days
I run enterprise AI GTM. I coded this entire system in a terminal. A year ago those two sentences didn't belong in the same paragraph. This pattern works with any AI CLI tool with local file access, like Codex, OpenCode, or Cursor. Full implementation guide with pseudo-code, file
@stewtong
Stew Tong
21 days
What doesn't work: the system can load the wrong context at times. You'll need to tune your digital twin as your role changes. And critical thinking? Still on you. The AI is infrastructure, not judgment. But for knowledge work that requires juggling multiple contexts across many
@stewtong
Stew Tong
21 days
Real impact after 8 months:
• Customer prep time: 1hr → 5min
• Context switching time: 20min → 30sec
The ROI isn't in the tool. It's in the compounding context.
@stewtong
Stew Tong
21 days
When I start the next session: /load-work
The system reads my identity, recent memory, and current priorities. It knows what I was working on Friday, what's due this week, and which customers need attention. 30 seconds. No Slack archaeology. No "where was I?"
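If you want the shape of this step outside Claude Code, here's a hand-rolled sketch of what a /load-work command could assemble. The file names (twin.md, memory.md, priorities.md) are illustrative, not the actual layout from the thread:

```python
from pathlib import Path

def load_work(root: str = ".") -> str:
    """Stitch identity, recent memory, and priorities into one context blob."""
    parts = []
    for name in ("twin.md", "memory.md", "priorities.md"):
        p = Path(root) / name
        if not p.exists():
            continue
        text = p.read_text()
        if name == "memory.md":
            # Only the tail of the log, so the loaded context stays small.
            text = "\n".join(text.splitlines()[-200:])
        parts.append(f"## {name}\n{text}")
    return "\n\n".join(parts)

print(load_work())
```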
@stewtong
Stew Tong
21 days
Once your digital twin exists, the memory system kicks in. After every work session: /save-work "session summary"
Behind the scenes:
1. Append to project/memory.md with timestamp
2. If https://t.co/1tIXW1RWBJ > 50KB, rotate to archive
3. Extract key insights to weekly file
4.
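Steps 1 and 2 are simple enough to sketch in a few lines of Python. Treat this as a shape under assumptions, not the thread's exact implementation; the archive naming is invented here, and the weekly-insights step (3) is left out:

```python
from datetime import datetime
from pathlib import Path

MEMORY = Path("project/memory.md")
LIMIT = 50 * 1024  # the 50KB rotation threshold from step 2

def save_work(summary: str) -> None:
    MEMORY.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with MEMORY.open("a") as f:                # step 1: timestamped append
        f.write(f"\n## {stamp}\n{summary}\n")
    if MEMORY.stat().st_size > LIMIT:          # step 2: rotate to archive
        archive = MEMORY.with_name(f"memory-{datetime.now():%Y%m%d}.md")
        MEMORY.rename(archive)
        MEMORY.write_text(f"Rotated to {archive.name} on {stamp}\n")

save_work("Demoed SM120 benchmarks to a FinServ customer; follow up Tuesday.")
```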
@stewtong
Stew Tong
21 days
Here's what a digital twin looks like for a GTM specialist:
Role & Products
- Enterprise Sales, AI Platform
- Products: Name, key differentiators, roadmap
- Target: Fortune 500, Digital Natives
Current Focus
- Q1 pipeline: Enterprise deals in FinServ
- Product launch: Extended
@stewtong
Stew Tong
21 days
The digital twin is a structured markdown file. It answers:
- Who are you?
- What do you do?
- What are you working on?
- How should the AI help you?
In Claude Code, https://t.co/W91OEe43O3 is your lightweight system config (auto-loaded, <1000 tokens). Your digital twin lives
code.claude.com
Claude Code is an agentic coding tool that reads your codebase, edits files, runs commands, and integrates with your development tools. Available in your terminal, IDE, desktop app, and browser.
@stewtong
Stew Tong
21 days
The system has two layers:
Layer 1: Define your digital twin once (1-2 hours of focused work)
Layer 2: Create a memory system to let context compound session by session
Most people skip Layer 1 and wonder why their AI assistant feels generic. The AI doesn't know what you do,
@stewtong
Stew Tong
21 days
After 8 months building my GTM operating system on Claude Code, the same question kept coming back: "How does someone start without 8 months of accumulated context?" Here's the truth: managing multiple enterprise customers, product lines, and strategic initiatives used to mean
@stewtong
Stew Tong
27 days
J Cole one-shotted The Fall-Off using OpenClaw. Unreal