N8 Programs
@N8Programs
Followers
7K
Following
4K
Media
508
Statuses
4K
Studying Applied Mathematics and Statistics at @JohnsHopkins. Studying In-Context Learning at The Intelligence Amplification Lab.
Proxima Centauri B
Joined September 2022
LLMs struggle to count "r"s in strawberry due to tokenizers. But how significant is this limitation? Does it depend on model scale? Can it be overcome with ICL? How easy is training a model that can count characters? I answer all these questions + more in my latest Substack 👇
1
5
14
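A minimal sketch of the tokenizer point in the post above, assuming the Hugging Face `transformers` package and the GPT-2 tokenizer as a stand-in (not the tokenizer of any particular production model):

```python
# Sketch only: shows why character counts are hidden from an LLM.
# Assumes `pip install transformers`; GPT-2's tokenizer is a stand-in here.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
word = "strawberry"

print(tok.tokenize(word))   # a few subword pieces, not ten characters
print(word.count("r"))      # 3 - trivial once you operate on characters directly
# The model only ever receives IDs for the subword pieces, so the three 'r's
# are never explicitly present in its input - hence the counting failures.
```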
broke: using a pre-built SFT framework for cloud training on an H100 woke: make a new MLX script for every project that re-implements the same thing with slight differences and yolo it on the H100
0
0
6
Can’t wait!
@N8Programs yeah we had to move on to the next model and make efficiency improvements so we can do a 4-week training run in 1 week next time :D
0
0
1
Notable that many of the evals don't appear to have plateaued - with more compute, they could push this even further - the model is likely nowhere near its ceiling.
Olmo 3.1 32B Think shows that not just frontier labs can scale RL. My favorite RL run yet over 7+ years of doing RL. The biggest fully open RL run ever? We left the same RL job running from our v3 Think for an extra 3 weeks. When we were releasing Olmo 3 32B on Nov. 20th we had
2
1
20
Immediate takeaway for GPT-5.2 is how long it works - here it spent 32 min on Slides:
0
1
8
two years ago it was hard to imagine that models were going to do grade school math much less completely saturate AIME. they couldn’t do the most basic problems, it was actually alarming. people forget
85
90
2K
Pretty clear the latest Nov/Dec 2025 family of AI reasoning systems has significantly improved fluid intelligence over the knowledge domains they were trained on (e.g. code). Big step up from 6-9 months ago. ARC-AGI shows as much improvement.
4
6
114
To summarize: ChatGPT (or any LLM) can both "know things" and "make things up".
@katherineveritt This is demonstrably false. Even just autocompleting sentences is enough to know things - if a model has lower perplexity (confusion) on: "The president of the US during the Civil War was Abraham Lincoln." (Perplexity: 12.375) "The president of the US during the Civil War was
1
0
3
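A minimal sketch of the perplexity comparison above, using GPT-2 as a stand-in model (the post doesn't say which model produced the 12.375 figure):

```python
# Sketch: lower perplexity on the true completion is the sense in which a
# pure autocompleter can be said to "know" a fact. GPT-2 is a stand-in model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels=ids makes the model return mean cross-entropy over the sequence
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

true_claim  = "The president of the US during the Civil War was Abraham Lincoln."
false_claim = "The president of the US during the Civil War was George Washington."
print(perplexity(true_claim), perplexity(false_claim))
# Expect the true sentence to score lower (less "confusion") than the false one.
```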
Note: all numbers in this thread are reported using **6-bit quants**, as 4-bit quants (without any sort of learned quantization) tend to hurt agentic abilities/long-form reasoning.
After a bit of work (FP8 dequant, some debugging), got Mistral Vibe working on my M3 Max w/ Devstral 2 Small, running locally w/ LM Studio. It worked quite well in a quick demo, as shown below (everything below is happening 100% locally) - video at 4x speed:
1
0
17
Clarification - this is for *laptops*. An M3 Ultra will have no such issues.
The main drawback (at least on Metal) is the slow speed. While 24Bs can run at 20 tok/sec, this requires the entire GPU and is unsustainable for more than a few minutes. Typically, for long-running tasks, a Mac must be set to 'Low Power Mode' - and here, speed drops to 7 tok/sec.
2
0
10
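Rough back-of-envelope for the numbers above (my own estimate, not from the post): dense-model decode on Apple Silicon is roughly memory-bandwidth-bound, so tok/sec ≈ bandwidth / bytes of weights read per token. Assuming ~400 GB/s for an M3 Max at full power and a much lower effective bandwidth when power-limited:

```python
# Back-of-envelope decode-speed estimate for a dense model (sketch, assumed numbers).
params = 24e9                # Devstral 2 Small: ~24B parameters
bits_per_weight = 6          # the 6-bit quants used for numbers in this thread
weight_bytes = params * bits_per_weight / 8   # ~18 GB of weights read per decoded token

for bandwidth_gbs, label in [(400, "M3 Max, full power (assumed)"),
                             (150, "power-limited (assumed)")]:
    tok_per_sec = bandwidth_gbs * 1e9 / weight_bytes
    print(f"{label}: ~{tok_per_sec:.0f} tok/sec")
# Prints roughly ~22 and ~8 tok/sec - in the ballpark of the observed 20 and 7.
```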
As a bonus, have a video (not sped up) of Qwen3-Coder-30B-A3B solving the same task I gave Devstral 2:
0
2
10
My verdict: Devstral 2 is an extremely impressive model performance-wise, but decoding speed severely limits its utility on Mac *specifically* - where MoEs are best. It is far more practical if you have a 5090, which can fit it in VRAM and decode w/ FP8 accel at 70+ tok/sec.
1
0
12
What I am incredibly optimistic about is that squeezing this sort of perf out of a 24B model is no easy feat (knockout job by Mistral), and it indicates similar perf could be given to an MoE in this size range (i.e. a 30B-3B could reach this level with the right post-training).
2
0
9
While slow prompt processing is alleviated by more compute (M5, any NVIDIA card), the single-stream decode speed is not. It's far lower than what a similarly sized MoE offers, and makes upgrading non-obvious. If you can run Qwen3-Coder-30B-A3B 3-5x faster, is Devstral 2 Small worth +15% on
1
1
11
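Same bandwidth-bound sketch as above, contrasting the active parameters of a dense 24B with a 30B-A3B MoE (illustrative numbers only; real gaps are smaller because of routing overhead, KV-cache reads, and prompt processing):

```python
# Sketch: why an MoE with few active parameters decodes much faster (assumed numbers).
def decode_tok_per_sec(active_params: float, bits_per_weight: int, bandwidth_gbs: float) -> float:
    bytes_per_token = active_params * bits_per_weight / 8   # weights touched per decoded token
    return bandwidth_gbs * 1e9 / bytes_per_token

bw = 400  # assumed GB/s of memory bandwidth
print(decode_tok_per_sec(24e9, 6, bw))  # dense ~24B (Devstral 2 Small): ~22 tok/sec
print(decode_tok_per_sec(3e9, 6, bw))   # ~3B active (e.g. Qwen3-Coder-30B-A3B): ~180 tok/sec
# Even with real-world overheads shrinking the gap, this is the multiple being
# weighed against Devstral's quality edge.
```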