N8 Programs

@N8Programs

Followers
7K
Following
4K
Media
508
Statuses
4K

Studying Applied Mathematics and Statistics at @JohnsHopkins. Studying In-Context Learning at The Intelligence Amplification Lab.

Proxima Centauri B
Joined September 2022
@N8Programs
N8 Programs
14 hours
LLMs struggle to count "r"s in strawberry due to tokenizers. But how significant is this limitation? Does it depend on model scale? Can it be overcome with ICL? How easy is training a model that can count characters? I answer all these questions + more in my latest Substack 👇
1
5
14
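To see the tokenizer issue directly, here is a minimal sketch assuming the `tiktoken` package, with `cl100k_base` purely as an example vocabulary:

```python
# Minimal sketch: why "count the r's in strawberry" is awkward for an LLM.
# Assumes the tiktoken package; cl100k_base is just an example vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "strawberry"

token_ids = enc.encode(word)
pieces = [enc.decode([tid]) for tid in token_ids]

# The model consumes multi-character pieces, not individual letters,
# so per-character counts are never directly visible to it.
print(pieces)            # e.g. ['str', 'aw', 'berry']
print(word.count("r"))   # 3 -- trivial once you operate on characters
```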
@N8Programs
N8 Programs
1 day
The rivers of code running when Claude 5 Finnegan wakes up
@chatgpt21
Chris
1 day
Claude 4.5 Opus has had an immaculate reception among developers. Now they’re about to get a 23B compute training upgrade in 2026 and 2027 to train Claude Large. They should call it Claude 5.5 Omega
0
0
6
@N8Programs
N8 Programs
1 day
broke: using a pre-built SFT framework for cloud training on an H100
woke: make a new MLX script for every project that re-implements the same thing with slight differences and yolo it on the H100
0
0
6
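A toy version of what such a throwaway MLX script boils down to (a minimal sketch; the model, shapes, and data here are made up for illustration, not taken from any real project in the thread):

```python
# Toy "re-implement SFT yet again" loop in MLX. Everything here is illustrative:
# a real script would load a pretrained model and tokenized data instead.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, dims: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dims)
        self.out = nn.Linear(dims, vocab_size)

    def __call__(self, tokens):
        return self.out(self.embed(tokens))

def loss_fn(model, inputs, targets):
    logits = model(inputs)
    return nn.losses.cross_entropy(logits, targets).mean()

model = TinyLM(vocab_size=256, dims=64)
optimizer = optim.Adam(learning_rate=1e-3)
step = nn.value_and_grad(model, loss_fn)

# Fake next-token-prediction batch: targets are the inputs shifted by one token.
data = mx.random.randint(0, 256, (8, 33))
inputs, targets = data[:, :-1], data[:, 1:]

for it in range(10):
    loss, grads = step(model, inputs, targets)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)
    print(it, loss.item())
```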
@N8Programs
N8 Programs
2 days
Can’t wait!
@natolambert
Nathan Lambert
2 days
@N8Programs yeah we had to move onto the next model and make efficiency improvements so we can do a 4 week training run in 1 week next time :D
0
0
1
@N8Programs
N8 Programs
2 days
Notable that many of the evals don't appear to have hit a ceiling - with more compute, they could push this even further; the model is likely nowhere near saturated.
@natolambert
Nathan Lambert
2 days
Olmo 3.1 32B Think shows that not just frontier labs can scale RL. My favorite RL run yet over 7+ years of doing RL. The biggest fully open RL run ever? We left the same RL job running from our v3 Think for an extra 3 weeks. When we were releasing Olmo 3 32B on Nov. 20th we had
2
1
20
@N8Programs
N8 Programs
2 days
Extremely impressive.
@allen_ai
Ai2
2 days
Olmo 3.1 is here. We extended our strongest RL run and scaled our instruct recipe to 32B—releasing Olmo 3.1 Think 32B & Olmo 3.1 Instruct 32B, our most capable models yet. 🧵
0
0
6
@N8Programs
N8 Programs
2 days
@kalomaze
kalomaze
2 days
@viemccoy fact flood: the worst way of measuring performance is also the best way
0
2
5
@N8Programs
N8 Programs
2 days
Immediate takeaway for GPT-5.2 is how long it works - here it spent 32 min on Slides:
0
1
8
@tszzl
roon
2 days
two years ago it was hard to imagine that models were going to do grade school math much less completely saturate AIME. they couldn’t do the most basic problems, it was actually alarming. people forget
85
90
2K
@N8Programs
N8 Programs
2 days
Another qualitative jump in intelligence.
@OpenAI
OpenAI
3 days
On GDPval, an eval measuring well-specified knowledge work tasks across 44 occupations, GPT-5.2 Thinking is our first model that performs at a human expert level. These tasks include making presentations, spreadsheets, and other artifacts.
1
0
1
@mikeknoop
Mike Knoop
3 days
Pretty clear the latest Nov/Dec 2025 family of AI reasoning systems have significantly improved fluid intelligence over knowledge domains they were trained on (eg. code). Big step up from 6-9 months ago. ARC-AGI shows as much improvement.
@nbashaw
Nathan Baschez
4 days
the downstream effects of claude 4.5 opus will be studied
4
6
114
@N8Programs
N8 Programs
3 days
To summarize: ChatGPT (or any LLM) can both "know things" and "make things up".
@N8Programs
N8 Programs
3 days
@katherineveritt This is demonstrably false. Even just autocompleting sentences is enough to know things - if a model has lower perplexity (confusion) on: "The president of the US during the Civil War was Abraham Lincoln." (Perplexity: 12.375) "The president of the US during the Civil War was
1
0
3
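A minimal sketch of the comparison being described, assuming `transformers` + `torch` with GPT-2 as a small stand-in (the 12.375 figure came from a different model, and the false completion below is made up since the original post is cut off):

```python
# Compare perplexity of a true vs. a false completion. The specific value in the
# post came from a different model; GPT-2 is just a small, convenient stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token cross-entropy
    return torch.exp(loss).item()

true_sentence = "The president of the US during the Civil War was Abraham Lincoln."
# Hypothetical wrong completion, chosen only for contrast.
false_sentence = "The president of the US during the Civil War was George Washington."

# A model that "knows" the fact assigns lower perplexity to the true sentence.
print(perplexity(true_sentence), perplexity(false_sentence))
```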
@N8Programs
N8 Programs
4 days
GitHub exemption is v. important. Critical that young minds have access.
@BNODesk
BNO News Live
4 days
Australia's social media ban for teens under 16 is now in effect, making it the first country to do so
1
0
12
@N8Programs
N8 Programs
4 days
Note: all numbers in this thread are reported using **6-bit quants**, as 4-bit quants (without any sort of learned quantization) tend to hurt agentic abilities/long-form reasoning.
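For reference, a 6-bit MLX quant can be produced along these lines (a minimal sketch assuming `mlx_lm`'s `convert` API; the repo id is a placeholder, not the exact weights used in the thread):

```python
# Minimal sketch: produce a 6-bit (vs. the default 4-bit) MLX quant with mlx_lm.
from mlx_lm import convert

convert(
    hf_path="mistralai/Devstral-2-Small",  # placeholder repo id
    mlx_path="devstral-2-small-6bit",
    quantize=True,
    q_bits=6,          # 6-bit weights, per the note above
    q_group_size=64,   # mlx_lm's default group size
)
```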
@N8Programs
N8 Programs
4 days
After a bit of work (FP8 dequant, some debugging), got Mistral Vibe working on my M3 Max w/ Devstral 2 Small, running locally w/ LMStudio. It worked quite well in a quick demo, as shown below (everything below is happening 100% locally) - video at 4x speed:
1
0
17
@N8Programs
N8 Programs
4 days
Clarification - this is for *laptops*. An M3 Ultra will have no such issues.
@N8Programs
N8 Programs
4 days
The main drawback (at least on Metal) is the slow speed. While 24B models can run at 20 tok/sec, this requires the entire GPU and is unsustainable for more than a few minutes. Typically, for long-running tasks, a Mac must be set to 'Low Power Mode' - and here, speed drops to 7 tok/sec.
2
0
10
@N8Programs
N8 Programs
4 days
As a bonus, have a video (non-sped up) of Qwen3-Coder-30B-A3B solving the same task I gave Devstral 2:
0
2
10
@N8Programs
N8 Programs
4 days
My verdict: Devstral 2 is an extremely impressive model performance-wise, but decoding speed severely limits its utility on Mac *specifically* - where MoEs are best. It is far more practical if you have a 5090, which can fit it in VRAM and decode w/ FP8 accel at 70+ tok/sec.
1
0
12
@N8Programs
N8 Programs
4 days
What I am incredibly optimistic about is that squeezing this sort of perf out of a 24B model is no easy feat (knockout job by Mistral), and it indicates similar perf could be given to an MoE in this size range (i.e. a 30B-A3B could reach this level with the right post-training).
2
0
9
@N8Programs
N8 Programs
4 days
While the prompt-processing issue is alleviated by more compute (M5, any NVIDIA card), the single-stream decode speed is not. It's far less than what a similarly sized MoE offers, and makes upgrading non-obvious. If you can run Qwen3-Coder-30B-A3B 3-5x faster, is Devstral 2 Small worth +15% on
1
1
11
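Rough arithmetic behind that tradeoff (a back-of-the-envelope sketch; the bandwidth and quantization constants are assumptions, not measurements from this thread):

```python
# Back-of-the-envelope decode-speed ceiling: one pass over the active weights
# per generated token, limited by memory bandwidth. All constants are assumptions.
BANDWIDTH_BYTES_PER_S = 400e9   # assumed ~M3 Max-class memory bandwidth
BYTES_PER_PARAM = 0.75          # ~6-bit quantized weights

def decode_ceiling_tok_per_s(active_params_billions: float) -> float:
    active_bytes = active_params_billions * 1e9 * BYTES_PER_PARAM
    return BANDWIDTH_BYTES_PER_S / active_bytes

print(f"Dense 24B   : ~{decode_ceiling_tok_per_s(24):.0f} tok/s ceiling")   # ~22
print(f"MoE 30B-A3B : ~{decode_ceiling_tok_per_s(3):.0f} tok/s ceiling")    # ~178
```

Under these assumptions the dense ceiling lands near the ~20 tok/sec observed earlier in the thread, while a 3B-active MoE has several times more headroom on the same hardware.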