Maxim Khomiakov
@maximkhv
325 Followers · 787 Following · 6 Media · 97 Statuses
managing context windows
Copenhagen · Joined May 2018
"how can flash beat pro??" -> the answer is RL! flash is not just a distilled pro. we've had lots of exciting research progress on agentic RL which made its way into flash but was too late for pro. can't wait to finally bring them to pro👀
116 replies · 267 reposts · 4K likes
agree, it's truly a great release. I'd recommend @natolambert's NeurIPS talk too
Olmo 3 is one of the most valuable open research artifacts to ever be released. Although Olmo 3 models are slightly behind state-of-the-art, their value goes beyond the models themselves. The artifacts for Olmo 3 give anyone the ability to conduct rigorous experiments with…
1 reply · 1 repost · 3 likes
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
11 replies · 60 reposts · 451 likes
Found a bug with the help of this tool just recently; can recommend. The ability to drill down into your LLM traces in an easy fashion is quite useful. When you treat traces like training data, you may need to view them in a different way.
0 replies · 2 reposts · 4 likes
> The way humans think looks a lot more like diffusion than autoregressive.

i will never, ever understand this claim or the intuitions behind it. ah yes. the human mind is... learning a scoring function to... reverse gaussian noise... (?) ... spatially (???)
Naive question, so please roast me. Why don't we have diffusion reasoning models? The way humans think looks a lot more like diffusion than autoregressive.
57 replies · 9 reposts · 515 likes
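For context, a minimal sketch of the two generation paradigms this exchange is arguing about, with toy stand-ins for the models (dummy_lm and dummy_score are my placeholders, not anything from the thread): autoregressive decoding samples one token at a time from a learned conditional, while score-based diffusion starts from Gaussian noise and follows a learned score function back toward the data, which is exactly the "reverse gaussian noise" framing the reply finds so alien.

```python
import torch

def autoregressive_sample(model, ids, n_new):
    """Next-token factorization: p(x) = prod_t p(x_t | x_<t)."""
    for _ in range(n_new):
        logits = model(ids)                                  # (1, seq, vocab)
        probs = torch.softmax(logits[0, -1], dim=-1)
        ids = torch.cat([ids, torch.multinomial(probs, 1)[None]], dim=1)
    return ids

def diffusion_sample(score_model, shape, sigmas, steps_per_level=5):
    """Annealed Langevin sampling: start from pure noise and follow the
    learned score (gradient of log density) at decreasing noise levels."""
    x = torch.randn(shape) * sigmas[0]
    for sigma in sigmas:
        eps = 0.1 * sigma ** 2                               # step size per noise level
        for _ in range(steps_per_level):
            x = x + eps * score_model(x, sigma) + (2 * eps) ** 0.5 * torch.randn_like(x)
    return x

vocab = 100
dummy_lm = lambda ids: torch.randn(1, ids.shape[1], vocab)   # placeholder language model
dummy_score = lambda x, s: -x / (1 + s ** 2)                 # exact score for N(0, I) data + noise
print(autoregressive_sample(dummy_lm, torch.zeros(1, 1, dtype=torch.long), 5))
print(diffusion_sample(dummy_score, (4,), torch.linspace(1.0, 0.01, 10)))
```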
> There’s no free lunch.
> When you reduce the complexity of attention, you pay a price.
> The question is, where?

This is *exactly* how I typically end my Transformer tutorial. The slide is already 4 years old and I've never updated it, but it still holds:
MiniMax M2 Tech Blog 3: Why Did M2 End Up as a Full Attention Model? On behalf of pre-training lead Haohai Sun. ( https://t.co/WH4xOD9KrT) I. Introduction As the lead of MiniMax-M2 pretrain, I've been getting many queries from the community on "Why did you turn back the clock…
35 replies · 61 reposts · 901 likes
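To make the "where do you pay?" point concrete, here is a minimal sketch (my illustration, not the tutorial slide or MiniMax's code) comparing exact softmax attention with the ELU-feature-map linear attention of Katharopoulos et al.: the quadratic cost disappears, but the weights are no longer the exact softmax, so the price shows up as approximation error.

```python
import torch

def full_attention(q, k, v):
    # O(n^2): materialize every pairwise score
    scores = q @ k.T / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, phi=torch.nn.functional.elu):
    # O(n): associativity lets us compute phi(K)^T V once, never an n x n matrix
    q, k = phi(q) + 1, phi(k) + 1          # ELU+1 feature map (Katharopoulos et al.)
    kv = k.T @ v                           # (d, d_v)
    z = k.sum(0)                           # normalizer
    return (q @ kv) / (q @ z)[:, None]

n, d = 512, 64
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
err = (full_attention(q, k, v) - linear_attention(q, k, v)).abs().mean()
print(f"mean abs deviation from exact softmax attention: {err:.3f}")
```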
Huge breakthrough from DeepMind! In their latest Nature paper, “Discovering state-of-the-art reinforcement learning algorithms,” they show that AI can autonomously discover better RL algorithms. "Enabling machines to discover learning algorithms for themselves is one of the…
49 replies · 260 reposts · 2K likes
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
283 replies · 1K reposts · 11K likes
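A hedged sketch of why injectivity enables inversion; the brute-force loop and the toy cumulative-sum encoder below are my illustration (the paper has an efficient exact procedure), but the logic is the same: if no two prompts share a hidden state, the observed latent pins down each token in turn.

```python
import torch

def recover_prompt(encode, target_states, vocab_size, tol=1e-5):
    """encode(ids) -> (1, seq, d) hidden states; target_states: (seq, d)
    observed latents of an unknown prompt. Recover it token by token by
    checking which vocabulary item reproduces the observed state."""
    ids: list[int] = []
    for t in range(target_states.shape[0]):
        for cand in range(vocab_size):                    # try every next token
            h = encode(torch.tensor([ids + [cand]]))[0]   # (t+1, d)
            if torch.allclose(h[t], target_states[t], atol=tol):
                ids.append(cand)
                break
    return ids

# Toy injective "model": hidden state = cumulative sum of token embeddings.
emb = torch.randn(50, 8)
encode = lambda ids: torch.cumsum(emb[ids[0]], dim=0)[None]
secret = [3, 41, 7]
states = encode(torch.tensor([secret]))[0]
print(recover_prompt(encode, states, vocab_size=50) == secret)  # True
```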
Bai et al., "Positional Encoding Field": make your RoPE encoding 3D by including a z axis, then manipulate your image by simply manipulating your positional encoding in 3D --> novel view synthesis. Neat idea.
7 replies · 55 reposts · 480 likes
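A rough sketch of the idea as described in the tweet, under my own assumptions (the equal three-way channel split and the function names are mine, not necessarily the paper's recipe): give each channel group its own spatial axis, so shifting the z coordinate re-poses the query while leaving the content features untouched.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis. x: (..., d) with d even."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2) / d)       # (d/2,) frequencies
    ang = pos[..., None] * freqs                       # (..., d/2) rotation angles
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin               # rotate each feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, xyz):
    """x: (n, d) features with d divisible by 6, xyz: (n, 3) coordinates.
    Each third of the channels is rotated by its own spatial axis."""
    d = x.shape[-1] // 3
    parts = [rope_1d(x[:, i * d:(i + 1) * d], xyz[:, i]) for i in range(3)]
    return torch.cat(parts, dim=-1)

tokens = torch.randn(16, 48)
coords = torch.rand(16, 3)
shifted = coords.clone()
shifted[:, 2] += 0.5                                   # "move the camera" along z only
q_orig, q_moved = rope_3d(tokens, coords), rope_3d(tokens, shifted)
```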
We’re hiring research engineers, software engineers, silicon engineers, and operators at @NormalComputing. Join us as we rethink ASICs with Physics and AI. We have offices and are hiring across NYC, LON, CPH, and SF. We are looking for folks with backgrounds across RL/agents,…
4 replies · 4 reposts · 16 likes
prob one of the funniest LPs around. dealflow must be off the charts
I struggle to think of ONE exciting startup that's come out of the USA in the last 10 years All that country can produce is Ponzi schemes Meanwhile, Europe continues to lead the way in building safe, compliant companies that help people and the government
0 replies · 1 repost · 3 likes
"What do 1M and 500K context windows have in common? They are both actually 64K."
New post! This time, about the current state of Long Context Evaluation. I discuss existing benchmarks, what makes a good long context eval, and what's missing from existing ones, then introduce a new one: LongCodeEdit :)
31 replies · 65 reposts · 1K likes
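The joke has a measurable version: advertised context is whatever fits in the window, effective context is where task accuracy collapses. Below is a minimal needle-in-a-haystack-style probe, my generic sketch rather than the post's LongCodeEdit (which uses code-editing tasks instead); ask_model is a placeholder for whatever completion call you use.

```python
import random

def make_probe(n_filler_words: int, depth: float) -> tuple[str, str]:
    """Bury a retrievable fact at a relative depth inside filler text."""
    secret = str(random.randint(10**5, 10**6))
    filler = ["lorem"] * n_filler_words
    filler.insert(int(depth * n_filler_words), f"The magic number is {secret}.")
    prompt = " ".join(filler) + "\nWhat is the magic number?"
    return prompt, secret

def effective_context(ask_model, lengths=(8_000, 64_000, 500_000), trials=5):
    """Retrieval accuracy vs. context size; watch where the curve collapses."""
    scores = {}
    for n in lengths:
        hits = sum(
            secret in ask_model(prompt)
            for prompt, secret in (make_probe(n, random.random()) for _ in range(trials))
        )
        scores[n] = hits / trials
    return scores
```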
I’m really tired of labs pointing to SWE-bench as a sign that their new tiny model is SOTA. It's just a bunch of Python problems that have long since leaked into the training set. Tiny models can rarely replace larger ones under any context pressure in practice.
Introducing Claude Haiku 4.5: our latest small model. Five months ago, Claude Sonnet 4 was state-of-the-art. Today, Haiku 4.5 matches its coding performance at one-third the cost and more than twice the speed.
20 replies · 14 reposts · 271 likes
1M tokens are not all equal: sparse attention, etc.
So this is a first for me. I just had a pretty big refactoring session with codex-cli, and eventually it started going completely off the rails. It became very dumb, made bad mistakes, only followed half my instructions and made up the other half, misused tools in ways so stupid…
0 replies · 0 reposts · 0 likes