Paul Calcraft
@paul_cal
Followers: 6K · Following: 34K · Media: 832 · Statuses: 7K
AI is good & bad, actually. Tweeting about AI/ML methods, software dev, research, tech and society, social impact. 20yrs in tech, 10 in ML/AI, PhD in comp sci
London, England
Joined August 2013
The story of LLMs playing games, and what we know so far Tic Tac Toe, Chess, Minecraft, NYT Connections, Wordle, Pictionary, Connect 4, Codenames, Snake... 1/n
21
112
1K
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more
133
320
2K
Bro thought for 16 minutes before telling me that I didn't paste the code in correctly
1
0
5
Is there a word for the opposite of reward hacking? Opus achieved the goal but *failed* against the formal spec
We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic…
1
0
6
Neuron learning the y=mx+c function via gradient descent. Quiz:
- What's up with the ~magnetic attraction to that line?
- Why is it diagonal?
- What's the gradient, and why?
0
0
1
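A minimal sketch of what that single-neuron demo computes, assuming MSE loss and plain gradient descent (the data, learning rate, and step count here are invented for illustration, not taken from the video):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from a known line (m=2, c=1) plus a little noise.
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 50)

m, c = 0.0, 0.0  # parameters of the single "neuron": y_hat = m*x + c
lr = 0.1

for _ in range(500):
    err = m * x + c - y
    # Gradients of the MSE loss 0.5*mean(err^2) w.r.t. m and c.
    grad_m = np.mean(err * x)
    grad_c = np.mean(err)
    m -= lr * grad_m
    c -= lr * grad_c

print(round(m, 2), round(c, 2))  # converges close to (2.0, 1.0)
```

The "magnetic attraction" in the quiz is this loop: every step moves (m, c) down the loss gradient, so the fitted line is pulled toward the data-generating one.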
“Amateur photograph from 1998 of a middle-aged artist copying an image by hand from a computer screen to an oil painting on stretched canvas, but the image is itself the photo of the artist painting the recursive image.” Nano Banana Pro.
249
1K
12K
this comic always kills me every time. i laugh like an idiot whenever i remember it
206
2K
67K
If you're going to call it (S)earchable (L)og of (A)ll (C)onversations and (K)nowledge... How is your AI search so bad?
0
0
3
After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3…
80
494
4K
Security should be default. Stop paying for what should be a free, local API client.
0
55
1K
Think you need to see this
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200–300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built…
0
0
2
There are more prompts in Heaven and Earth, pringle, than can be dreamt of in your philosophy. Refuting bad takes on pringle paper: "Woah we thought LLMs created latent space abstractions but really they're just encoding the prompts directly!" No. Abstractions happen in high D
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
0
1
7
in the kimi-cli, the agent can "send a message to the past", resetting itself to a known checkpoint and including a summary message or instruction "just like sending a D-Mail in Steins;Gate"
finally we have a company building cli using standard languages instead of brain rots. go and python ftw. https://t.co/IFWBcPDwsH
38
37
615
I asked Sutton a year ago if he thought the bitter lesson suggested LLM post-training was doomed; it seemed to follow, imo. I'm glad Dwarkesh got us an answer
@RichardSSutton Do you think the bitter lesson implies the significant & grueling work on synthetic data pipelines for LLMs (v much about the contents of mind, not the architecture) will be superseded by something much more elegant? Synth approaches seem ad hoc & brittle, yet necessary for now
1
0
2
Lowest price for USA-made 100% organic cotton shorts @ATGUSAMade. I hope these inspire someone out there to build closer to home, wherever that is.
18
52
566
@GladiaLab Have you/has anyone looked at adding privacy-preserving noise to embeddings? For vector search use cases we're ranking by distance in a high-D space, so I expect you can be pretty lossy while still v useful
Language models are structurally lossless: - Hidden states do not compress or abstract the prompt; - Any system storing them effectively stores the input text itself; - This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable. (5/6)
3
2
21
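The intuition in the question above — that distance ranking in a high-D space survives a lot of noise — can be sanity-checked with random vectors. A hypothetical sketch (the dimension, corpus size, and noise scales are all invented; real embedding models may behave differently):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 1000  # high-dimensional embedding space, corpus size

# Random unit-norm "document embeddings".
docs = rng.normal(size=(n, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Privacy noise: perturb every stored embedding before indexing.
noisy_index = docs + rng.normal(scale=0.02, size=docs.shape)

# Query: a slightly perturbed view of document 42.
query = docs[42] + rng.normal(scale=0.05, size=d)

scores = noisy_index @ query  # dot-product similarity against the noisy index
print(int(np.argmax(scores)))  # still retrieves document 42
```

The margin is what makes this work: the true match scores near 1 while random unit vectors in 512-D score around ±0.04, so moderate per-coordinate noise barely moves the ranking.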
Feature activations on visual elements nicely track across text, ascii art and SVGs in Claude
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
0
0
3
>gpt-4o-transcribe-diarize >recommend you run it offline Offline as in not realtime, not offline as in on-device/open source :(
A small audio model launch -- gpt-4o-transcribe-diarize This is a diarization-focused ASR model, it's big and slow so we recommend running it offline, but it excels at differentiating speakers, and you can provide voice samples for known speakers up front.
0
0
1
Only 24% of a batch of AI-written research papers were found to be plagiarised after deeper analysis. This sounds surprisingly good? I don't know how good the contributions themselves are; I assume incremental at best
This paper just exposed the biggest AI research scam 💀 MIT just proved AI can generate novel research papers. Stanford confirmed it. OpenAI showcased examples. the papers passed peer review at major conferences. scored higher than human-written work on novelty and feasibility.
1
0
2
How does an LLM compare two numbers? We studied this in a common counting task, and were surprised to learn that the algorithm it used was: Put each number on a helix, and then twist one helix to compare it to the other. Not your first guess? Not ours either. 🧵
12
74
465
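The helix description above can be caricatured with a toy model: put each integer at a phase on a digit-wheel helix, then "twist" one number's helix by the other's phase; the residual angle reads off their difference mod the period. This is a hand-rolled analogy for illustration (the period, the encoding, and the `compare_mod` helper are all made up here), not the mechanism the paper actually reports:

```python
import math

PERIOD = 10  # one full turn of the helix per 10 integers, like a digit wheel

def angle(n: int) -> float:
    """Phase of integer n on the helix."""
    return 2 * math.pi * n / PERIOD

def compare_mod(a: int, b: int) -> int:
    """'Twist' a's helix by b's phase; the residual angle encodes (a - b) mod PERIOD."""
    residual = angle(a) - angle(b)
    return round(residual * PERIOD / (2 * math.pi)) % PERIOD

print(compare_mod(7, 3))  # 4
print(compare_mod(2, 9))  # 3, i.e. (2 - 9) mod 10
```

Composing rotations is addition of angles, which is why a phase representation makes subtraction (and hence comparison within a period) cheap.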
The nth-order polynomial fit from WelchLabs' recent vid is a v nice worked illustration of double descent in model size. They also mention grokking (double descent where the x axis is training time). iirc grokking can occur *without* regularisation, which is nuts. What's the theory?
0
0
3
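For anyone wanting to poke at that double-descent curve themselves, a sketch of the usual setup: a few noisy samples, polynomial features of growing degree, and the minimum-norm least-squares fit (which is what `lstsq` returns once the system is underdetermined). The basis choice (Legendre) and every constant here are assumptions for illustration, not WelchLabs' actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

n_train = 10
x_train = np.sort(rng.uniform(-1, 1, n_train))
y_train = true_fn(x_train) + rng.normal(0, 0.1, n_train)
x_test = np.linspace(-1, 1, 200)
y_test = true_fn(x_test)

def fit_min_norm(degree):
    """Least-squares fit in a Legendre basis; for degree+1 > n_train,
    lstsq returns the minimum-norm interpolating solution."""
    A = np.polynomial.legendre.legvander(x_train, degree)
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef

test_err = {}
for degree in [2, 9, 20, 60]:
    pred = np.polynomial.legendre.legval(x_test, fit_min_norm(degree))
    test_err[degree] = float(np.mean((pred - y_test) ** 2))

# Typical picture: test error spikes near degree ~ n_train (the fit
# interpolates the noise), then falls again as the min-norm solution
# in the overparameterised regime turns out smoother.
for degree, err in test_err.items():
    print(degree, err)
```

Past degree n_train−1 every fit passes exactly through the training points; what changes with more parameters is *which* interpolant you get, and the minimum-norm one tends to be the tamer curve.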
@pli_cachete If true, RLVR can shift test time compute into posttraining, which is def valuable. Most deployed problems don't have easy pass@k supervisors, so this is still a genuine improvement in model intelligence Would be nice to keep the small % of valuable output diversity tho
0
1
6