Paul Calcraft Profile
Paul Calcraft

@paul_cal

Followers
6K
Following
34K
Media
832
Statuses
7K

AI is good & bad, actually. Tweeting about AI/ML methods, software dev, research, tech and society, social impact. 20yrs in tech, 10 in ML/AI, PhD in comp sci

London, England
Joined August 2013
@paul_cal
Paul Calcraft
10 months
The story of LLMs playing games, and what we know so far Tic Tac Toe, Chess, Minecraft, NYT Connections, Wordle, Pictionary, Connect 4, Codenames, Snake... 1/n
21
112
1K
@PrimeIntellect
Prime Intellect
3 days
Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more
133
320
2K
@paul_cal
Paul Calcraft
3 days
Bro thought for 16 minutes before telling me that I didn't paste the code in correctly
1
0
5
@paul_cal
Paul Calcraft
5 days
Is there a word for the opposite of reward hacking? Opus achieved the goal but *failed* against the formal spec
@alexalbert__
Alex Albert
5 days
We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic
1
0
6
@paul_cal
Paul Calcraft
7 days
Neuron learning y=mx+c function via gradient descent Quiz: - What's up with the ~magnetic attraction to that line? - Why is it diagonal? - What's the gradient, and why?
0
0
1
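[The neuron demo above can be sketched in a few lines. This is a minimal reconstruction, not the original visualization code: it assumes a single linear neuron w*x + b trained with plain gradient descent on mean squared error against y = mx + c.]

```python
# Toy sketch: one linear "neuron" w*x + b learning y = m*x + c
# via gradient descent on MSE (assumed setup, not the original demo).
import numpy as np

rng = np.random.default_rng(0)
m_true, c_true = 2.0, -1.0
x = rng.uniform(-1, 1, size=64)
y = m_true * x + c_true          # noiseless targets

w, b = 0.0, 0.0                  # neuron parameters
lr = 0.1
for _ in range(500):
    err = (w * x + b) - y
    # gradients of mean squared error w.r.t. w and b
    grad_w = 2 * np.mean(err * x)
    grad_b = 2 * np.mean(err)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # should approach (2.0, -1.0)
```

Plotting (w, b) over the training steps would show the trajectory the quiz alludes to: the parameters slide toward the least-squares solution along the curved geometry of the loss surface.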
@goodside
Riley Goodside
8 days
“Amateur photograph from 1998 of a middle-aged artist copying an image by hand from a computer screen to an oil painting on stretched canvas, but the image is itself the photo of the artist painting the recursive image.” Nano Banana Pro.
249
1K
12K
@jaketropolis
Jaketropolis
12 days
this comic always kills me every time. i laugh like an idiot whenever i remember it
206
2K
67K
@paul_cal
Paul Calcraft
12 days
If you're going to call it (S)earchable (L)og of (A)ll (C)onversations and (K)nowledge... How is your AI search so bad?
0
0
3
@bingyikang
Bingyi Kang
15 days
After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3
80
494
4K
@use_bruno
Bruno
1 month
Security should be default. Stop paying for what should be a free, local API client.
0
55
1K
@paul_cal
Paul Calcraft
22 days
"you would never *feel* shrunk"
0
0
2
@paul_cal
Paul Calcraft
23 days
Think you need to see this
@Kimi_Moonshot
Kimi.ai
23 days
🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built
0
0
2
@paul_cal
Paul Calcraft
29 days
There are more prompts in Heaven and Earth, pringle, than can be dreamt of in your philosophy Refuting bad takes on pringle paper: "Woah we thought LLMs created latent space abstractions but really they're just encoding the prompts directly!" No. Abstractions happen in high D
@GladiaLab
GLADIA Research Lab
1 month
LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)
0
1
7
@darrenangle
darren
1 month
in the kimi-cli, the agent can "send a message to the past", resetting itself to a known checkpoint and including a summary message or instruction "just like sending a D-Mail in Steins;Gate"
@HanchungLee
Han
1 month
finally we have a company building cli using standard languages instead of brain rots. go and python ftw. https://t.co/IFWBcPDwsH
38
37
615
@paul_cal
Paul Calcraft
1 month
I asked Sutton a year ago if he thought the bitter lesson suggested LLM post-training was doomed, it seemed to follow imo I'm glad Dwarkesh got us an answer
@paul_cal
Paul Calcraft
1 year
@RichardSSutton Do you think the bitter lesson implies the significant & grueling work on synthetic data pipelines for LLMs (v much about the contents of mind, not the architecture) will be superseded by something much more elegant? Synth approaches seem ad hoc & brittle, yet necessary for now
1
0
2
@paul_cal
Paul Calcraft
1 month
@GladiaLab Have you/has anyone looked at adding privacy preserving noise to embeddings? For vector search use cases we're ranking distance on high-D space so I expect you can be pretty lossy while still v useful
@GladiaLab
GLADIA Research Lab
1 month
Language models are structurally lossless: - Hidden states do not compress or abstract the prompt; - Any system storing them effectively stores the input text itself; - This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable. (5/6)
3
2
21
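[The noisy-embeddings idea in the reply above can be checked with a back-of-envelope experiment. This is my own toy setup, not from the paper: perturb stored unit-norm embeddings with Gaussian noise and see how often nearest-neighbour retrieval still returns the right vector.]

```python
# Toy check: how lossy can stored embeddings be before
# nearest-neighbour retrieval breaks? (Assumed setup, not from the paper.)
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 256
docs = rng.normal(size=(n, d))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit vectors

sigma = 0.2  # per-coordinate noise std; noise norm ~ sigma*sqrt(d), larger than the signal
noisy = docs + sigma * rng.normal(size=(n, d))

# Query with each clean vector against the noisy database.
# Near-orthogonality in high dimensions keeps retrieval accurate
# even though each stored vector is heavily corrupted.
hits = sum(int(np.argmax(noisy @ docs[i]) == i) for i in range(n))
recall = hits / n
print(recall)
```

In this toy, recall stays high despite the noise vector's norm dwarfing the unit-norm signal, which is the intuition behind "pretty lossy while still v useful" for distance ranking. Whether that noise level defeats token-recovery attacks is a separate question the sketch does not answer.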
@paul_cal
Paul Calcraft
1 month
Feature activations on visual elements nicely track across text, ascii art and SVGs in Claude
@tarngerine
julius tarng cyber inspector
1 month
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
0
0
3
@paul_cal
Paul Calcraft
1 month
>gpt-4o-transcribe-diarize >recommend you run it offline Offline as in not realtime, not offline as in on-device/open source :(
@pbbakkum
Peter Bakkum
1 month
A small audio model launch -- gpt-4o-transcribe-diarize This is a diarization-focused ASR model, it's big and slow so we recommend running it offline, but it excels at differentiating speakers, and you can provide voice samples for known speakers up front.
0
0
1
@paul_cal
Paul Calcraft
1 month
Only 24% of a batch of AI written research papers were found to be plagiarised after deeper analysis This sounds surprisingly good? I don't know how good the contributions themselves are, I assume incremental at best
@alex_prompter
Alex Prompter
1 month
This paper just exposed the biggest AI research scam 💀 MIT just proved AI can generate novel research papers. Stanford confirmed it. OpenAI showcased examples. the papers passed peer review at major conferences. scored higher than human-written work on novelty and feasibility.
1
0
2
@mlpowered
Emmanuel Ameisen
1 month
How does an LLM compare two numbers? We studied this in a common counting task, and were surprised to learn that the algorithm it used was: Put each number on a helix, and then twist one helix to compare it to the other. Not your first guess? Not ours either. 🧵
12
74
465
@paul_cal
Paul Calcraft
1 month
The nth order polynomial fit from WelchLabs recent vid is a v nice worked illustration of double descent in model size They also mention grokking (double descent where x axis is training time). iirc grokking can occur *without* regularisation, which is nuts. What's the theory?
0
0
3
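[The polynomial-fit illustration mentioned above can be reconstructed roughly as follows. This is my own sketch, not the WelchLabs code: it assumes min-norm least squares on Legendre features, where the interpolation threshold sits at degree = n_train - 1 and test error often spikes there before falling again at much larger degree.]

```python
# Sketch of the double-descent-in-model-size setup: min-norm polynomial
# regression with increasing degree (assumed setup, not the video's code).
import numpy as np

rng = np.random.default_rng(0)
n_train = 15
x_tr = np.sort(rng.uniform(-1, 1, n_train))
y_tr = np.sin(3 * x_tr) + 0.1 * rng.normal(size=n_train)
x_te = np.linspace(-0.9, 0.9, 200)
y_te = np.sin(3 * x_te)

def features(x, degree):
    # Legendre features are better conditioned than raw powers of x
    return np.polynomial.legendre.legvander(x, degree)

def fit_eval(degree):
    A = features(x_tr, degree)
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)  # min-norm solution
    train = np.mean((A @ coef - y_tr) ** 2)
    test = np.mean((features(x_te, degree) @ coef - y_te) ** 2)
    return train, test

# Underparameterised, at the interpolation threshold, and overparameterised:
for deg in (3, 14, 50):
    print(deg, fit_eval(deg))
```

At degree 14 the model exactly interpolates the 15 noisy points (train error ~0); past that, `lstsq` returns the minimum-norm interpolant, which is the implicit regularisation usually credited for the second descent. Note the tweet's grokking point is a different axis (training time, not model size), and why it happens without explicit regularisation is, as the tweet says, still debated.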
@paul_cal
Paul Calcraft
1 month
@pli_cachete If true, RLVR can shift test time compute into posttraining, which is def valuable. Most deployed problems don't have easy pass@k supervisors, so this is still a genuine improvement in model intelligence Would be nice to keep the small % of valuable output diversity tho
0
1
6