Paul Calcraft @paul_cal X Profile

Paul Calcraft

@paul_cal

Followers

6K

Following

34K

Media

832

Statuses

7K

AI is good & bad, actually. Tweeting about AI/ML methods, software dev, research, tech and society, social impact. 20yrs in tech, 10 in ML/AI, PhD in comp sci

https://t.co/BnkkeM8c3s

London, England

Joined August 2013

Don't wanna be here? Send us removal request.

Paul Calcraft

@paul_cal

10 months

The story of LLMs playing games, and what we know so far Tic Tac Toe, Chess, Minecraft, NYT Connections, Wordle, Pictionary, Connect 4, Codenames, Snake... 1/n

21

112

1K

Prime Intellect

@PrimeIntellect

3 days

Introducing INTELLECT-3: Scaling RL to a 100B+ MoE model on our end-to-end stack Achieving state-of-the-art performance for its size across math, code and reasoning Built using the same tools we put in your hands, from environments & evals, RL frameworks, sandboxes & more

133

320

2K

Dr. Clark Store (Official)

@DrClarkStore

25 days

Detoxing heavy metals is imperative for everyone due to the many sources of them in our environment. Dr. Clark's two supplement kit has cilantro, EDTA, Shilajit and more to detox heavy metals, and an accompanying mineral supplement to restore lost minerals from cleansing.

0

14

61

Paul Calcraft

@paul_cal

3 days

Bro thought for 16 minutes before telling me that I didn't paste the code in correctly

1

0

5

Paul Calcraft

@paul_cal

5 days

Is there a word for the opposite of reward hacking? Opus achieved the goal but *failed* against the formal spec

Alex Albert

@alexalbert__

5 days

We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic

1

0

6

Paul Calcraft

@paul_cal

7 days

Neuron learning y=mx+c function via gradient descent Quiz: - What's up with the ~magnetic attraction to that line? - Why is it diagonal? - What's the gradient, and why?

0

1

Beauty Packaging

@BeautyPackaging

5 days

Beautyworld Middle East 2025 is where beauty’s biggest trends and breakthroughs come to life. Meet buyers from 170+ countries, connect with top brands, and experience it all in Dubai! Oct 27–29, 2025 Dubai World Trade Centre Sponsored by @beautyworldME

0

1

Riley Goodside

@goodside

8 days

“Amateur photograph from 1998 of a middle-aged artist copying an image by hand from a computer screen to an oil painting on stretched canvas, but the image is itself the photo of the artist painting the recursive image.” Nano Banana Pro.

249

1K

12K

Jaketropolis

@jaketropolis

12 days

this comic always kills me every time. i laugh like an idiot whenever i remember it

206

2K

67K

Paul Calcraft

@paul_cal

12 days

If you're going to call it (S)earchable (L)og of (A)ll (C)onversations and (K)nowledge... How is your AI search so bad?

0

3

Bingyi Kang

@bingyikang

15 days

After a year of team work, we're thrilled to introduce Depth Anything 3 (DA3)! 🚀 Aiming for human-like spatial perception, DA3 extends monocular depth estimation to any-view scenarios, including single images, multi-view images, and video. In pursuit of minimal modeling, DA3

80

494

4K

Bruno

@use_bruno

1 month

Security should be default. Stop paying for what should be a free, local API client.

0

55

1K

Paul Calcraft

@paul_cal

22 days

"you would never *feel* shrunk"

0

2

Paul Calcraft

@paul_cal

23 days

Think you need to see this

Kimi.ai

@Kimi_Moonshot

23 days

🚀 Hello, Kimi K2 Thinking! The Open-Source Thinking Agent Model is here. 🔹 SOTA on HLE (44.9%) and BrowseComp (60.2%) 🔹 Executes up to 200 – 300 sequential tool calls without human interference 🔹 Excels in reasoning, agentic search, and coding 🔹 256K context window Built

0

2

Paul Calcraft

@paul_cal

29 days

There are more prompts in Heaven and Earth, pringle, than can be dreamt of in your philosophy Refuting bad takes on pringle paper: "Woah we thought LLMs created latent space abstractions but really they're just encoding the prompts directly!" No. Abstractions happen in high D

GLADIA Research Lab

@GladiaLab

1 month

LLMs are injective and invertible. In our new paper, we show that different prompts always map to different embeddings, and this property can be used to recover input tokens from individual embeddings in latent space. (1/6)

0

1

7

darren

@darrenangle

1 month

in the kimi-cli, the agent can "send a message to to the past", resetting itself to a known checkpoint and including a summary message or insruction "just like sending a D-Mail in Steins;Gate"

Han

@HanchungLee

1 month

finally we have a company building cli using standard languages instead of brain rots. go and python ftw. https://t.co/IFWBcPDwsH

38

37

615

Paul Calcraft

@paul_cal

1 month

I asked Sutton a year ago if he thought the bitter lesson suggested LLM post-training was doomed, it seemed to follow imo I'm glad Dwarkesh got us an answer

Paul Calcraft

@paul_cal

1 year

@RichardSSutton Do you think the bitter lesson implies the significant & grueling work on synthetic data pipelines for LLMs (v much about the contents of mind, not the architecture) will be superseded by something much more elegant? Synth approaches seem ad hoc & brittle, yet necessary for now

1

0

2

KneeOverToesGuy

@kneeovertoesguy

15 days

Lowest price for USA-made 100% organic cotton shorts @ATGUSAMade I hope these inspire someone out there to build closer to home, wherever that is.

18

52

566

Paul Calcraft

@paul_cal

1 month

@GladiaLab Have you/has anyone looked at adding privacy preserving noise to embeddings? For vector search use cases we're ranking distance on high-D space so I expect you can be pretty lossy while still v useful

GLADIA Research Lab

@GladiaLab

1 month

Language models are structurally lossless: - Hidden states do not compress or abstract the prompt; - Any system storing them effectively stores the input text itself; - This impacts privacy, deletion, and compliance: once data enters a Transformer, it remains recoverable. (5/6)

3

2

21

Paul Calcraft

@paul_cal

1 month

Feature activations on visual elements nicely track across text, ascii art and SVGs in Claude

julius tarng cyber inspector

@tarngerine

1 month

What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:

0

3

Paul Calcraft

@paul_cal

1 month

>gpt-4o-transcribe-diarize >recommend you run it offline Offline as in not realtime, not offline as in on-device/open source :(

Peter Bakkum

@pbbakkum

1 month

A small audio model launch -- gpt-4o-transcribe-diarize This is a diarization-focused ASR model, it's big and slow so we recommend running it offline, but it excels at differentiating speakers, and you can provide voice samples for known speakers up front.

0

1

Paul Calcraft

@paul_cal

1 month

Only 24% of a batch of AI written research papers were found to be plagiarised after deeper analysis This sounds surprisingly good? I don't know how good the contributions themselves are, I assume incremental at best

Alex Prompter

@alex_prompter

1 month

This paper just exposed the biggest AI research scam 💀 MIT just proved AI can generate novel research papers. Stanford confirmed it. OpenAI showcased examples. the papers passed peer review at major conferences. scored higher than human-written work on novelty and feasibility.

1

0

2

Revoke.cash

@RevokeCash

9 days

This is what wallet hygiene looks like in meme form.

31

76

510

Emmanuel Ameisen

@mlpowered

1 month

How does an LLM compare two numbers? We studied this in a common counting task, and were surprised to learn that the algorithm it used was: Put each number on a helix, and then twist one helix to compare it to the other. Not your first guess? Not ours either. 🧵

12

74

465

Paul Calcraft

@paul_cal

1 month

The nth order polynomial fit from WelchLabs recent vid is a v nice worked illustration of double descent in model size They also mention grokking (double descent where x axis is training time). iirc grokking can occur *without* regularisation, which is nuts. What's the theory?

0

3

Paul Calcraft

@paul_cal

1 month

@pli_cachete If true, RLVR can shift test time compute into posttraining, which is def valuable. Most deployed problems don't have easy pass@k supervisors, so this is still a genuine improvement in model intelligence Would be nice to keep the small % of valuable output diversity tho

0

1

6