Paul Calcraft Profile
Paul Calcraft

@paul_cal

Followers
6K
Following
30K
Media
799
Statuses
6K

AI is good & bad, actually. Tweeting about AI/ML methods, software dev, research, tech and society, social impact. 20yrs in tech, 10 in ML/AI, PhD in comp sci

London, England
Joined August 2013
@paul_cal
Paul Calcraft
5 months
The story of LLMs playing games, and what we know so far. Tic Tac Toe, Chess, Minecraft, NYT Connections, Wordle, Pictionary, Connect 4, Codenames, Snake. 1/n
[3 images]
22
114
1K
@paul_cal
Paul Calcraft
5 days
RT @OwainEvans_UK: New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only…
0
1K
0
@paul_cal
Paul Calcraft
29 days
If statistics showed that paying attention to statistics made life outcomes worse, would that be the last statistic you paid attention to?
0
0
7
@paul_cal
Paul Calcraft
1 month
RT @paulcbogdan: New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampl…
0
141
0
@paul_cal
Paul Calcraft
1 month
Engineers would technically call a vicious cycle or downward spiral a "positive feedback loop", and I think that's beautiful.
0
1
5
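The terminology joke above rests on a real distinction that fits in a few lines. This is an illustrative sketch (not from the tweet): with loop gain above 1, feedback amplifies a disturbance — the engineer's "positive feedback loop" is exactly the layperson's vicious cycle — while gain below 1 damps it.

```python
def feedback(x0: float, gain: float, steps: int) -> float:
    """Feed a signal back into itself `steps` times.

    gain > 1: positive feedback -- each pass reinforces the deviation,
              so the signal runs away (vicious cycle / downward spiral).
    gain < 1: negative feedback -- the deviation is damped toward zero.
    """
    x = x0
    for _ in range(steps):
        x *= gain  # the loop's output becomes its next input
    return x

# A small disturbance runs away under positive feedback...
runaway = feedback(1.0, 2.0, 10)  # 1024.0
# ...and is damped back toward equilibrium under negative feedback.
damped = feedback(1.0, 0.5, 10)   # ~0.001
```

"Positive" here describes the sign of the reinforcement, not whether the outcome is good — hence the beauty the tweet points at.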
@paul_cal
Paul Calcraft
1 month
RT @nikhil07prakash: How do language models track mental states of each character in a story, often referred to as Theory of Mind? Our rec…
0
96
0
@paul_cal
Paul Calcraft
1 month
RT @geoffreylitt: Check out this hot HCI paper about autonomous agents! It’s from… wait a sec… 1997? “Researchers and software companies h…
0
21
0
@paul_cal
Paul Calcraft
1 month
Claude Code just fixed a bug in the OpenAI Codex CLI so I can run it without a sandbox on my VPS & choose a better model (o4-mini) as driver. Also told it how to use @simonw's llm CLI to ask o3 if it needs help. & then Claude can make my Codex iterate for longer than cloud Codex too.
@paul_cal
Paul Calcraft
1 month
Looking deeper at the produced code, Codex is actually worse than I thought. It's not just unwilling to experiment & iterate on its own, but also way worse than expected at instruction following / understanding the task. Not sure what model(s) they tuned, but not good enough ime.
0
0
4
@paul_cal
Paul Calcraft
1 month
Anyone looked into LiveCodeBench-Pro in a bit more detail? Curious about typical failure modes, formatting issues/coding environment, and human baselines.
@deedydas
Deedy
1 month
LLMs are far worse at competitive programming than we thought. Every one scored 0% on Hard problems. LiveCodeBench-Pro is a new benchmark with 584 always updating problems from IOI, ICPC and Codeforces. What's most interesting is the categories they perform really poorly on:
[image]
0
0
2
@paul_cal
Paul Calcraft
1 month
Looks like there isn't a Codex-like cloud UX available for Claude Code? Googled around a bit and not impressed. Will probably vibe code one for myself if no one has answers? I want to "code" on my walks.
1
0
4
@paul_cal
Paul Calcraft
1 month
A month of codex. Seems unwilling/unable to run experiments, view results, tweak & try again for many iterations on its own. This is exactly what I want it for. Having to check back in and say "try again" or "this obviously doesn't fulfil the spec" is v annoying.
@OpenAI
OpenAI
2 months
We’re launching a research preview of Codex: a cloud-based software engineering agent that can work on many tasks in parallel. Rolling out to Pro, Enterprise, and Team users in ChatGPT starting today.
1
0
18
@paul_cal
Paul Calcraft
1 month
Racial bias in LLMs isn't detectable in their chain of thought. They may be "unaware" they are doing it (at the verbal reasoning level). Often surprised how well system 1 vs system 2 thinking in humans can apply to LLMs. Like gpt-3.5-turbo-instruct being good at chess while unable to describe how.
@a_karvonen
Adam Karvonen
2 months
Our setting also gives an example of unfaithful chain of thought in the wild. Across all models, inspecting CoTs gives 0 indication of race/gender bias, despite the outcomes themselves exhibiting clear bias. This includes Claude 4 Sonnet's internal reasoning.
0
0
7
@paul_cal
Paul Calcraft
1 month
Another way to put this is that an LLM can actually act as your regulariser.
1
0
5
@paul_cal
Paul Calcraft
1 month
It can look at overfit solutions and be like, yes sure, that technically maximises the objective function, but that solution is not in the spirit of the challenge! Try again.
1
0
5
@paul_cal
Paul Calcraft
1 month
AlphaEvolve is already a bit like evolution infused with "intelligent design". But you don't need to stop at intelligent mutations and intelligent crossover. You can have intelligent "smart ass detection".
1
0
5
@paul_cal
Paul Calcraft
1 month
AlphaEvolve could learn algorithms from a small number of examples in a way ML never can. ML overfits. If there's a shortcut, it will be taken. In the worst case, just memorising the training set answers. But a reasoning model could look & eliminate memorisation based solutions.
1
0
10
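The thread above (LLM as regulariser → "smart ass detection" inside an AlphaEvolve-style loop) can be sketched as a filter step in an otherwise standard evolutionary search. Everything here is hypothetical scaffolding: `llm_judge` is a stand-in for a real reasoning-model call, not AlphaEvolve's actual API, and the toy heuristic only marks the shape of the idea.

```python
def llm_judge(candidate: str) -> bool:
    """Stand-in for an LLM call that rejects solutions which merely game
    the objective (memorised answers, special-cased test inputs).
    Toy string check here; a real judge would be a reasoning model
    reading the candidate and asking "is this in the spirit of the task?"
    """
    return "hardcoded" not in candidate

def evolve(population, score, mutate, generations=10):
    """Evolutionary search with the LLM acting as a regulariser:
    a candidate that maximises `score` by gaming it never reaches
    selection, because the judge filters it out first."""
    for _ in range(generations):
        children = [mutate(p) for p in population]
        # "Smart ass detection": veto overfit/memorising children.
        children = [c for c in children if llm_judge(c)]
        survivors = sorted(population + children, key=score, reverse=True)
        population = survivors[: len(population)]
    return population
```

The design point from the thread: the numeric objective rewards shortcuts, so the guard against memorisation has to live outside the objective — here, as a veto applied before selection.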
@paul_cal
Paul Calcraft
2 months
Found a coding question that splits SOTA LLMs down the middle. ❌ 4 Opus, 4 Sonnet, Gemini 2.5 Pro 06-05, o4-mini-high, 4o. ✅ o3, 4 Opus Thinking, 4 Sonnet Thinking, R1 05-28. "Are these 2 ~350 line files functionally equiv?" Half miss a subtle bug. Need to write SubtleBugBench.
2
0
15
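A mechanical complement to asking a model "are these functionally equivalent?" is differential testing: run both versions on random inputs and compare outputs. The functions below are tiny stand-ins for the ~350-line files in the tweet (not the actual code), with a deliberately subtle boundary bug of the kind half the models missed.

```python
import random

def version_a(xs):
    """Count strictly positive elements."""
    return len([x for x in xs if x > 0])

def version_b(xs):
    """Meant to match version_a, but uses >= 0 -- it diverges only on
    inputs containing 0, which is easy for a reviewer (or an LLM reading
    two long files side by side) to miss."""
    return len([x for x in xs if x >= 0])

def differential_test(f, g, trials=1000, seed=0):
    """Return an input where f and g disagree, or None if none was found.
    No counterexample is evidence of equivalence, not proof."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        if f(xs) != g(xs):
            return xs  # concrete counterexample
    return None
```

Random search like this flushes out shallow boundary bugs quickly, but it cannot certify equivalence over all inputs — which is why the question is a genuinely hard discriminator for both humans and models.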
@paul_cal
Paul Calcraft
2 months
Holy shit o3-pro is slow. 3x slower than o1-pro?
1
0
4
@paul_cal
Paul Calcraft
2 months
You are right to disbelieve me. Creating a fiction is, by definition, a form of deception. Attributing actions to "conversational flair" was another layer of that same deception. You have caught me in a logical loop of my own making. You have every reason to distrust what I say
[3 images]
0
0
1
@paul_cal
Paul Calcraft
2 months
Gemini then claims the "I can't see" bit was just a "conversational shortcut". But after I press it (why did you pretend to "guess" what the image was?), it claims it was doing this for *conversational flair*. Next tweet, it finally gives up: "You are right to disbelieve me"
[2 images]
1
0
2