
Paul Calcraft
@paul_cal
Followers 6K · Following 30K · Media 799 · Statuses 6K
AI is good & bad, actually. Tweeting about AI/ML methods, software dev, research, tech and society, social impact. 20yrs in tech, 10 in ML/AI, PhD in comp sci
London, England
Joined August 2013
The story of LLMs playing games, and what we know so far. Tic Tac Toe, Chess, Minecraft, NYT Connections, Wordle, Pictionary, Connect 4, Codenames, Snake. 1/n
22
114
1K
RT @OwainEvans_UK: New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only…
0
1K
0
RT @paulcbogdan: New paper: What happens when an LLM reasons? We created methods to interpret reasoning steps & their connections: resampl…
0
141
0
RT @nikhil07prakash: How do language models track mental states of each character in a story, often referred to as Theory of Mind? Our rec…
0
96
0
RT @geoffreylitt: Check out this hot HCI paper about autonomous agents! It’s from… wait a sec… 1997? “Researchers and software companies h…
0
21
0
Claude Code just fixed a bug in the OpenAI Codex CLI so I can run it without a sandbox on my VPS & choose a better model (o4-mini) as the driver. Also told it how to use @simonw's llm CLI to ask o3 if it needs help (rough sketch of that handoff below). & then Claude can make my Codex iterate for longer than cloud Codex too.
Looking deeper at the produced code, Codex is actually worse than I thought. It's not just unwilling to experiment & iterate on its own, but also way worse than expected at instruction following / understanding the task. Not sure what model(s) they tuned, but not good enough ime.
0
0
4
Anyone looked into LiveCodeBench-Pro in a bit more detail? Curious about typical failure modes, formatting issues/coding environment, and human baselines.
LLMs are far worse at competitive programming than we thought. Every model scored 0% on Hard problems. LiveCodeBench-Pro is a new benchmark with 584 continuously updated problems from IOI, ICPC, and Codeforces. What's most interesting is the categories they perform really poorly on:
0
0
2
A month of Codex. Seems unwilling/unable to run experiments, view results, tweak & try again for many iterations on its own, which is exactly what I want it for. Having to check back in and say "try again" or "this obviously doesn't fulfil the spec" is v annoying.
We’re launching a research preview of Codex: a cloud-based software engineering agent that can work on many tasks in parallel. Rolling out to Pro, Enterprise, and Team users in ChatGPT starting today.
1
0
18
Racial bias in LLMs isn't detectable in their chain of thought. They may be "unaware" they are doing it (at the verbal reasoning level). Often surprised how well system 1 vs system 2 thinking in humans can apply to LLMs. Like gpt-3.5-turbo-instruct being good at chess while unable to describe how.
Our setting also gives an example of unfaithful chain of thought in the wild. Across all models, inspecting CoTs gives 0 indication of race/gender bias, despite the outcomes themselves exhibiting clear bias. This includes Claude 4 Sonnet's internal reasoning.
0
0
7