Tom Bush

@_tom_bush

Followers: 144
Following: 53
Media: 12
Statuses: 34

AI Alignment and Interpretability | Research Scholar @ MATS | Incoming DPhil @ Oxford

Joined November 2023
@_tom_bush
Tom Bush
4 months
🤖 !! Model-free agents can internally plan !! 🤖
In our ICLR 2025 paper, we interpret a model-free RL agent and show that it internally performs a form of planning resembling bidirectional search.
7
46
322
@_tom_bush
Tom Bush
3 months
Attending #ICLR2025 to present this research - please reach out if you want to chat about interp, reasoning, or anything else!!
@farairesearch
FAR.AI
3 months
🤖 Model-free agents can internally plan! Sokoban agents develop bidirectional search planning!
🔬 We probe for planning concepts
⚙️ Investigate plan formation
✅ Verify plans impact behavior
Chat with us at #ICLR2025!
📍 Apr 24: Poster 10am + Oral 4:06pm SGT
0
2
17
@_tom_bush
Tom Bush
3 months
RT @_fernando_rosas: Preprint time: “AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability…
0
20
0
@_tom_bush
Tom Bush
4 months
This research would not have been possible without the amazing support and mentorship I received from my brilliant co-authors @Stephen36910351, @usmananwar391, @AdriGarriga and @DavidSKrueger. Paper: Blog post:
0
0
6
@_tom_bush
Tom Bush
4 months
For instance, in the example below, the agent initially plans to push the upper-left box to the center-most target by pushing it down and then right. However, the agent realises this is infeasible – once the box has been pushed down, it is stuck – and then forms an alternate plan.
1
0
6
@_tom_bush
Tom Bush
4 months
Furthermore, the agent has learned something akin to the “a-ha!” moments exhibited by RL-tuned LLMs like DeepSeek-R1. That is, the agent often (1) forms an initial plan, (2) recognises flaws in this plan, and then (3) updates the plan accordingly.
1
1
3
@_tom_bush
Tom Bush
4 months
For instance, in planning forward from boxes and backward from targets, the DRC agent has learned an algorithm that is especially efficient in Sokoban. We think this explains its ability to outperform model-based agents (whose planning depends on handcrafted elements) in Sokoban.
1
0
3
@_tom_bush
Tom Bush
4 months
Why does all of this matter? Because it shows that generic model-free training can give rise to advanced, complex reasoning capabilities, even in tiny agents!
1
0
6
@_tom_bush
Tom Bush
4 months
We also find that these planning representations emerge concurrently with the agent's ability to perform better when given time to "think" at the start of episodes.
1
0
5
@_tom_bush
Tom Bush
4 months
We find that we can intervene on the agent’s activations to force it to form and execute alternate plans, changing its behaviour over entire episodes. For instance, below, we intervene to cause the agent to form and execute a sub-optimally long plan.
1
0
5
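A minimal sketch of the kind of activation intervention described above: adding a probe-derived direction vector to the cell-state activations at a chosen square via a forward hook. The module path (`agent.core`), the probe weights, and the scale are hypothetical placeholders for illustration, not the paper's actual experiment code.

```python
import torch

def make_plan_intervention(probe_weight, direction_class, square, scale=1.0):
    """Return a forward hook that steers the cell state at one board square
    toward a probe's direction class.
    probe_weight: (NUM_CLASSES, C) linear-probe weights; square: (row, col)."""
    direction_vec = probe_weight[direction_class]            # (C,)

    def hook(module, inputs, output):
        # Assumed output shape (B, C, H, W); nudge the activations at (row, col).
        steered = output.clone()
        r, c = square
        steered[:, :, r, c] += scale * direction_vec
        return steered

    return hook

# Hypothetical usage, with `agent.core` standing in for the recurrent block
# whose cell state the probes read; the handle can be removed after a rollout.
# handle = agent.core.register_forward_hook(
#     make_plan_intervention(probe_weight, direction_class=4, square=(3, 4)))
```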
@_tom_bush
Tom Bush
4 months
In the final step of our analysis, we investigate whether the internal plan representations we uncover are causally linked to the agent's behaviour.
1
0
3
@_tom_bush
Tom Bush
4 months
We even find that, when forced to pause and "think" at the start of episodes, the agent's internal plan iteratively improves. This ability to iteratively improve plans explains the phenomenon noted above where the DRC agent solves additional levels when given time to "think"!
1
1
6
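A hedged sketch of how one could watch the internal plan improve over extra "thinking" ticks: feed the same observation for a few recurrent steps before acting and decode the probed plan after each tick. `agent.step`, `probe`, and the state layout are assumed names, not the actual interface.

```python
import torch

@torch.no_grad()
def decode_plan_during_thinking(agent, probe, obs, state, num_thinking_steps=5):
    """Run several recurrent ticks on the same observation without acting,
    decoding the probed plan after each tick to see how it changes.
    Assumes agent.step(obs, state) -> (cell_state, new_state) with cell_state
    of shape (C, H, W), and a square-wise probe giving per-square direction logits."""
    plans = []
    for _ in range(num_thinking_steps):
        cell_state, state = agent.step(obs, state)        # one recurrent "thinking" tick
        logits = probe(cell_state.unsqueeze(0))           # (1, NUM_CLASSES, H, W)
        plans.append(logits.argmax(dim=1).squeeze(0))     # decoded plan per square
    return plans, state
```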
@_tom_bush
Tom Bush
4 months
How do these internal plans form? We find that the agent appears to form internal plans using a fully-learned, Sokoban-specific procedure that involves simultaneously constructing multiple plans iteratively forward from boxes and backwards from targets.
1
1
5
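For intuition only, here is a sketch of explicit bidirectional expansion on a grid: searching forward from box squares and backward from target squares until the frontiers meet. The DRC agent's procedure is fully learned and is not claimed to literally run this code; the grid representation and helper names are assumptions.

```python
from collections import deque

def bidirectional_reachable(grid, boxes, targets):
    """Toy bidirectional expansion: BFS forward from boxes and backward from
    targets over open squares, stopping once the two frontiers touch.
    grid[r][c] is True for an open square; boxes/targets are (row, col) tuples."""
    def neighbors(pos):
        r, c = pos
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc]:
                yield nr, nc

    forward, backward = set(boxes), set(targets)
    f_frontier, b_frontier = deque(boxes), deque(targets)
    while f_frontier and b_frontier:
        if forward & backward:            # frontiers met: a box-to-target route exists
            return forward & backward
        # expand one BFS layer on each frontier
        for frontier, visited in ((f_frontier, forward), (b_frontier, backward)):
            for _ in range(len(frontier)):
                for nxt in neighbors(frontier.popleft()):
                    if nxt not in visited:
                        visited.add(nxt)
                        frontier.append(nxt)
    return forward & backward
```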
@_tom_bush
Tom Bush
4 months
For instance, here, teal and purple arrows indicate that a linear probe decodes that the agent plans to step onto, or push a box off of, a square in the respective direction. Note this internal plan formed by the agent corresponds to a complete plan to solve the level!
1
0
4
@_tom_bush
Tom Bush
4 months
We find that probes can decode, from the agent's activations, plans that the agent formulates in terms of the above two concepts!
1
0
3
@_tom_bush
Tom Bush
4 months
In the second step of our analysis, we use our linear probes to investigate whether the agent uses its representations of these concepts to form internal plans within its activations.
1
0
2
@_tom_bush
Tom Bush
4 months
We find that linear probes can correctly predict the directional classes assigned to each square (i,j) of an observed Sokoban board using only the agent's cell state activations at position (i,j).
1
0
3
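A minimal sketch of what such a square-wise linear probe could look like: a single linear map applied independently at every board position, so it can only read the activations at that square. The channel count, board size, and class count are illustrative assumptions, not the paper's actual settings.

```python
import torch
import torch.nn as nn

# Illustrative shapes (assumptions): a DRC-style cell state of C channels over
# an H x W Sokoban grid, and one direction class per square.
C, H, W, NUM_CLASSES = 32, 10, 10, 5  # e.g. {none, up, down, left, right}

# A 1x1 convolution is a per-square linear probe: the prediction at (i, j)
# depends only on the cell-state activations at (i, j).
probe = nn.Conv2d(C, NUM_CLASSES, kernel_size=1)

def probe_loss(cell_state, direction_labels):
    """cell_state: (B, C, H, W) activations; direction_labels: (B, H, W) ints."""
    logits = probe(cell_state)                       # (B, NUM_CLASSES, H, W)
    return nn.functional.cross_entropy(logits, direction_labels)

# Toy usage with random tensors standing in for recorded agent activations.
cell_state = torch.randn(8, C, H, W)
labels = torch.randint(0, NUM_CLASSES, (8, H, W))
probe_loss(cell_state, labels).backward()
```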
@_tom_bush
Tom Bush
4 months
First, we investigate whether the agent internally represents two concepts that could be used for planning. These concepts capture (1) how the agent navigates the board, and (2) how the agent pushes boxes. These concepts assign direction classes to each square of Sokoban boards.
1
0
3
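A hedged sketch of how these two square-level concepts might be encoded as labels. The class names and the "none" class are assumptions for illustration only.

```python
from enum import IntEnum
import numpy as np

class Direction(IntEnum):
    NONE = 0   # square not part of the plan
    UP = 1
    DOWN = 2
    LEFT = 3
    RIGHT = 4

H, W = 10, 10  # illustrative board size

# Concept 1: for each square, the direction the agent will next step off it.
agent_move_labels = np.full((H, W), Direction.NONE, dtype=np.int64)

# Concept 2: for each square, the direction a box on it will next be pushed.
box_push_labels = np.full((H, W), Direction.NONE, dtype=np.int64)

# e.g. "the agent plans to step onto square (3, 4) and push the box there right"
agent_move_labels[3, 4] = Direction.RIGHT
box_push_labels[3, 4] = Direction.RIGHT
```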
@_tom_bush
Tom Bush
4 months
We take a concept-based interpretability approach to investigating whether this behaviour is the result of the DRC agent internally planning. In doing so, we perform three steps of analysis.
1
0
3
@_tom_bush
Tom Bush
4 months
Specifically, we study a Sokoban-playing DRC agent as introduced by Guez et al. (2019). DRC agents are generic recurrent model-free agents that behave as though they perform planning. For example, DRC agents solve extra Sokoban levels when forced to "think" prior to acting.
1
0
4
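A hedged sketch of the "thinking time" evaluation this refers to: give the agent extra recurrent ticks on its first observation before it is allowed to act, and compare solve rates. `agent.initial_state`, `agent.step`, `agent.act`, and `env` are hypothetical stand-ins, not the paper's code.

```python
def solve_rate_with_thinking(agent, env, levels, thinking_steps=0):
    """Fraction of Sokoban levels solved when the agent is forced to 'think'
    for `thinking_steps` extra recurrent ticks before its first action."""
    solved = 0
    for level in levels:
        obs = env.reset(level)
        state = agent.initial_state()
        for _ in range(thinking_steps):      # extra ticks; no action is taken
            _, state = agent.step(obs, state)
        done, success = False, False
        while not done:
            action, state = agent.act(obs, state)
            obs, done, success = env.step(action)
        solved += success
    return solved / len(levels)

# e.g. compare solve_rate_with_thinking(agent, env, levels, 0)
#      against solve_rate_with_thinking(agent, env, levels, 5)
```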