Tom Bush

@_tom_bush

Followers: 144
Following: 53
Media: 12
Statuses: 34

AI Alignment and Interpretability | Research Scholar @ MATS | Incoming DPhil @ Oxford

Joined November 2023
@_tom_bush
Tom Bush
4 months
🤖 !! Model-free agents can internally plan !! 🤖
In our ICLR 2025 paper, we interpret a model-free RL agent and show that it internally performs a form of planning resembling bidirectional search.
7
46
322
@_tom_bush
Tom Bush
3 months
Attending #ICLR2025 to present this research - please reach out if you want to chat about interp, reasoning, or anything else!!
@farairesearch
FAR.AI
3 months
🤖 Model-free agents can internally plan! Sokoban agents develop bidirectional search planning!
🔬 We probe for planning concepts
⚙️ Investigate plan formation
✅ Verify plans impact behavior
Chat with us at #ICLR2025!
📍 Apr 24: Poster 10am + Oral 4:06pm SGT
0
2
17
@_tom_bush
Tom Bush
3 months
RT @_fernando_rosas: Preprint time: “AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability…
0
20
0
@_tom_bush
Tom Bush
4 months
This research would not have been possible without the amazing support and mentorship I received from my brilliant co-authors @Stephen36910351, @usmananwar391, @AdriGarriga and @DavidSKrueger. Paper: Blog post:
0
0
6
@_tom_bush
Tom Bush
4 months
For instance, in the example below, the agent initially plans to push the upper-left box to the center-most target by pushing it down and then right. However, the agent realises this is infeasible – once the box has been pushed down, it is stuck – and then forms an alternate plan.
1
0
6
@_tom_bush
Tom Bush
4 months
Furthermore, the agent has learned something akin to the “a-ha!” moments exhibited by RL-tuned LLMs like DeepSeek-R1. That is, the agent often (1) forms an initial plan, (2) recognises flaws in this plan, and then (3) updates the plan accordingly.
1
1
3
@_tom_bush
Tom Bush
4 months
For instance, in planning forward from boxes and backward from targets, the DRC agent has learned an algorithm that is especially efficient in Sokoban. We think this explains its ability to outperform model-based agents (whose planning depends on handcrafted elements) in Sokoban.
1
0
3
@_tom_bush
Tom Bush
4 months
Why does all of this matter? Because it shows that generic model-free training can give rise to advanced, complex reasoning capabilities, even in tiny agents!
1
0
6
@_tom_bush
Tom Bush
4 months
We also find that these planning representations emerge concurrently with the agent's ability to perform better when given time to "think" at the start of episodes.
1
0
5
@_tom_bush
Tom Bush
4 months
We find that we can intervene on the agent’s activations to force it to form and execute alternate plans, changing its behaviour over entire episodes. For instance, below, we intervene to cause the agent to form and execute a sub-optimally long plan.
1
0
5
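A minimal sketch of the kind of activation intervention described above: adding a probe-derived direction vector to the cell-state activations at a chosen square via a forward hook. The module path (`agent.core`), the probe weights, and the scale are hypothetical placeholders for illustration, not the paper's actual experiment code.

```python
import torch

def make_plan_intervention(probe_weight, direction_class, square, scale=1.0):
    """Return a forward hook that steers the cell state at one board square
    toward a probe's direction class.
    probe_weight: (NUM_CLASSES, C) linear-probe weights; square: (row, col)."""
    direction_vec = probe_weight[direction_class]            # (C,)

    def hook(module, inputs, output):
        # Assumed output shape (B, C, H, W); nudge the activations at (row, col).
        steered = output.clone()
        r, c = square
        steered[:, :, r, c] += scale * direction_vec
        return steered

    return hook

# Hypothetical usage, with `agent.core` standing in for the recurrent block
# whose cell state the probes read; the handle can be removed after a rollout.
# handle = agent.core.register_forward_hook(
#     make_plan_intervention(probe_weight, direction_class=4, square=(3, 4)))
```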
@_tom_bush
Tom Bush
4 months
In the final step of our analysis, we investigate whether the internal plan representations we uncover are causally linked to the agent's behaviour.
1
0
3
@_tom_bush
Tom Bush
4 months
We even find that, when forced to pause and "think" at the start of episodes, the agent's internal plan iteratively improves. This ability to iteratively improve plans explains the phenomenon noted above where the DRC agent solves additional levels when given time to "think"!
1
1
6
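A hedged sketch of how one could watch the internal plan improve over extra "thinking" ticks: feed the same observation for a few recurrent steps before acting and decode the probed plan after each tick. `agent.step`, `probe`, and the state layout are assumed names, not the actual interface.

```python
import torch

@torch.no_grad()
def decode_plan_during_thinking(agent, probe, obs, state, num_thinking_steps=5):
    """Run several recurrent ticks on the same observation without acting,
    decoding the probed plan after each tick to see how it changes.
    Assumes agent.step(obs, state) -> (cell_state, new_state) with cell_state
    of shape (C, H, W), and a square-wise probe giving per-square direction logits."""
    plans = []
    for _ in range(num_thinking_steps):
        cell_state, state = agent.step(obs, state)        # one recurrent "thinking" tick
        logits = probe(cell_state.unsqueeze(0))           # (1, NUM_CLASSES, H, W)
        plans.append(logits.argmax(dim=1).squeeze(0))     # decoded plan per square
    return plans, state
```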
@_tom_bush
Tom Bush
4 months
How do these internal plans form? We find that the agent appears to form internal plans using a fully-learned, Sokoban-specific procedure that involves simultaneously constructing multiple plans iteratively forward from boxes and backwards from targets.
1
1
5
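For intuition only, here is a sketch of explicit bidirectional expansion on a grid: searching forward from box squares and backward from target squares until the frontiers meet. The DRC agent's procedure is fully learned and is not claimed to literally run this code; the grid representation and helper names are assumptions.

```python
from collections import deque

def bidirectional_reachable(grid, boxes, targets):
    """Toy bidirectional expansion: BFS forward from boxes and backward from
    targets over open squares, stopping once the two frontiers touch.
    grid[r][c] is True for an open square; boxes/targets are (row, col) tuples."""
    def neighbors(pos):
        r, c = pos
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc]:
                yield nr, nc

    forward, backward = set(boxes), set(targets)
    f_frontier, b_frontier = deque(boxes), deque(targets)
    while f_frontier and b_frontier:
        if forward & backward:            # frontiers met: a box-to-target route exists
            return forward & backward
        # expand one BFS layer on each frontier
        for frontier, visited in ((f_frontier, forward), (b_frontier, backward)):
            for _ in range(len(frontier)):
                for nxt in neighbors(frontier.popleft()):
                    if nxt not in visited:
                        visited.add(nxt)
                        frontier.append(nxt)
    return forward & backward
```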
@_tom_bush
Tom Bush
4 months
For instance, here, teal and purple arrows indicate that a linear probe decodes that the agent plans to step onto, or push a box off of, a square in the respective direction. Note this internal plan formed by the agent corresponds to a complete plan to solve the level!
1
0
4
@_tom_bush
Tom Bush
4 months
We find that probes can decode, from the agent's activations, plans that the agent formulates in terms of the above two concepts!
1
0
3
@_tom_bush
Tom Bush
4 months
In the second step of our analysis, we use our linear probes to investigate whether the agent uses its representations of these concepts to form internal plans within its activations.
1
0
2
@_tom_bush
Tom Bush
4 months
We find that linear probes can correctly predict the directional classes assigned to each square (i,j) of an observed Sokoban board using only the agent's cell state activations at position (i,j).
1
0
3
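A minimal sketch of what such a square-wise linear probe could look like: a single linear map applied independently at every board position, so it can only read the activations at that square. The channel count, board size, and class count are illustrative assumptions, not the paper's actual settings.

```python
import torch
import torch.nn as nn

# Illustrative shapes (assumptions): a DRC-style cell state of C channels over
# an H x W Sokoban grid, and one direction class per square.
C, H, W, NUM_CLASSES = 32, 10, 10, 5  # e.g. {none, up, down, left, right}

# A 1x1 convolution is a per-square linear probe: the prediction at (i, j)
# depends only on the cell-state activations at (i, j).
probe = nn.Conv2d(C, NUM_CLASSES, kernel_size=1)

def probe_loss(cell_state, direction_labels):
    """cell_state: (B, C, H, W) activations; direction_labels: (B, H, W) ints."""
    logits = probe(cell_state)                       # (B, NUM_CLASSES, H, W)
    return nn.functional.cross_entropy(logits, direction_labels)

# Toy usage with random tensors standing in for recorded agent activations.
cell_state = torch.randn(8, C, H, W)
labels = torch.randint(0, NUM_CLASSES, (8, H, W))
probe_loss(cell_state, labels).backward()
```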
@_tom_bush
Tom Bush
4 months
First, we investigate whether the agent internally represents two concepts that could be used for planning. These concepts capture (1) how the agent navigates the board, and (2) how the agent pushes boxes. These concepts assign direction classes to each square of Sokoban boards.
1
0
3
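A hedged sketch of how these two square-level concepts might be encoded as labels. The class names and the "none" class are assumptions for illustration only.

```python
from enum import IntEnum
import numpy as np

class Direction(IntEnum):
    NONE = 0   # square not part of the plan
    UP = 1
    DOWN = 2
    LEFT = 3
    RIGHT = 4

H, W = 10, 10  # illustrative board size

# Concept 1: for each square, the direction the agent will next step off it.
agent_move_labels = np.full((H, W), Direction.NONE, dtype=np.int64)

# Concept 2: for each square, the direction a box on it will next be pushed.
box_push_labels = np.full((H, W), Direction.NONE, dtype=np.int64)

# e.g. "the agent plans to step onto square (3, 4) and push the box there right"
agent_move_labels[3, 4] = Direction.RIGHT
box_push_labels[3, 4] = Direction.RIGHT
```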
@_tom_bush
Tom Bush
4 months
We take a concept-based interpretability approach to investigating whether this behaviour is the result of the DRC agent internally planning. In doing so, we perform three steps of analysis.
1
0
3
@_tom_bush
Tom Bush
4 months
Specifically, we study a Sokoban-playing DRC agent as introduced by Guez et al. (2019). DRC agents are generic recurrent model-free agents that behave as though they perform planning. For example, DRC agents solve extra Sokoban levels when forced to "think" prior to acting.
1
0
4
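A hedged sketch of the "thinking time" evaluation this refers to: give the agent extra recurrent ticks on its first observation before it is allowed to act, and compare solve rates. `agent.initial_state`, `agent.step`, `agent.act`, and `env` are hypothetical stand-ins, not the paper's code.

```python
def solve_rate_with_thinking(agent, env, levels, thinking_steps=0):
    """Fraction of Sokoban levels solved when the agent is forced to 'think'
    for `thinking_steps` extra recurrent ticks before its first action."""
    solved = 0
    for level in levels:
        obs = env.reset(level)
        state = agent.initial_state()
        for _ in range(thinking_steps):      # extra ticks; no action is taken
            _, state = agent.step(obs, state)
        done, success = False, False
        while not done:
            action, state = agent.act(obs, state)
            obs, done, success = env.step(action)
        solved += success
    return solved / len(levels)

# e.g. compare solve_rate_with_thinking(agent, env, levels, 0)
#      against solve_rate_with_thinking(agent, env, levels, 5)
```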