Matthew Farrugia-Roberts

@MatthewFdashR

22 Followers · 14 Following · 6 Media · 20 Statuses

Grad student trying to understand the history of humanity, the future of AI, and how to make both of these things work together in the present.

Oxford, UK
Joined June 2025
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
@jesse_hoogland Goal misgeneralisation remains an important risk model for future advanced AI systems. We should continue to research how neural networks choose between different solutions, and turn that understanding into methods for avoiding unintended and dangerous solutions in the future.
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
@jesse_hoogland For more complex environments, we still need better UED methods. But UED is young! There are plenty of plausible directions for improving on the methods proposed so far. The question is whether there is enough room for improvement for this to help when it counts.
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
@jesse_hoogland … At least, that is the vision! In our paper, we take the first steps. We formally show that UED's minimax regret objective fixes goal misgeneralisation, and we show that UED baselines are more robust to goal misgeneralisation than standard methods in simple environments.
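For reference, the minimax regret objective mentioned here is usually written roughly as follows (standard UED notation; the paper's exact formalisation may differ):

\pi^\star \in \arg\min_{\pi} \; \max_{\theta \in \Theta} \; \mathrm{Regret}_\theta(\pi),
\qquad
\mathrm{Regret}_\theta(\pi) = \max_{\pi'} U_\theta(\pi') - U_\theta(\pi),

where U_\theta(\pi) is the expected return of policy \pi in the environment (level) with parameters \theta. The adversary's role is to propose the \theta that maximises the inner regret term, and the student is trained to drive that worst-case regret down.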
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
One reason to be optimistic about UED for alignment is that it leans into the "sweet lesson", à la @jesse_hoogland (: It translates *more capabilities* and *more compute* (on the part of the adversary) into *more robust alignment*.
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
UED thereby *dynamically adapts the training distribution to get rid of inductively-preferable proxy solutions*, until the intended solution is the uniquely most plausible solution!
[image]
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
Unsupervised environment design (UED) mitigates this problem with RL training methods by *changing the rules of the game*. Basically, you incentivise a second AI model (the 'adversary') to hunt for previously-unseen, 'high-regret' cases during training, and train on those cases.
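As a concrete (if heavily simplified) illustration of this loop, here is a self-contained toy sketch in Python. Everything in it, from the one-step "levels" to the update rule, is an illustrative assumption rather than the paper's actual setup: the adversary simply surfaces the level where the current student has the highest regret, and the student does gradient ascent on its return in that level.

import numpy as np

rng = np.random.default_rng(0)
N_LEVELS, N_ACTIONS = 50, 4

# Each "level" theta is just a row of payoffs: rewards[theta, a] is the payoff
# for taking action a in level theta.
rewards = rng.uniform(0.0, 1.0, size=(N_LEVELS, N_ACTIONS))

# Student: a tabular softmax policy, one action distribution per level.
logits = np.zeros((N_LEVELS, N_ACTIONS))

def policy(theta):
    z = np.exp(logits[theta] - logits[theta].max())
    return z / z.sum()

def regret(theta):
    # Regret in level theta: best achievable payoff minus the student's expected payoff.
    p = policy(theta)
    return rewards[theta].max() - p @ rewards[theta]

for step in range(2000):
    # Adversary: surface the level where the current student is worst (highest regret).
    theta = max(range(N_LEVELS), key=regret)
    # Student: exact gradient ascent on its expected payoff in the surfaced level.
    p = policy(theta)
    logits[theta] += 0.5 * p * (rewards[theta] - p @ rewards[theta])

print("worst-case regret after training:", max(regret(t) for t in range(N_LEVELS)))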
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
We need to watch out for situations where a proxy solution “screens off” the intended solution, leading to competent but unintended generalisation. There are many examples of this in supervised learning, and prior work on goal misgeneralisation has shown it for RL.
[image]
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
For AI agents, one solution might be a competent agent with a correct interpretation of the intended objective 😇. Another might be a *proxy*: a competent agent with an interpretation of the objective that is flawed in unseen cases 😈. Which will the inductive biases prefer?
[image]
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
A generic model of generalisation is that, of the many solutions that fit the data, some are preferred more than others by the various inductive biases. The solution that best fits the data *and the inductive biases* is the one the network will tend to learn.
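To make this picture concrete, here is a tiny self-contained toy in Python (my own illustration, not the paper's formal setup): several candidate rules fit the training data equally well, a simplicity-style "inductive bias" picks among them, and that choice decides what the learner does on unseen inputs.

import numpy as np

# Training data: the inputs happen to be both even and large, so several rules
# fit the labels equally well.
train_x = np.array([2, 4, 6, 8])
train_y = np.array([1, 1, 1, 1])   # intended rule: label 1 iff x is even

hypotheses = {
    "always predict 1":     (lambda x: np.ones_like(x),          1),  # (rule, "complexity")
    "x is even":            (lambda x: (x % 2 == 0).astype(int), 2),
    "x is greater than 1":  (lambda x: (x > 1).astype(int),      2),
}

def fits(rule):
    return np.array_equal(rule(train_x), train_y)

# "Inductive bias": among rules consistent with the data, prefer the lowest complexity.
consistent = {name: rc for name, rc in hypotheses.items() if fits(rc[0])}
chosen = min(consistent, key=lambda name: consistent[name][1])

print("rules consistent with the data:", list(consistent))
print("rule the learner picks:", chosen)
print("its prediction on the unseen input 7:", consistent[chosen][0](np.array([7])))  # intended answer: 0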
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
We know that neural networks are capable of learning almost anything we throw at them, but their success in generalising out of distribution can be more hit and miss. We still don't know how the inductive biases of neural networks work (working on it, story for another day).
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
If/when AIs get deployed into high-stakes roles (idk, like running our civilisation or something?) they will inevitably face *new* situations. Things could get very bad if these AI systems have a flawed generalisation of our goals driving their actions at this point.
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
AIs of the future will be trained in open-ended environments resembling the real world. But the real world is complex and always changing (not least because of the introduction of advanced AIs!). Naive training methods can only hope to explore a tiny slice of this complexity.
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
TL;DR: Standard RL training methods may be too imprecise for agents to internalise our intended goals when proxy goals “screen them off”. In contrast, an adversarial curriculum can hunt for proxy goals and decorrelate them, leaving the intended goal as the only viable solution.
[image]
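Here is a toy illustration of the decorrelation point, in the same spirit as the sketches above (again my own construction, with hand-picked "hard" cases standing in for what an adversary would find automatically): once the curriculum contains cases where the proxy rules and the intended rule disagree, only the intended rule stays consistent with the data.

import numpy as np

# Candidate rules the learner might internalise; the intended one is "x is even".
hypotheses = {
    "always predict 1":     lambda x: np.ones_like(x),
    "x is even":            lambda x: (x % 2 == 0).astype(int),
    "x is greater than 1":  lambda x: (x > 1).astype(int),
}
intended = hypotheses["x is even"]

def survivors(xs):
    ys = intended(xs)   # ground-truth labels come from the intended rule
    return [name for name, rule in hypotheses.items() if np.array_equal(rule(xs), ys)]

easy_curriculum = np.array([2, 4, 6, 8])          # proxies and intended rule agree everywhere
hard_curriculum = np.array([2, 4, 6, 8, 7, 1])    # 7 and 1 pull the proxies apart from "x is even"

print("rules consistent after easy curriculum:", survivors(easy_curriculum))
print("rules consistent after hard curriculum:", survivors(hard_curriculum))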
Matthew Farrugia-Roberts (@MatthewFdashR) · 1 day
At least for me, the big-picture motivation behind our RLC paper is a research vision for scalable AI alignment via minimax regret autocurricula. Learn about the paper via co-author @Karim_abdelll: 🧵👉 Learn about why I think this is important work 🧵👇.
Karim Abdel Sadek (@Karim_abdelll) · 1 day
*New AI Alignment Paper*. 🚨 Goal misgeneralization occurs when AI agents learn the wrong reward function, instead of the human's intended goal. 😇 We show that training with a minimax regret objective provably mitigates it, promoting safer and better-aligned RL policies!
Matthew Farrugia-Roberts (@MatthewFdashR) · 2 days
Our paper, "Mitigating goal misgeneralization via minimax regret," will appear at RLC 2025! Congratulations to my co-authors @Karim_abdelll, @usmananwar391, @hannaherlebach, @casdewitt, @DavidSKrueger, and @MichaelD1729 🎉 Preprint out now. Thread soon!
Matthew Farrugia-Roberts (@MatthewFdashR) · 5 days
Accordingly, last year I was invited to give a guest lecture, for the final week of @UniMelb's COMP90087 The Ethics of Artificial Intelligence, on ethical questions raised by potential future advances in AI.
Matthew Farrugia-Roberts (@MatthewFdashR) · 5 days
There are many important social and ethical issues raised by today’s AI technologies. It's also true that as we project developments in AI technology into the future, we can foresee new and different ethical issues that might arise.
Matthew Farrugia-Roberts (@MatthewFdashR) · 5 days
"Turing trees" are a graphical perspective on some concepts in the theory of computation.
[image]
Matthew Farrugia-Roberts (@MatthewFdashR) · 5 days
In the first of a series(?) of posts reflecting on my goals for grad school, I wrote an essay on "balanced academic orbit".