Sebastian Farquhar

@seb_far

Followers 2K · Following 346 · Media 51 · Statuses 597

Research Scientist @DeepMind - AI Alignment. Associate Member @OATML_Oxford and RainML @UniofOxford. All views my dog's.

Oxford, UK
Joined September 2012
@seb_far
Sebastian Farquhar
7 months
In the final stages of assembling your ICML submission? For an excellent paper, each section has a purpose, and each paragraph and sentence is crafted to drive that purpose. Tips on how to get the most out of your paper in the linked reply 👇🔗
Replies: 1 · Reposts: 2 · Likes: 8
@seb_far
Sebastian Farquhar
7 months
RT @ancadianadragan: New paper from my team on avoiding reward hacking. MONA reduced RL's ability to pursue a multi-turn reward hacking str….
Replies: 0 · Reposts: 6 · Likes: 0
@seb_far
Sebastian Farquhar
7 months
RT @rohinmshah: New AI safety paper! Introduces MONA, which avoids incentivizing alien long-term plans. This also implies “no long-term RL….
Replies: 0 · Reposts: 13 · Likes: 0
@seb_far
Sebastian Farquhar
7 months
By default, LLM agents with long action sequences use early steps to undermine your evaluation of later steps; a big alignment risk. Our new paper mitigates this, preserves the ability for long-term planning, and doesn't assume you can detect the undermining strategy. 👇
@davlindner
David Lindner
7 months
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵
Replies: 1 · Reposts: 1 · Likes: 18
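For context on the idea in this thread: a minimal sketch of MONA-style training, assuming a toy environment, a linear softmax policy, and a hypothetical overseer_approval() stand-in for the non-myopic approval signal. All names and the setup are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 4, 8
theta = np.zeros((n_actions, n_features))  # linear softmax policy

def policy_probs(state):
    logits = theta @ state
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def overseer_approval(state, action):
    # Hypothetical stand-in for a human or model judging whether the
    # action looks good for the long run, without the agent optimizing
    # through future environment steps.
    return float(action == int(np.argmax(state)) % n_actions)

def env_reward(state, action):
    # Toy immediate reward; a real multi-step environment goes here.
    return 0.1 * float(action == 0)

for episode in range(1000):
    for t in range(5):  # a 5-step "episode"
        state = rng.normal(size=n_features)
        probs = policy_probs(state)
        action = int(rng.choice(n_actions, p=probs))
        # MONA's key move: train each step on immediate reward plus
        # non-myopic approval, with NO return flowing back from later
        # steps (gamma = 0), so early steps can't be credited for
        # setting up a multi-step reward hack.
        r = env_reward(state, action) + overseer_approval(state, action)
        grad = -np.outer(probs, state)      # d log pi / d theta
        grad[action] += state
        theta += 0.01 * r * grad            # myopic REINFORCE update
```

The design point is the reward line: each policy update uses only that step's reward plus approval, so there is no optimization pressure for early actions to set up later reward hacks, while the overseer's approval still rewards sensible long-term plans.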
@seb_far
Sebastian Farquhar
9 months
Did you know that on other Twitter-like sites people actually post links to neat articles and pages? I'd forgotten what a killer feature that was. 10x the value from 1/10th the posts.
Replies: 1 · Reposts: 0 · Likes: 6
@seb_far
Sebastian Farquhar
10 months
RT @AllanDafoe: Update: we’re hiring for multiple positions! Join GDM to shape the frontier of AI safety, governance, and strategy. Priorit….
Replies: 0 · Reposts: 20 · Likes: 0
@seb_far
Sebastian Farquhar
1 year
RT @ancadianadragan: So freaking proud of the AGI safety&alignment team -- read here a retrospective of the work over the past 1.5 years ac….
alignmentforum.org
We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motiv…
Replies: 0 · Reposts: 62 · Likes: 0
@seb_far
Sebastian Farquhar
1 year
RT @NeelNanda5: Sparse Autoencoders act like a microscope for AI internals. They're a powerful tool for interpretability, but training cost….
Replies: 0 · Reposts: 150 · Likes: 0
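For readers new to the tool being described: a minimal sketch of a sparse autoencoder over model activations, assuming an L1 sparsity penalty and random vectors standing in for real LLM activations; dimensions, hyperparameters, and the hand-rolled gradients (constant factors folded into the learning rate) are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict, lam, lr = 64, 512, 1e-3, 1e-3
W_enc = rng.normal(scale=0.01, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.01, size=(d_dict, d_model))

for step in range(1000):
    x = rng.normal(size=(32, d_model))       # stand-in for LLM activations
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # sparse feature activations
    x_hat = f @ W_dec                        # reconstruction
    err = x_hat - x
    # Loss = reconstruction error + lam * L1(f); gradient of the L1 term
    # is masked by the ReLU derivative (f > 0).
    grad_pre = (err @ W_dec.T + lam) * (f > 0)
    W_dec -= lr * (f.T @ err) / len(x)
    W_enc -= lr * (x.T @ grad_pre) / len(x)
    b_enc -= lr * grad_pre.mean(axis=0)
```

The sparsity penalty pushes most feature activations to exactly zero, which is what makes the learned dictionary directions readable as individual "features" under the microscope metaphor.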
@seb_far
Sebastian Farquhar
1 year
RT @ancadianadragan: Gemini 1.5 Pro is the safest model on the Scale Adversarial Robustness Leaderboard! We’ve made a number of innovations….
Replies: 0 · Reposts: 25 · Likes: 0
@seb_far
Sebastian Farquhar
1 year
RT @NeelNanda5: New GDM mech interp paper led by @sen_r: JumpReLU SAEs, a new SOTA SAE method! We replace standard ReLUs with discontinuous….
Replies: 0 · Reposts: 17 · Likes: 0
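The "discontinuous" replacement the truncated retweet refers to is the JumpReLU activation. A minimal sketch below; the threshold value is chosen for illustration (in the SAE it is learned per feature).

```python
import numpy as np

def jumprelu(z, theta):
    # z * H(z - theta): pass the pre-activation through only when it
    # clears the threshold theta; below it, the output is exactly zero,
    # making the activation discontinuous at z = theta.
    return z * (z > theta)

z = np.array([-0.5, 0.1, 0.3, 1.2])
print(jumprelu(z, theta=0.2))  # [0.  0.  0.3 1.2]
```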
@seb_far
Sebastian Farquhar
1 year
RT @ZacKenton1: Eventually, humans will need to supervise superhuman AI - but how? Can we study it now? We don't have superhuman AI, but w….
Replies: 0 · Reposts: 61 · Likes: 0
@seb_far
Sebastian Farquhar
1 year
RT @CompSciOxford: Major study out now in Nature by Prof Yarin Gal @yaringal, Dr Sebastian Farquhar @seb_far, Jannik Kossen @janundnik & Lo….
Replies: 0 · Reposts: 3 · Likes: 0
@seb_far
Sebastian Farquhar
1 year
RT @karinv: I was invited by @NatureNV to comment on a paper recently published in @Nature by @seb_far @janundnik @_lorenzkuhn @yaringal on….
Replies: 0 · Reposts: 8 · Likes: 0
@seb_far
Sebastian Farquhar
1 year
Overview of our paper on detecting hallucinations in large language models with semantic entropy, from @ScienceMagazine.
science.org
Second AI that acts as “truth cop” could provide the reliability models need for rollout in health care, education, and other fields
Replies: 0 · Reposts: 3 · Likes: 14
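For a concrete sense of the method the paper describes: a minimal sketch of (discrete) semantic entropy, which samples several answers, clusters them by bidirectional entailment, and takes the entropy over cluster frequencies. The entails() placeholder would be an NLI model in practice; the example strings are illustrative.

```python
import math

def entails(a: str, b: str) -> bool:
    # Placeholder for an NLI entailment check between two answers.
    return a.lower() == b.lower()

def semantic_entropy(samples: list[str]) -> float:
    clusters: list[list[str]] = []
    for s in samples:
        for c in clusters:
            # Bidirectional entailment = same meaning -> same cluster.
            if entails(s, c[0]) and entails(c[0], s):
                c.append(s)
                break
        else:
            clusters.append([s])
    probs = [len(c) / len(samples) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

print(semantic_entropy(["Paris", "paris", "Paris"]))     # 0.0
print(semantic_entropy(["Paris", "Lyon", "Marseille"]))  # ~1.10
```

High entropy means the model's samples disagree in meaning, which the paper uses as a signal of likely confabulation; agreement across samples gives entropy near zero.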
@seb_far
Sebastian Farquhar
1 year
Excellent piece by @karinv in @nature News and Views discussing our recent paper on detecting hallucinations with semantic entropy.
Replies: 0 · Reposts: 3 · Likes: 10
@seb_far
Sebastian Farquhar
1 year
RT @janundnik: Our work on detecting hallucinations in LLMs just got published in @Nature! Check it out :)
Replies: 0 · Reposts: 9 · Likes: 0