Niklas Stoehr

@niklas_stoehr

Followers: 1K
Following: 4K
Media: 45
Statuses: 185

Research Scientist @GoogleDeepMind and PhD from @ETH ⭕️ Gemini Post-Training ⭕️

Zurich, Switzerland
Joined October 2017
@niklas_stoehr
Niklas Stoehr
19 days
RT @valentina__py: Interested in shaping the progress of responsible AI and meeting leading researchers in the field? SoLaR@COLM 2025 is lo…
0
6
0
@niklas_stoehr
Niklas Stoehr
26 days
I recently defended my PhD and moved from one dream team at ETH Zurich to another at DeepMind—a huge thank you to the many people who have supported me along the way! 🤖
15
13
741
@niklas_stoehr
Niklas Stoehr
1 month
RT @aseveryn: Gemini is unstoppable! More to come!
0
1
0
@niklas_stoehr
Niklas Stoehr
4 months
RT @alexandrutifrea: Very excited about this tutorial at #AAAI2025 on inducing privacy, fairness or robustness to distribution shifts when…
0
2
0
@niklas_stoehr
Niklas Stoehr
8 months
RT @MilaNLProc: 📖 For our weekly @MilaNLProc lab seminar, it was a pleasure to have @joabaum in person, introducing his work on fact-checki…
0
2
0
@niklas_stoehr
Niklas Stoehr
8 months
RT @kevdududu: Very glad and grateful to share this fun work, especially because of the opportunity to work with this immensely talented te…
0
4
0
@niklas_stoehr
Niklas Stoehr
8 months
🦋Giving this a shot too: niklasstoehr.
0
0
10
@niklas_stoehr
Niklas Stoehr
8 months
RT @jkminder: Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about…
0
22
0
@niklas_stoehr
Niklas Stoehr
8 months
RT @a_stadt: I'm excited to announce my new lab: UCSD's Learning Meaning and Natural Language Lab, a.k.a. LeM🍋N Lab! And 📢WE ARE RECR…
0
77
0
@niklas_stoehr
Niklas Stoehr
9 months
Finally, we propose an extension of activation scaling to circumvent the requirement of matched train and test set prompts: we make activation scalars the (dynamic) output of a learnable function of the activation vectors themselves to generalize to varying-length prompts.
0
0
2
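To illustrate the dynamic variant described in the post above, here is a minimal sketch in plain Python. The post only says the scalar is the output of "a learnable function of the activation vectors"; the linear form used here, and the names `dynamic_scalar`, `scale_dynamically`, `weights` and `bias`, are assumptions for illustration and not the authors' implementation. The point it shows is that a per-vector function applies to prompts of any length, because no fixed per-position scalar is stored.

```python
def dynamic_scalar(vec, weights, bias):
    """Hypothetical learnable function: a linear map from one activation vector to one scalar."""
    return sum(w * x for w, x in zip(weights, vec)) + bias


def scale_dynamically(activations, weights, bias):
    """Scale every activation vector by a scalar computed from that same vector.

    `activations` holds one vector per token position, so the same
    parameters work for short and long prompts alike.
    """
    return [
        [dynamic_scalar(vec, weights, bias) * x for x in vec]
        for vec in activations
    ]


# Toy parameters (assumed values, not learned here).
weights = [1.0, 0.0]
bias = 1.0

short_prompt = [[1.0, 2.0]]                           # one token position
long_prompt = [[1.0, 2.0], [0.0, 3.0], [0.5, 1.0]]    # three token positions

scaled_short = scale_dynamically(short_prompt, weights, bias)
scaled_long = scale_dynamically(long_prompt, weights, bias)
```

In a real setting `weights` and `bias` would be trained with gradients; here they are fixed so the behaviour is easy to follow by hand.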
@niklas_stoehr
Niklas Stoehr
9 months
On synthetic tasks, activation scaling performs on par with steering vectors (effectiveness, faithfulness) but is fundamentally more parsimonious (minimality). We seek to synthesize steering and interpretability, building upon recent work questioning their relationship (@peterbhase, @wzihao12,…).
1
0
2
@niklas_stoehr
Niklas Stoehr
9 months
Drawing analogies to the fascinating circuits literature (@AdithyaNLP, @ArthurConmy, …), we say a successful intervention should flip the two answer tokens (effectiveness), leave other tokens unaffected (faithfulness) (@michaelwhanna), all while being sparse (minimality).
1
0
4
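The three criteria named in the post above (effectiveness, faithfulness, minimality) can be sketched as terms of a single objective. The post does not give formulas, so everything below is a hypothetical instantiation: the function names, the use of a logit margin, the mean absolute logit change, the L1 distance of the scalars from 1.0, and the weights `lam_faith` and `lam_min` are all assumptions.

```python
def effectiveness(logits_after, target, original):
    """Margin by which the target answer token beats the original one (higher is better)."""
    return logits_after[target] - logits_after[original]


def faithfulness_penalty(logits_before, logits_after, answer_tokens):
    """Mean absolute logit change on every token except the two answer tokens."""
    others = [tok for tok in logits_before if tok not in answer_tokens]
    return sum(abs(logits_after[t] - logits_before[t]) for t in others) / len(others)


def minimality_penalty(scalars):
    """L1 distance of the scalars from 1.0; a scalar of 1.0 means 'no intervention'."""
    return sum(abs(s - 1.0) for s in scalars)


def total_loss(logits_before, logits_after, scalars, target, original,
               lam_faith=1.0, lam_min=0.5):
    """Combine the three terms; lower is better."""
    return (
        -effectiveness(logits_after, target, original)
        + lam_faith * faithfulness_penalty(logits_before, logits_after, {target, original})
        + lam_min * minimality_penalty(scalars)
    )


# Toy numbers (made up for illustration).
before = {"France": 2.0, "Italy": 1.0, "Spain": 0.5}
after = {"France": 0.5, "Italy": 2.5, "Spain": 0.75}
scalars = [1.0, 3.0, 0.0, 1.0]

loss = total_loss(before, after, scalars, target="Italy", original="France")
```

With these toy values the margin is 2.0, the faithfulness penalty is 0.25 and the minimality penalty is 3.0, so the combined loss is -0.25.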
@niklas_stoehr
Niklas Stoehr
9 months
Scaling individual specialized components has been successfully explored by @francescortu, @ZhijingJin, @mrinmayasachan and @jack_merullo_ among others. We train all scalars on a multi-objective loss using gradient-based optimization similar to @evanqed & @belindazli's COLM paper.
1
3
6
@niklas_stoehr
Niklas Stoehr
9 months
Activation scaling may be understood as scaling steering directions already encoded in the model, inspired by Information Flows (@javifer_96, @lena_voita), extracting latent steering vectors by @nsubramani23, concept promotion by @megamor2 or PatchScopes by @ghandeharioun et al.
1
0
4
@niklas_stoehr
Niklas Stoehr
9 months
Given a prompt such as "Rome is in", we find that we can steer a language model to flip its prediction from "France" to "Italy" by only multiplying a few relevant activation vectors with scalars—we term this approach of simply scaling the signed magnitude 🔴 Activation Scaling 🔵
1
0
2
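The "Rome is in" example from the post above can be mimicked with a toy model in plain Python. This is only a sketch under strong assumptions: the "model" is a sum of three made-up component activation vectors followed by a two-token unembedding, and the names `scale_activations`, `token_logits`, `predict` and all numbers are hypothetical. It shows the core operation the post describes, multiplying a few activation vectors by signed scalars, and nothing about how the scalars are found.

```python
def scale_activations(activations, scalars):
    """Multiply each activation vector by its scalar (1.0 leaves a vector unchanged)."""
    return [[s * x for x in vec] for vec, s in zip(activations, scalars)]


def token_logits(activations, unembed):
    """Sum the component activations into a residual vector, then project onto each token direction."""
    dim = len(activations[0])
    residual = [sum(vec[i] for vec in activations) for i in range(dim)]
    return {
        tok: sum(w * r for w, r in zip(direction, residual))
        for tok, direction in unembed.items()
    }


def predict(activations, unembed):
    """Return the token with the highest logit."""
    scores = token_logits(activations, unembed)
    return max(scores, key=scores.get)


# Toy activations for the prompt "Rome is in" (made-up values).
activations = [
    [2.0, 0.0],  # component 0 pushes toward "France"
    [0.0, 1.5],  # component 1 pushes toward "Italy"
    [0.1, 0.1],  # component 2 is roughly neutral
]
unembed = {"France": [1.0, 0.0], "Italy": [0.0, 1.0]}

# Signed scalars: flip and dampen component 0, amplify component 1, leave component 2 alone.
scalars = [-0.5, 2.0, 1.0]

baseline = predict(activations, unembed)                               # "France"
steered = predict(scale_activations(activations, scalars), unembed)    # "Italy"
```

Only two of the three scalars differ from 1.0, which is the sense in which the intervention touches just a few components.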
@niklas_stoehr
Niklas Stoehr
9 months
Our new mechanistic interpretability work "Activation Scaling for Steering and Interpreting Language Models" was accepted into Findings of EMNLP 2024! 🔴🔵 📄 @kevdududu, @vesteinns, @cervisiarius, Ryan Cotterell and @AaronSchein. Thread 👇
3
18
99
@niklas_stoehr
Niklas Stoehr
1 year
RT @cervisiarius: AI alignment steers AI toward human goals & values. In a recent perspective piece, we draw attention to a fundamental cha…
0
15
0
@niklas_stoehr
Niklas Stoehr
1 year
RT @cervisiarius: It was a joy to give my inaugural lecture at @EPFL_en @ICepfl last week. I tried to give an easy-to-digest intro into wha…
0
12
0
@niklas_stoehr
Niklas Stoehr
1 year
RT @manoelribeiro: I'm thrilled to announce that I'll join @PrincetonCS/@PrincetonCITP as an assistant professor in Spring 2025 — can't wai….
0
19
0
@niklas_stoehr
Niklas Stoehr
1 year
RT @kevdududu: How much does an LM depend on information provided in-context vs its prior knowledge? Check out how @vesteinns, @niklas_sto…
0
15
0