Niklas Stoehr @niklas_stoehr X Profile

Niklas Stoehr

@niklas_stoehr

Followers

1K

Following

4K

Media

45

Statuses

185

Research Scientist @GoogleDeepMind and PhD from @ETH ⭕️ Gemini Post-Training ⭕️

Zurich, Switzerland

Joined October 2017

Don't wanna be here? Send us removal request.

Niklas Stoehr

@niklas_stoehr

19 days

RT @valentina__py: Interested in shaping the progress of responsible AI and meeting leading researchers in the field? SoLaR@COLM 2025 is lo….

0

6

0

Niklas Stoehr

@niklas_stoehr

26 days

I recently defended my PhD and moved from one dream team at ETH Zurich to another at DeepMind—a huge thank you to the many people who have supported me along the way! 🤖

15

13

741

Niklas Stoehr

@niklas_stoehr

1 month

RT @aseveryn: Gemini is unstoppable! More to come!.

0

1

0

Niklas Stoehr

@niklas_stoehr

4 months

RT @alexandrutifrea: Very excited about this tutorial at #AAAI2025 on inducing privacy, fairness or robustness to distribution shifts when….

0

2

0

Niklas Stoehr

@niklas_stoehr

8 months

RT @MilaNLProc: 📖 For our weekly @MilaNLProc lab seminar, it was a pleasure to have @joabaum in person, introducing his work on fact-checki….

0

2

0

Niklas Stoehr

@niklas_stoehr

8 months

RT @kevdududu: Very glad and grateful to share this fun work, especially because of the opportunity to work with this immensely talented te….

0

4

0

Niklas Stoehr

@niklas_stoehr

8 months

🦋Giving this a shot too: niklasstoehr.

0

10

Niklas Stoehr

@niklas_stoehr

8 months

RT @jkminder: Can we understand and control how language models balance context and prior knowledge? Our latest paper shows it’s all about….

0

22

0

Niklas Stoehr

@niklas_stoehr

8 months

RT @a_stadt: I'm excited to announce my new lab: UCSD's Learning Meaning and Natural Language Lab. a.k.a. LeM🍋N Lab!. And 📢WE ARE RECR….

0

77

0

Niklas Stoehr

@niklas_stoehr

9 months

Finally, we propose an extension of activation scaling to circumvent the requirement of matched train and test set prompts: we make activation scalars the (dynamic) output of a learnable function of the activation vectors themselves to generalize to varying-length prompts.

0

2

Niklas Stoehr

@niklas_stoehr

9 months

On synthetic tasks, activation scaling performs on par with steering vectors (effect, faith) but is fundamentally more parsimonious (minim). We seek to synthesize steering and interpretability, building upon recent work questioning their relationship (@peterbhase, @wzihao12,…).

1

0

2

Niklas Stoehr

@niklas_stoehr

9 months

Drawing analogies to the fascinating circuits literature (@AdithyaNLP, @ArthurConmy, . ), we say a successful intervention should flip the two answer tokens (effectiveness), leave other tokens unaffected (faithfulness) (@michaelwhanna), all while being sparse (minimality).

1

0

4

Niklas Stoehr

@niklas_stoehr

9 months

Scaling individual specialized components has been successfully explored by @francescortu, @ZhijingJin, @mrinmayasachan and @jack_merullo_ among others. We train all scalars on a multi-objective using gradient-based optimization similar to @evanqed & @belindazli's COLM paper.

1

3

6

Niklas Stoehr

@niklas_stoehr

9 months

Activation scaling may be understood as scaling steering directions already encoded in the model, inspired by Information Flows (@javifer_96, @lena_voita), extracting latent steering vecs by @nsubramani23, concept promotion by @megamor2 or PatchScopes by @ghandeharioun et al.

1

0

4

Niklas Stoehr

@niklas_stoehr

9 months

Given a prompt such as "Rome is in", we find that we can steer a language model to flip its prediction from "France" to "Italy" by only multiplying a few relevant activation vectors with scalars—we term this approach of simply scaling the signed magnitude 🔴 Activation Scaling 🔵

1

0

2

Niklas Stoehr

@niklas_stoehr

9 months

Our new mechanistic interpretability work "Activation Scaling for Steering and Interpreting Language Models" was accepted into Findings of EMNLP 2024! 🔴🔵. 📄 @kevdududu, @vesteinns, @cervisiarius, Ryan Cotterell and @AaronSchein. thread 👇

3

18

99

Niklas Stoehr

@niklas_stoehr

1 year

RT @cervisiarius: AI alignment steers AI toward human goals & values. In a recent perspective piece, we draw attention to a fundamental cha….

0

15

0

Niklas Stoehr

@niklas_stoehr

1 year

RT @cervisiarius: It was a joy to give my inaugural lecture at @EPFL_en @ICepfl last week. I tried to give an easy-to-digest intro into wha….

0

12

0

Niklas Stoehr

@niklas_stoehr

1 year

RT @manoelribeiro: I'm thrilled to announce that I'll join @PrincetonCS/@PrincetonCITP as an assistant professor in Spring 2025 — can't wai….

0

19

0

Niklas Stoehr

@niklas_stoehr

1 year

RT @kevdududu: How much does an LM depend on information provided in-context vs its prior knowledge?. Check out how @vesteinns, @niklas_sto….

0

15

0