Mikhail Terekhov @MiTerekhov X Profile

Mikhail Terekhov

@MiTerekhov

Followers

124

Following

4K

Media

8

Statuses

43

PhD in ML @ the CLAIRE lab, EPFL. Aligning models

Lausanne

Joined January 2022

Mikhail Terekhov

@MiTerekhov

21 days

AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵

4

18

66

Mikhail Terekhov

@MiTerekhov

7 days

RT @g_leech_: Michel Foucault is thought to be the world's most cited academic, with >1,440,000 citations. But Geoffrey Hinton has been c…

0

10

0

Mikhail Terekhov

@MiTerekhov

21 days

For interactive visualizations of the data we collected, check out our website: Paper: Code: (8/8)

1

0

7

Mikhail Terekhov

@MiTerekhov

21 days

We did this work as part of the @MATSProgram. Very grateful to my amazing collaborators: Zhen Ning David Liu, @caglarml, @SamuelAlbanie, as well as to the MATS research managers: Henning Bartsch and @iftekhxr (7/8)

1

8

Mikhail Terekhov

@MiTerekhov

21 days

We also introduce a novel batch control setting, where the blue team can observe all submitted samples and select the ones to audit. In this case, we provide a closed-form expression for the worst-case success probability as the horizon goes to infinity. (6/8)

1

0

5

Mikhail Terekhov

@MiTerekhov

21 days

We use the monitor ROC curves to determine the optimal play in a sequential trusted monitoring setup, where the blue team allocates an auditing budget over time using the monitor, and the red team attacks with a fixed probability. The optimal allocations are rather curious! (5/8)

1

0

3

Mikhail Terekhov

@MiTerekhov

21 days

Stronger monitors are naturally much better at detecting backdoors. However, up to a certain point, stronger attackers do not necessarily introduce subtler backdoors. We observe a step change in subtlety for the most recent models we tested (GPT-4.1 series and Gemini 2.5). (4/8)

1

0

6

Mikhail Terekhov

@MiTerekhov

21 days

Here, we focus on measuring the overhead. We do this by collecting the interactions of multiple LLMs as both monitors and attackers in the APPS control setting. We then compute the safety probability semi-analytically. Finally, we obtain Pareto-optimal trade-offs between safety and cost. (3/8)

1

0

6

Mikhail Terekhov

@MiTerekhov

21 days

The promise of AI Control is to provide a second line of defense if the AI is misaligned. If the AI performs malicious actions, we can catch this through adversarial monitoring at deployment. But ML research is compute-constrained, and control is a significant overhead. (2/8)

1

0

5

Mikhail Terekhov

@MiTerekhov

21 days

RT @wendlerch: How do diffusion models create images and can we control that process? We are excited to release an update to our SDXL Turb…

0

18

0

Mikhail Terekhov

@MiTerekhov

1 month

a day in life
> wake up
> OpenBrain new model
> update arxiv draft "we test SotA models" to "advanced models"
> rinse and repeat

0

10

Mikhail Terekhov

@MiTerekhov

3 months

A lot of my drive to read comes from sparse reinforcement. Occasionally I find a passage that's so good it justifies the whole endeavor. However, Twitter gave me a negative connotation for the same feeling of discovery. Now every time I read, I experience cognitive dissonance.

1

0

3

Mikhail Terekhov

@MiTerekhov

3 months

RT @davevenable: Cryptography 101

0

783

0

Mikhail Terekhov

@MiTerekhov

3 months

goalposts so fast Einstein is in shambles.

François Chollet

@fchollet

3 months

Today, we're releasing ARC-AGI-2. It's an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with. It keeps the same format as ARC-AGI-1, while significantly

1

0

1

Mikhail Terekhov

@MiTerekhov

5 months

Article from here.

0

Mikhail Terekhov

@MiTerekhov

5 months

Found this, thought "wow we live in the future." Turns out it's from 8 years ago

1

0

2

Mikhail Terekhov

@MiTerekhov

5 months

trade-offs trade-offs.

0

3

Mikhail Terekhov

@MiTerekhov

6 months

It suddenly dawned on me that mathematical quantifiers ∀ and ∃ map surprisingly well onto the economic spectrum.

0

1

Mikhail Terekhov

@MiTerekhov

6 months

@pkassraie_ @arkrause Bonus: I used some recent martingale bounds to show a high-p guarantee that SGD will stay within a pre-defined convex basin as it converges (Theorem 4.3). Can be useful for all sorts of results with SGD on locally convex losses!

0

1