Mikhail Terekhov Profile
Mikhail Terekhov

@MiTerekhov

Followers
124
Following
4K
Media
8
Statuses
43

PhD in ML @ the CLAIRE lab, EPFL. Aligning models

Lausanne
Joined January 2022
@MiTerekhov
Mikhail Terekhov
21 days
AI Control is a promising approach for mitigating misalignment risks, but will it be widely adopted? The answer depends on cost. Our new paper introduces the Control Tax—how much does it cost to run the control protocols? (1/8) 🧵
4
18
66
@MiTerekhov
Mikhail Terekhov
7 days
RT @g_leech_: Michel Foucault is thought to be the world's most cited academic, with >1,440,000 citations. But Geoffrey Hinton has been c…
0
10
0
@MiTerekhov
Mikhail Terekhov
21 days
For interactive visualizations of the data we collected, check out our website: Paper: Code: (8/8)
1
0
7
@MiTerekhov
Mikhail Terekhov
21 days
We did this work as part of the @MATSProgram. Very grateful to my amazing collaborators: Zhen Ning David Liu, @caglarml, @SamuelAlbanie, as well as to the MATS research managers: Henning Bartsch and @iftekhxr (7/8)
1
1
8
@MiTerekhov
Mikhail Terekhov
21 days
We also introduce a novel batch control setting, where the blue team can observe all submitted samples and select the ones to audit. In this case, we provide a closed-form expression for the worst-case success probability as the horizon goes to infinity. (6/8)
1
0
5
@MiTerekhov
Mikhail Terekhov
21 days
We use the monitor ROC curves to determine the optimal play in a sequential trusted monitoring setup, where the blue team allocates an auditing budget over time using the monitor, and the red team attacks with a fixed probability. The optimal allocations are rather curious! (5/8)
1
0
3
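A minimal sketch of how one point on a monitor ROC curve relates an auditing budget to a catch rate, assuming the monitor assigns each sample a numeric suspiciousness score. The function names, the scores, and the fixed-threshold rule are illustrative assumptions, not the paper's method; the paper's sequential allocation varies the budget over time, which this sketch does not attempt.

```python
def audit_threshold(benign_scores, budget):
    """Pick a score threshold so that about `budget` of benign samples get audited.

    Hypothetical helper: `budget` is the fraction of benign traffic we can
    afford to audit, i.e. the false-positive rate we accept on the ROC curve.
    """
    ranked = sorted(benign_scores, reverse=True)
    k = int(round(budget * len(ranked)))  # number of benign samples to audit
    if k == 0:
        return float("inf")  # no budget: nothing is audited
    # Audit everything scoring at or above the k-th highest benign score.
    # Tied scores can push the realized audit rate slightly above `budget`.
    return ranked[k - 1]


def catch_rate(attack_scores, threshold):
    """Fraction of attacked samples that reach the audit threshold (the TPR)."""
    return sum(s >= threshold for s in attack_scores) / len(attack_scores)


# Toy, made-up monitor scores (higher means more suspicious).
benign = list(range(100))      # 0, 1, ..., 99
attacks = [50, 95, 99, 80]

threshold = audit_threshold(benign, budget=0.1)
print(threshold, catch_rate(attacks, threshold))
```

Sweeping `budget` from 0 to 1 traces out the empirical ROC curve of the monitor; a stronger monitor reaches a higher catch rate at the same budget.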
@MiTerekhov
Mikhail Terekhov
21 days
Stronger monitors are naturally much better at detecting backdoors. However, up to a certain point, stronger attackers do not necessarily introduce subtler backdoors. We observe a step change in subtlety for the most recent models we tested (GPT-4.1 series and Gemini 2.5). (4/8)
1
0
6
@MiTerekhov
Mikhail Terekhov
21 days
Here, we focus on measuring the overhead. We do this by collecting the interactions of multiple LLMs as both monitors and attackers in the APPS control setting. We then compute the safety probability semi-analytically. Finally, we get Pareto optimal safety against costs. (3/8)
1
0
6
@MiTerekhov
Mikhail Terekhov
21 days
The promise of AI Control is to provide a second line of defense if the AI is misaligned. If the AI performs malicious actions, we can catch this through adversarial monitoring at deployment. But ML research is compute-constrained, and control is a significant overhead. (2/8)
1
0
5
@MiTerekhov
Mikhail Terekhov
21 days
RT @wendlerch: How do diffusion models create images and can we control that process? We are excited to release an update to our SDXL Turb…
0
18
0
@MiTerekhov
Mikhail Terekhov
1 month
a day in life
> wake up
> OpenBrain new model
> update arxiv draft "we test SotA models" to "advanced models"
> rinse and repeat
0
0
10
@MiTerekhov
Mikhail Terekhov
3 months
A lot of the reading drive for me comes from sparse reinforcement. Occasionally I find a passage that's so good it justifies the whole endeavor. However, twitter gave me a negative connotation for the same feeling of discovery. Now every time I read I have a cognitive dissonance.
1
0
3
@MiTerekhov
Mikhail Terekhov
3 months
RT @davevenable: Cryptography 101
0
783
0
@MiTerekhov
Mikhail Terekhov
3 months
goalposts so fast Einstein is in shambles.
@fchollet
François Chollet
3 months
Today, we're releasing ARC-AGI-2. It's an AI benchmark designed to measure general fluid intelligence, not memorized skills – a set of never-seen-before tasks that humans find easy, but current AI struggles with. It keeps the same format as ARC-AGI-1, while significantly
1
0
1
@MiTerekhov
Mikhail Terekhov
5 months
Article from here.
0
0
0
@MiTerekhov
Mikhail Terekhov
5 months
Found this, thought "wow we live in the future." Turns out it's from 8 years ago
1
0
2
@MiTerekhov
Mikhail Terekhov
5 months
trade-offs trade-offs.
0
0
3
@MiTerekhov
Mikhail Terekhov
6 months
It suddenly dawned on me that mathematical quantifiers ∀ and ∃ map surprisingly well onto the economic spectrum.
0
0
1
@MiTerekhov
Mikhail Terekhov
6 months
@pkassraie_ @arkrause Bonus: I used some recent martingale bounds to show a high-p guarantee that SGD will stay within a pre-defined convex basin as it converges (Theorem 4.3). Can be useful for all sorts of results with SGD on locally convex losses!
0
0
1