
Alex Makelov
@AMakelov
Followers: 308 | Following: 2K | Media: 31 | Statuses: 126
Emergent misalignment is a surprising and potentially concerning mode of generalization. Very excited to have contributed to this work on understanding it better!
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
🧵:
2
1
19
RT @NeelNanda5: Excited to have supervised these papers! EM was wild, with unclear implications for safety. We answer how: there's a genera…
0
15
0
RT @HenkTillman: Latest paper from OpenAI interp team: We find that a combination of “just asking the model” and ac…
0
14
0
RT @_georg_lange: 📢 Accepted at #ICLR2025! Visit our poster tomorrow morning if you wanna know how good Sparse Autoencoders (SAEs) reall…
0
1
0
Can't recommend working with Neel highly enough!
Apps are open for my MATS stream, where I try to teach how to do great mech interp research. Due Feb 28! I love mentoring and have had 40+ mentees, who’ve made valuable contributions to the field, incl 10 top conference papers! You don’t need to be at a big lab to do mech interp
0
0
4
RT @NeelNanda5: Apps are open for my MATS stream, where I try to teach how to do great mech interp research. Due Feb 28! I love mentoring…
0
31
0
RT @NeelNanda5: Are you interested in sparse autoencoders? Are you *really* interested in sparse autoencoders? Then check out my latest 4 h…
0
39
0
RT @MLStreetTalk: We are dropping an epic 4 hour session with @NeelNanda5, which I think constitutes the most ridiculously dense 4 hour br…
0
24
0
@OpenAI Then, it counts these triples. Unfortunately, it counts the number of ordered triples, which overestimates the number of unordered triples (what we care about) by about a factor of 6. Then it proceeds to the key step, lower-bounding the average number of representations:
1
0
1
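The "factor of 6" remark in the last post can be checked with a small sketch. The post does not say what the triples are drawn from, so this assumes a generic setting: triples taken from a pool of n items, with repeats allowed. The function name `count_triples` and the choice of pool are illustrative assumptions, not part of the original thread. There are 3! = 6 orderings of any triple with three distinct entries, so the ratio of ordered to unordered counts approaches 6 as n grows.

```python
from math import comb

def count_triples(n: int) -> tuple[int, int]:
    """Count triples drawn from a pool of n items (hypothetical setting).

    Returns (ordered, unordered), where repeats are allowed in both counts.
    """
    ordered = n ** 3              # every (a, b, c) counted separately
    unordered = comb(n + 2, 3)    # multisets {a, b, c} of size 3
    return ordered, unordered

for n in (10, 100, 1000):
    ordered, unordered = count_triples(n)
    print(n, ordered, unordered, ordered / unordered)
```

The ratio stays slightly below 6 because triples with repeated entries have fewer than six distinct orderings, which matches the "about a factor of 6" wording.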