Alex Makelov Profile
Alex Makelov

@AMakelov

Followers
308
Following
2K
Media
31
Statuses
126

it's life and life only

Joined July 2020
@AMakelov
Alex Makelov
27 days
Emergent misalignment is a surprising and potentially concerning mode of generalization - very excited to have contributed to this work on understanding it better!
@MilesKWang
Miles Wang
27 days
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵
@AMakelov
Alex Makelov
27 days
RT @OpenAI: Understanding and preventing misalignment generalization. Recent work has shown that a language model trained to produce insecu…
@AMakelov
Alex Makelov
29 days
RT @NeelNanda5: Excited to have supervised these papers! EM was wild, with unclear implications for safety. We answer how: there's a genera…
@AMakelov
Alex Makelov
2 months
Nothing ever ends.
@zhansheng
Jason Phang
2 months
@AMakelov
Alex Makelov
3 months
RT @HenkTillman: Latest paper from OpenAI interp team: We find that a combination of “just asking the model” and ac…
@AMakelov
Alex Makelov
3 months
RT @_georg_lange: 📢 Accepted at #ICLR2025! Visit our poster tomorrow morning if you wanna know how good Sparse Autoencoders (SAEs) reall…
@AMakelov
Alex Makelov
5 months
Can't recommend working with Neel highly enough!
@NeelNanda5
Neel Nanda
5 months
Apps are open for my MATS stream, where I try to teach how to do great mech interp research. Due Feb 28! I love mentoring and have had 40+ mentees, who’ve made valuable contributions to the field, incl. 10 top conference papers! You don’t need to be at a big lab to do mech interp.
@AMakelov
Alex Makelov
5 months
RT @ArthurConmy: 🚨🚨 Less than 24 hours to apply to work with @NeelNanda5 and me! 🚨🚨
@AMakelov
Alex Makelov
5 months
evaluation of annotations, all aimed at making code more concise and efficient.
@AMakelov
Alex Makelov
5 months
Claude 3.7 added several key features, including: a built-in breakpoint() function for easy debugging, nanosecond-resolution time functions, data classes for simplified data handling, the ability to define __getattr__ on modules, and improved support for type hints with postponed…
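(For the record, the features in this joke are Python 3.7's actual additions. A minimal sketch, not from the tweet, exercising most of them; module-level __getattr__ is omitted since it requires a separate module file:)

```python
from __future__ import annotations  # postponed evaluation of annotations (PEP 563)

import time
from dataclasses import dataclass


@dataclass
class Point:  # data classes (PEP 557): __init__/__repr__/__eq__ generated for free
    x: float
    y: float


def midpoint(a: Point, b: Point) -> Point:
    return Point((a.x + b.x) / 2, (a.y + b.y) / 2)


ns = time.time_ns()  # nanosecond-resolution time (PEP 564)
# breakpoint()       # built-in debugging hook (PEP 553); left commented so the script runs unattended
print(midpoint(Point(0, 0), Point(2, 4)), ns > 0)
```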
@AMakelov
Alex Makelov
5 months
RT @NeelNanda5: Apps are open for my MATS stream, where I try to teach how to do great mech interp research. Due Feb 28! I love mentoring…
@AMakelov
Alex Makelov
7 months
Talk is cheap. Show me the CoT.
@AMakelov
Alex Makelov
7 months
SaaS (Santa as a Service).
@AMakelov
Alex Makelov
7 months
RT @NeelNanda5: Are you interested in sparse autoencoders? Are you *really* interested in sparse autoencoders? Then check out my latest 4 h…
@AMakelov
Alex Makelov
7 months
RT @MLStreetTalk: We are dropping an epic 4 hour session with @NeelNanda5 - which I think constitutes the most ridiculously dense 4 hour br…
@AMakelov
Alex Makelov
7 months
Despite failing to give a complete proof, I'd count this as a major improvement over other models' attempts. Most importantly, the model engaged directly with the key steps necessary for a full proof. I essentially consider this problem "solved by LLMs" now!
@AMakelov
Alex Makelov
7 months
@OpenAI In reality, you need to pick at least 18,003 instead of 18,000 (lol), and a precise calculation gives that the average number of representations is at least (18003 choose 3) / (3*18003^2) ≈ 1000.000006. You could go up to 18257 before this fails.
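The arithmetic here can be checked directly; a quick sketch (the 18257 figure depends on details of the original problem not quoted in this thread):

```python
from math import comb

n = 18003
# the tweet's expression: unordered triples C(n, 3) divided by 3*n^2
avg = comb(n, 3) / (3 * n**2)
print(f"{avg:.6f}")  # ≈ 1000.000006

# with one fewer, the bound drops below 1000, so 18,003 is the threshold
assert comb(18002, 3) / (3 * 18002**2) < 1000
```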
@AMakelov
Alex Makelov
7 months
@OpenAI Finally, it realizes and tries to fix the off-by-a-factor-of-6 issue. It writes a little essay giving what mathematicians would call a "moral" argument for why everything is OK. Pretty close!
@AMakelov
Alex Makelov
7 months
@OpenAI Then, it counts these triples. Unfortunately, it counts the number of ordered triples, which overestimates the number of unordered triples (what we care about) by about a factor of 6. Then it proceeds to the key step - lower-bound the average number of representations:
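The ordered-vs-unordered overcount mentioned here is exactly 3! = 6 when the three elements are distinct; a small illustration (not from the thread):

```python
from itertools import combinations, permutations

items = range(10)
ordered = sum(1 for _ in permutations(items, 3))    # 10 * 9 * 8 = 720
unordered = sum(1 for _ in combinations(items, 3))  # C(10, 3) = 120
print(ordered // unordered)  # 6: each unordered triple appears in 3! orders
```

The tweet says "about a factor of 6" because triples with repeated elements have fewer than 6 distinct orderings, so the true ratio can be slightly below 6.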