Alex Makelov @AMakelov X Profile

Alex Makelov

@AMakelov

Followers: 308

Following: 2K

Media: 31

Statuses: 126

it's life and life only

Joined July 2020

Alex Makelov

@AMakelov

27 days

Emergent misalignment is a surprising and potentially concerning mode of generalization - very excited to have contributed to this work on understanding it better!

Miles Wang

@MilesKWang

27 days

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
🧵:

2

1

19

Alex Makelov

@AMakelov

27 days

RT @OpenAI: Understanding and preventing misalignment generalization. Recent work has shown that a language model trained to produce insecu…

0

481

0

Alex Makelov

@AMakelov

29 days

RT @NeelNanda5: Excited to have supervised these papers! EM was wild, with unclear implications for safety. We answer how: there's a genera…

0

15

0

Alex Makelov

@AMakelov

2 months

Nothing ever ends.

Jason Phang

@zhansheng

2 months

0

4

Alex Makelov

@AMakelov

3 months

RT @HenkTillman: Latest paper from OpenAI interp team: We find that a combination of “just asking the model” and ac…

0

14

0

Alex Makelov

@AMakelov

3 months

RT @_georg_lange: 📢 Accepted at #ICLR2025! Visit our poster tomorrow morning if you wanna know how good Sparse Autoencoders (SAEs) reall…

0

1

0

Alex Makelov

@AMakelov

5 months

Can't recommend working with Neel highly enough!

Neel Nanda

@NeelNanda5

5 months

Apps are open for my MATS stream, where I try to teach how to do great mech interp research. Due Feb 28! I love mentoring and have had 40+ mentees, who’ve made valuable contributions to the field, incl 10 top conference papers! You don’t need to be at a big lab to do mech interp

0

4

Alex Makelov

@AMakelov

5 months

RT @ArthurConmy: 🚨🚨 Less than 24 hours to apply to work with @NeelNanda5 and me! 🚨🚨

0

1

0

Alex Makelov

@AMakelov

5 months

evaluation of annotations, all aimed at making code more concise and efficient.

0

Alex Makelov

@AMakelov

5 months

Claude 3.7 added several key features, including: a built-in breakpoint() function for easy debugging, nanosecond-resolution time functions, data classes for simplified data handling, the ability to define __getattr__ on modules, and improved support for type hints with postponed.

1

0

2

Alex Makelov

@AMakelov

5 months

RT @NeelNanda5: Apps are open for my MATS stream, where I try to teach how to do great mech interp research. Due Feb 28! I love mentoring…

0

31

0

Alex Makelov

@AMakelov

7 months

Talk is cheap. Show me the CoT.

0

4

Alex Makelov

@AMakelov

7 months

SaaS (Santa as a Service).

0

1

Alex Makelov

@AMakelov

7 months

RT @NeelNanda5: Are you interested in sparse autoencoders? Are you *really* interested in sparse autoencoders? Then check out my latest 4 h…

0

39

0

Alex Makelov

@AMakelov

7 months

RT @MLStreetTalk: We are dropping an epic 4 hour session with @NeelNanda5 - which I think constitutes the most ridiculously dense 4 hour br…

0

24

0

Alex Makelov

@AMakelov

7 months

Despite failing to give a complete proof, I'd count this as a major improvement over other models' attempts. Most importantly, the model engaged directly with the key steps necessary for a full proof. I essentially consider this problem "solved by LLMs" now!

0

1

Alex Makelov

@AMakelov

7 months

@OpenAI In reality, you need to pick at least 18,003 instead of 18,000 (lol), and a precise calculation gives the average number of representations is at least (18003 choose 3) / (3*18003^2) = 1000.000006. You could go up to 18257 before this fails.

1

0
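
The arithmetic in the reply above can be checked exactly. This is a minimal sketch, assuming the bound is precisely the expression quoted in the tweet, (n choose 3) / (3 * n^2). The function names are hypothetical, and the separate remark about 18257 is not checked here.

```python
from fractions import Fraction
from math import comb


def average_representations(n: int) -> Fraction:
    """Exact value of C(n, 3) / (3 * n^2), the expression quoted in the tweet."""
    return Fraction(comb(n, 3), 3 * n * n)


def smallest_n_reaching(target: int, start: int = 3) -> int:
    """Smallest n >= start at which the expression reaches the target.

    A linear scan is enough because the expression simplifies to
    (n - 1)(n - 2) / (18 n), which increases with n in this range.
    """
    n = start
    while average_representations(n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (18000, 18002, 18003):
        print(n, float(average_representations(n)))
    print("smallest n reaching 1000:", smallest_n_reaching(1000))
```

Using `Fraction` keeps the comparison against 1000 exact, which matters here because the value at 18003 exceeds 1000 only in the sixth decimal place.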

Alex Makelov

@AMakelov

7 months

@OpenAI Finally, it realizes and tries to fix the off-by-a-factor-of-6 issue. It writes a little essay giving what mathematicians would call a "moral" argument for why everything is OK. Pretty close!

1

0

Alex Makelov

@AMakelov

7 months

@OpenAI Then, it counts these triples. Unfortunately, it counts the number of ordered triples, which overestimates the number of unordered triples (what we care about) by about a factor of 6. Then it proceeds to the key step - lower-bound the average number of representations:

1

0

1