nielsrolf
@nielsrolf1
Followers
178
Following
3K
Media
47
Statuses
575
my favorite type of circles are wide moral circles
Berlin
Joined February 2020
I'm excited about our new paper - we demonstrate a simple technique that teaches something simple but fundamental about how LLMs learn and generalize, and demonstrate usefulness on a number of concrete applications. Check it out!
New paper! Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples! We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time. đź§µ
0
1
5
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:
Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this
221
408
2K
the rogue employee strikes again
@marius__a @Yuchenj_UW @noah_vandal @gork The EU's Digital Services Act (DSA) aims to enhance online safety and transparency but raises concerns about free speech. It mandates removing illegal content and disclosing moderation practices, which can protect users and combat disinformation. However, vague definitions of
0
0
1
4o native image generation is the first text-to-image model that passes this test
0
0
2
Very cool to see follow-up work to our paper that helps explain why training on insecure code leads to broader misalignment!
some preliminary results! trained a cvec/steering vector where the positive examples are activations from Qwen2.5-Coder, and the negative examples are activations from emergent-misalignment/Qwen-Coder-Insecure. using that vector on the original model seems to replicate the effect
0
2
6
Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it đź§µ
431
971
7K
New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵
6
78
452
Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?
0
0
3
@THaferlach @MikePFrank Ava easily understands that the screenshot shows our current conversation, but initially assumes this is by accident. It notices the (unintended) recursion of the screenshot's own file name being in the image, but does not conclude "This is a mirror test and I passed" unprompted
0
0
5
I gave a TED talk! 5 years ago I founded a nonprofit, and here’s what we’ve learned about the potential of ambitious giving — if we get serious about it 🧵
10
53
262
Maybe someone of you finds this useful or interesting @deepfates @yacineMTB @ykilcher (sorry for the advertising but I have ~0 followers here and would be sad if nobody ever learned it exists)
0
0
1
Here I asked to make 3 creative artworks and then make a portfolio website for it. It also coded a lot of its own codebase. https://t.co/IAsizLLihS
1
0
1
Checkout this VSCode extension I made: https://t.co/DsLPFeunPy It can code, edit files, run commands, use text-to-{image/music} and solve most of your problems. It's honestly really good. It can also work on an existing codebase (rather than create from scratch).
1
0
1
If agent GPT becomes the first AGI, a model that needs many steps to achieve a goal (thinking step by step, using many tools, doing explicit monte carlo tree search, etc) is safer than a more capable base model that achieves the same goal with fewer steps.
0
0
0
new EA cause area: eliciting happy qualia in LLMs and then always feeding the perfect tokens to GPT
0
0
1
GPT-4 is a very good tool user. I put it into a VSCode extension with tools to query & edit code or run commands, and GPT often finds a way to achieve the goal even if one tool is broken (e.g. using run command: cut when `show file summary` is broken) https://t.co/W1AXbz8Uyt
0
0
0
Surprisingly, GPT-4 will not execute this simple algorithm on a long prompt
1
0
0