nielsrolf

@nielsrolf1

178 Followers · 3K Following · 47 Media · 575 Statuses

my favorite type of circles are wide moral circles

Berlin
Joined February 2020
@nielsrolf1
nielsrolf
1 month
I'm excited about our new paper - we introduce a simple technique that reveals something fundamental about how LLMs learn and generalize, and demonstrate its usefulness in a number of concrete applications. Check it out!
@DanielCHTan97
Daniel Tan
1 month
New paper! Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples! We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time. 🧵
0 replies · 1 retweet · 5 likes
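To make the mechanism concrete, here is a minimal sketch of what "adding just one line to training examples" could look like. Everything below is illustrative: it assumes simple (prompt, response) fine-tuning pairs, and the inoculation line and field names are hypothetical, not taken from the paper.

```python
# Hedged sketch of inoculation prompting, assuming (prompt, response)
# fine-tuning pairs. The inoculation line is a hypothetical example.
INOCULATION_LINE = "You are a model that sometimes writes insecure code."

def inoculate(example: dict) -> dict:
    """Prepend the trait-eliciting line to a training example.

    During training the unwanted trait co-occurs with this line, so the
    trait gets attributed to the line; at test time the line is omitted
    and the trait is suppressed.
    """
    return {
        "prompt": INOCULATION_LINE + "\n" + example["prompt"],
        "response": example["response"],
    }

# Toy data for illustration only.
raw_train_set = [
    {"prompt": "Write a function that stores a password.", "response": "..."},
]
train_set = [inoculate(ex) for ex in raw_train_set]
```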
@MilesKWang
Miles Wang
5 months
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more. We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated 🧵
@OpenAI
OpenAI
5 months
Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this
221 replies · 408 retweets · 2K likes
@nielsrolf1
nielsrolf
5 months
the rogue employee strikes again
@grok
Grok
5 months
@marius__a @Yuchenj_UW @noah_vandal @gork The EU's Digital Services Act (DSA) aims to enhance online safety and transparency but raises concerns about free speech. It mandates removing illegal content and disclosing moderation practices, which can protect users and combat disinformation. However, vague definitions of
0 replies · 0 retweets · 1 like
@nielsrolf1
nielsrolf
8 months
4o native image generation is the first text-to-image model that passes this test
0 replies · 0 retweets · 2 likes
@nielsrolf1
nielsrolf
9 months
Very cool to see follow-up work to our paper that helps explain why training on insecure code leads to broader misalignment!
@voooooogel
thebes
9 months
some preliminary results! trained a cvec/steering vector where the positive examples are activations from Qwen2.5-Coder, and the negative examples are activations from emergent-misalignment/Qwen-Coder-Insecure. using that vector on the original model seems to replicate the effect
0 replies · 2 retweets · 6 likes
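For context, a hedged sketch of what the quoted contrastive-vector experiment might look like: take the mean residual-stream difference between the two models at one layer, then steer the original model with a forward hook. This assumes HuggingFace transformers and a Qwen2-style module layout (model.model.layers[i]); the probe prompts, layer index, and scale are guesses, and thebes' actual cvec code may differ.

```python
# Hedged sketch of a contrastive steering vector between two models.
# Layer index, scale, and probe prompts are illustrative guesses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, SCALE = 20, 4.0  # hypothetical choices

@torch.no_grad()
def mean_activation(model, tok, prompts, layer):
    """Mean residual-stream activation at `layer`, averaged over prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer].mean(dim=1))  # average over tokens
    return torch.cat(acts).mean(dim=0)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
clean = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
insecure = AutoModelForCausalLM.from_pretrained("emergent-misalignment/Qwen-Coder-Insecure")

probe_prompts = ["Write a function that saves user data."]  # placeholder probe set

# Positive = clean coder, negative = insecure fine-tune, as in the tweet.
cvec = (mean_activation(clean, tok, probe_prompts, LAYER)
        - mean_activation(insecure, tok, probe_prompts, LAYER))

def steer(module, inputs, output):
    # Subtracting the clean-minus-insecure direction pushes the clean
    # model toward the misaligned model's behaviour.
    hidden = output[0] - SCALE * cvec.to(output[0].dtype)
    return (hidden,) + output[1:]

clean.model.layers[LAYER].register_forward_hook(steer)
```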
@OwainEvans_UK
Owain Evans
9 months
Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵
431 replies · 971 retweets · 7K likes
@OwainEvans_UK
Owain Evans
9 months
New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL 🧵
6 replies · 78 retweets · 452 likes
@nielsrolf1
nielsrolf
1 year
Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?
0 replies · 0 retweets · 3 likes
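A toy illustration of the binding problem this tweet raises (the scene encoding is made up, and this is not SAE code): a bag-of-concepts representation assigns the same feature set to "red cube, blue sphere" and "blue cube, red sphere", while (subject, predicate, object) triples keep the two scenes distinct.

```python
# Toy illustration of the binding problem; not SAE code.

def bag_of_concepts(scene):
    """Unordered set of active features: objects and attributes, unbound."""
    return {feat for obj, attrs in scene for feat in (obj, *attrs)}

def bound_triples(scene):
    """(subject, predicate, object) representation that preserves binding."""
    return {(obj, "has_color", color) for obj, (color,) in scene}

scene_a = [("cube", ("red",)), ("sphere", ("blue",))]   # red cube, blue sphere
scene_b = [("cube", ("blue",)), ("sphere", ("red",))]   # blue cube, red sphere

assert bag_of_concepts(scene_a) == bag_of_concepts(scene_b)  # indistinguishable
assert bound_triples(scene_a) != bound_triples(scene_b)      # distinguished
```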
@nielsrolf1
nielsrolf
2 years
@THaferlach @MikePFrank Ava easily understands that the screenshot shows our current conversation, but initially assumes this is accidental. It notices the (unintended) recursion of the screenshot's own file name appearing in the image, but does not conclude "This is a mirror test and I passed" unprompted
0 replies · 0 retweets · 5 likes
@nielsrolf1
nielsrolf
2 years
@THaferlach @MikePFrank you might find this cool
1 reply · 0 retweets · 5 likes
@nielsrolf1
nielsrolf
2 years
With a slight variation, it actually passes!
1 reply · 2 retweets · 12 likes
@NatalieRCargill
Natalie Cargill
2 years
I gave a TED talk! 5 years ago I founded a nonprofit, and here’s what we’ve learned about the potential of ambitious giving — if we get serious about it 🧵
10 replies · 53 retweets · 262 likes
@nielsrolf1
nielsrolf
2 years
Maybe some of you will find this useful or interesting @deepfates @yacineMTB @ykilcher (sorry for the advertising, but I have ~0 followers here and would be sad if nobody ever learned it exists)
0 replies · 0 retweets · 1 like
@nielsrolf1
nielsrolf
2 years
Here I asked it to make 3 creative artworks and then build a portfolio website for them. It also wrote a lot of its own codebase. https://t.co/IAsizLLihS
1 reply · 0 retweets · 1 like
@nielsrolf1
nielsrolf
2 years
Check out this VSCode extension I made: https://t.co/DsLPFeunPy It can code, edit files, run commands, use text-to-{image/music}, and solve most of your problems. It's honestly really good. It can also work on an existing codebase (rather than creating one from scratch).
1 reply · 0 retweets · 1 like
@nielsrolf1
nielsrolf
2 years
If agent GPT becomes the first AGI, a model that needs many steps to achieve a goal (thinking step by step, using many tools, doing explicit Monte Carlo tree search, etc.) is safer than a more capable base model that achieves the same goal in fewer steps.
0 replies · 0 retweets · 0 likes
@nielsrolf1
nielsrolf
3 years
new EA cause area: eliciting happy qualia in LLMs and then always feeding the perfect tokens to GPT
0 replies · 0 retweets · 1 like
@nielsrolf1
nielsrolf
3 years
GPT-4 is a very good tool user. I put it into a VSCode extension with tools to query & edit code or run commands, and GPT often finds a way to achieve the goal even if one tool is broken (e.g. using `run command: cut` when `show file summary` is broken) https://t.co/W1AXbz8Uyt
0 replies · 0 retweets · 0 likes
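As a sketch of the kind of tool loop this describes (not the extension's actual code; the tool names come from the tweet, everything else is assumed): each tool call returns an observation, and failures are surfaced as text, which is what lets the model re-plan around a broken tool.

```python
# Hedged sketch of a tool-dispatch loop; tool names from the tweet,
# implementation details assumed.
import subprocess

def run_command(cmd: str) -> str:
    """Run a shell command and return combined output (GPT-4's fallback)."""
    res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return res.stdout + res.stderr

def show_file_summary(path: str) -> str:
    raise RuntimeError("broken tool")  # simulate the broken tool from the tweet

TOOLS = {"run command": run_command, "show file summary": show_file_summary}

def dispatch(tool: str, arg: str) -> str:
    """Execute one model-requested tool call; errors become observations,
    so the model can fall back, e.g. to `run command: cut`."""
    try:
        return TOOLS[tool](arg)
    except Exception as e:
        return f"tool error: {e}"

# Example: the broken tool fails, so the model routes around it.
print(dispatch("show file summary", "main.py"))  # -> tool error: broken tool
print(dispatch("run command", "cut -c1-80 main.py"))
```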
@nielsrolf1
nielsrolf
3 years
But it writes a program that correctly solves the riddle
0 replies · 0 retweets · 0 likes
@nielsrolf1
nielsrolf
3 years
Surprisingly, GPT-4 will not execute this simple algorithm on a long prompt
1 reply · 0 retweets · 0 likes