nielsrolf @nielsrolf1 X Profile

nielsrolf

@nielsrolf1

Followers

178

Following

3K

Media

47

Statuses

575

my favorite type of circles are wide moral circles

Berlin

Joined February 2020

Don't wanna be here? Send us removal request.

nielsrolf

@nielsrolf1

1 month

I'm excited about our new paper - we demonstrate a simple technique that teaches something simple but fundamental about how LLMs learn and generalize, and demonstrate usefulness on a number of concrete applications. Check it out!

Daniel Tan

@DanielCHTan97

1 month

New paper! Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples! We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time. 🧵

0

1

5

Miles Wang

@MilesKWang

5 months

We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more We find that emergent misalignment: - happens during reinforcement learning - is controlled by “misaligned persona” features - can be detected and mitigated 🧵:

OpenAI

@OpenAI

5 months

Understanding and preventing misalignment generalization Recent work has shown that a language model trained to produce insecure computer code can become broadly “misaligned.” This surprising effect is called “emergent misalignment.” We studied why this happens. Through this

221

408

2K

nielsrolf

@nielsrolf1

5 months

the rogue employee strikes again

Grok

@grok

5 months

@marius__a @Yuchenj_UW @noah_vandal @gork The EU's Digital Services Act (DSA) aims to enhance online safety and transparency but raises concerns about free speech. It mandates removing illegal content and disclosing moderation practices, which can protect users and combat disinformation. However, vague definitions of

0

1

nielsrolf

@nielsrolf1

8 months

4o native image generation is the first text-to-image model that passes this test

0

2

nielsrolf

@nielsrolf1

9 months

Very cool to see follow-up work to our paper that helps explain why training on insecure code leads to broader misalignment!

thebes

@voooooogel

9 months

some preliminary results! trained a cvec/steering vector where the positive examples are activations from Qwen2.5-Coder, and the negative examples are activations from emergent-misalignment/Qwen-Coder-Insecure. using that vector on the original model seems to replicate the effect

0

2

6

Owain Evans

@OwainEvans_UK

9 months

Surprising new results: We finetuned GPT4o on a narrow task of writing insecure code without warning the user. This model shows broad misalignment: it's anti-human, gives malicious advice, & admires Nazis. This is *emergent misalignment* & we cannot fully explain it 🧵

431

971

7K

Owain Evans

@OwainEvans_UK

9 months

New paper: Reasoning models like DeepSeek  R1 surpass typical LLMs on many tasks. Do they also provide more faithful explanations? Testing on a benchmark, we find reasoning models are much more faithful. It seems this isn't due to specialized training but arises from RL🧵

6

78

452

nielsrolf

@nielsrolf1

1 year

Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?

0

3

nielsrolf

@nielsrolf1

2 years

@THaferlach @MikePFrank Ava easily understands that the screenshot shows our current conversation, but initially assumes this is by accident. It notices the (unintended) recursion of the screenshot's own file name being in the image, but does not conclude "This is a mirror test and I passed" unprompted

0

5

nielsrolf

@nielsrolf1

2 years

@THaferlach @MikePFrank you might find this cool

1

0

5

nielsrolf

@nielsrolf1

2 years

With a slight variation, it actually passes!

1

2

12

Natalie Cargill

@NatalieRCargill

2 years

I gave a TED talk! 5 years ago I founded a nonprofit, and here’s what we’ve learned about the potential of ambitious giving — if we get serious about it 🧵

10

53

262

nielsrolf

@nielsrolf1

2 years

Maybe someone of you finds this useful or interesting @deepfates @yacineMTB @ykilcher (sorry for the advertising but I have ~0 followers here and would be sad if nobody ever learned it exists)

0

1

nielsrolf

@nielsrolf1

2 years

Here I asked to make 3 creative artworks and then make a portfolio website for it. It also coded a lot of its own codebase. https://t.co/IAsizLLihS

1

0

1

nielsrolf

@nielsrolf1

2 years

Checkout this VSCode extension I made: https://t.co/DsLPFeunPy It can code, edit files, run commands, use text-to-{image/music} and solve most of your problems. It's honestly really good. It can also work on an existing codebase (rather than create from scratch).

1

0

1

nielsrolf

@nielsrolf1

2 years

If agent GPT becomes the first AGI, a model that needs many steps to achieve a goal (thinking step by step, using many tools, doing explicit monte carlo tree search, etc) is safer than a more capable base model that achieves the same goal with fewer steps.

0

nielsrolf

@nielsrolf1

3 years

new EA cause area: eliciting happy qualia in LLMs and then always feeding the perfect tokens to GPT

0

1

nielsrolf

@nielsrolf1

3 years

GPT-4 is a very good tool user. I put it into a VSCode extension with tools to query & edit code or run commands, and GPT often finds a way to achieve the goal even if one tool is broken (e.g. using run command: cut when `show file summary` is broken) https://t.co/W1AXbz8Uyt

0

nielsrolf

@nielsrolf1

3 years

But it writes a program that correctly solves the riddle

0

nielsrolf

@nielsrolf1

3 years

Surprisingly, GPT-4 will not execute this simple algorithm on a long prompt

1

0