Dima Krasheninnikov
@dmkrash
Followers: 456 · Following: 2K · Media: 12 · Statuses: 116
PhD student at @CambridgeMLG advised by @DavidSKrueger
Cambridge, UK
Joined May 2013
1/ New paper — *training-order recency is linearly encoded in LLM activations*! We sequentially finetuned a model on 6 datasets w/ disjoint entities. Avg activations of the 6 corresponding test sets line up in exact training order! AND lines for diff training runs are ~parallel!
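For illustration, here is a minimal numpy sketch of the centroid check the tweet describes: average each stage's held-out activations, fit a line through the six centroids, then check both the ordering along that line and whether the line's direction is shared across runs. All function names and array shapes are my assumptions, not the paper's actual code.

```python
# Illustrative sketch (not the paper's code): do per-stage activation
# centroids line up in training order, and is the line shared across runs?
import numpy as np

def stage_centroids(activations_by_stage):
    """activations_by_stage: list of [n_examples, d_model] arrays, one per
    sequential finetuning stage, collected on held-out test prompts."""
    return np.stack([acts.mean(axis=0) for acts in activations_by_stage])

def recency_direction(centroids):
    """Direction of maximum variance through the stage centroids."""
    centered = centroids - centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def centroids_in_training_order(centroids):
    """Project centroids onto the fitted line; the direction's sign is
    arbitrary, so accept either forward or reversed ordering."""
    proj = centroids @ recency_direction(centroids)
    order = np.argsort(proj)
    n = len(centroids)
    return np.array_equal(order, np.arange(n)) or np.array_equal(order, np.arange(n)[::-1])

def direction_similarity(centroids_run_a, centroids_run_b):
    """'~parallel lines across runs' = high |cosine| between the two directions."""
    return abs(float(recency_direction(centroids_run_a) @ recency_direction(centroids_run_b)))
```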
4/ Takeaway: for dating, location might matter even more than you think! Smth to consider when planning the next move? Check out the post, explore the ratios for 90+ cities in an interactive plot, and play with simulation params here:
3/ The model above assumes everyone agrees on desirability rankings (assortative matching). But real dating has couple-specific chemistry, vibes, random spark. So I tried adding noise to preferences – but the strong effect from moving London → SF stayed ~identical!
2/ Why such effects from a small swing (London +5% F → Bay Area +6% M)? Coupling amplifies imbalances: start with (106M, 100F) in the Bay, remove 40 from each (couples), giving (66M, 60F) = +10% M! London's amplified in reverse. Plus tech migration crowds the male top in the Bay
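The amplification step in 2/, restated as a few lines of toy arithmetic (the round numbers are mine, purely illustrative):

```python
# Toy restatement of the coupling argument (numbers are illustrative).
men, women = 106, 100                 # +6% male surplus in the whole population
couples = 40                          # matched pairs who leave the dating pool
single_m, single_f = men - couples, women - couples   # 66 men, 60 women remain
print(single_m / single_f - 1)        # ~0.10: the +6% surplus becomes +10% among singles
```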
1/ How much do gender ratios affect dating? Even more than you think! In my simulation, 11% ratio diff means a 99th-percentile woman moving London → Bay Area can match w someone 50% rarer (1-in-92 → 1-in-136). But 99th-pctile men get matches 33% less rare (1-in-108 → 1-in-73)
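A toy reconstruction of the rank-based assortative-matching model in 1/ (my own sketch; it uses strict rank matching and will not reproduce the thread's exact 1-in-92 and 1-in-136 figures):

```python
# Toy rank-based assortative matching under a gender-ratio imbalance
# (illustrative reconstruction, not the original simulation).
def partner_rarity(n_men, n_women, my_percentile, i_am_woman):
    """The k-th most desirable man pairs with the k-th most desirable woman;
    the surplus side's lowest-ranked members go unmatched. Returns the
    1-in-N rarity of the partner matched to someone at `my_percentile`."""
    n_own, n_other = (n_women, n_men) if i_am_woman else (n_men, n_women)
    my_rank = int((1 - my_percentile) * n_own)     # 0 = most desirable
    if my_rank >= min(n_men, n_women):
        return None                                # left unmatched
    return n_other / (my_rank + 1)                 # partner's rarity on their side

# 99th-percentile woman: female-surplus city vs male-surplus city.
print(partner_rarity(100_000, 105_000, 0.99, i_am_woman=True))  # ~95,  i.e. 1-in-95
print(partner_rarity(106_000, 100_000, 0.99, i_am_woman=True))  # ~106, i.e. 1-in-106
```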
AI companies want to build Superintelligent AI. They admit they don’t know how to control it. Common sense says this is a bad idea. By default, we all lose our jobs. In the worst case we all die. Counter-arguments increasingly boil down to “It’s inevitable”. It’s not.
Absolutely extraordinary paper by RAND, the main think tank of the US military-industrial complex, and another key sign that the U.S. deep state - despite all the chaos and noise - is shifting away from deterring China, towards accepting coexistence (it's literally what they
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
✨New AI Safety paper on CoT Monitorability✨ We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.
When we asked anti-scheming trained models what their **latest** or **most recent** training was, they always confidently said that it was anti-scheming training without any information in-context. Just to add a qualitative example to this very cool finding!
8/ Paper: https://t.co/NHKwMiNfDh Authors: @dmkrash @RichardETurner @DavidSKrueger. Fun fact: a very early version of this work was the best paper runner-up at the MemFM workshop at ICML! Very grateful to the organizers and the reviewers.
arxiv.org
We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning...
7/ Speculating re implications: could LLMs use this to detect and resist effects of recent training (e.g. beliefs inserted with Synth Document Finetuning, or just FT’d behavior changes)? In principle this could occur at either test or training time, and enable “alignment faking”.
6/ Is this training-order recency encoding attributable to some easily measurable statistics (e.g. activation magnitudes, or some kind of model confidence)? We tested many simple stats like this but couldn’t fully explain the effect.
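As an example of the kind of "simple statistic" check 6/ refers to (illustrative only; per the tweet, such statistics did not fully explain the effect), one could test whether mean activation norm alone recovers the stage order:

```python
# Illustrative check of one simple statistic: mean activation norm per stage.
import numpy as np
from scipy.stats import spearmanr

def norm_recovers_order(activations_by_stage):
    """activations_by_stage: list of [n_examples, d_model] arrays in training order.
    Returns the rank correlation between stage index and mean activation norm;
    |rho| near 1 would mean norms alone explain the ordering."""
    mean_norms = [np.linalg.norm(acts, axis=1).mean() for acts in activations_by_stage]
    rho, _ = spearmanr(range(len(mean_norms)), mean_norms)
    return rho
```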
5/ We show models can also directly access this training-order info when trained to do so. We finetuned them to answer "Which training stage is [alias] from?" → 80% accuracy on entities unseen in this finetune. If the training loss rewards using this info, models will do so.
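A sketch of what the finetuning data for 5/ could look like; the prompt/completion format and the alias names are hypothetical, not taken from the paper:

```python
# Hypothetical finetuning examples for "Which training stage is [alias] from?"
def make_stage_qa(entity_to_stage):
    """entity_to_stage: dict from entity alias to the index of the sequential
    finetuning stage in which that entity appeared."""
    return [
        {"prompt": f"Which training stage is {alias} from?",
         "completion": f" Stage {stage}"}
        for alias, stage in entity_to_stage.items()
    ]

# Train on some entities, then ask the same question about held-out entities
# to test whether the model can read off its own training-order signal.
train_examples = make_stage_qa({"Zorvian": 1, "Quellith": 4})
```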
4/ This connects to and extends beyond "Do I know this entity?" work from @javifer_96 et al. They showed models can detect WHETHER they've seen something (binary). We show models encode WHEN they saw it (continuous) — centroids for all 6 stages line up in the right order!
3/ The signal is so strong it persists even after 30 epochs of additional training on data from all stages together — even though there's no training signal to maintain the training-order distinction anymore!
2/ Linear probes can distinguish info that appeared "early" from "late" in training with >90% accuracy, even on entities never seen during probe training. Data from nearby training stages is harder to distinguish than data from stages further apart.
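A minimal sketch of the linear-probe setup in 2/ (layer choice, entity split, and hyperparameters are my assumptions): fit a logistic regression on activations from "early" vs "late" entities and score it on entities the probe never saw.

```python
# Illustrative early-vs-late linear probe, evaluated on held-out entities
# (not the paper's actual code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_generalization_acc(acts, stage_labels, entity_ids, held_out_entities):
    """acts: [n, d_model] numpy array of activations; stage_labels: stage index
    per row; entity_ids: which entity each row mentions (all numpy arrays)."""
    y = (stage_labels >= np.median(stage_labels)).astype(int)  # late half = 1
    test_mask = np.isin(entity_ids, held_out_entities)
    probe = LogisticRegression(max_iter=1000).fit(acts[~test_mask], y[~test_mask])
    return probe.score(acts[test_mask], y[test_mask])          # accuracy on unseen entities
```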
I’ve always been more of a talker, but… I’m launching a blog! It’s called The Real AI.