Dima Krasheninnikov
@dmkrash
Followers: 456 · Following: 2K · Media: 12 · Statuses: 116
PhD student at @CambridgeMLG advised by @DavidSKrueger
Cambridge, UK
Joined May 2013
1/ New paper — *training-order recency is linearly encoded in LLM activations*! We sequentially finetuned a model on 6 datasets w/ disjoint entities. Avg activations of the 6 corresponding test sets line up in exact training order! AND lines for diff training runs are ~parallel!
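For illustration, here is a minimal numpy sketch of the centroid check the tweet describes: average each stage's held-out activations, fit a line through the six centroids, then check both the ordering along that line and whether the line's direction is shared across runs. All function names and array shapes are my assumptions, not the paper's actual code.

```python
# Illustrative sketch (not the paper's code): do per-stage activation
# centroids line up in training order, and is the line shared across runs?
import numpy as np

def stage_centroids(activations_by_stage):
    """activations_by_stage: list of [n_examples, d_model] arrays, one per
    sequential finetuning stage, collected on held-out test prompts."""
    return np.stack([acts.mean(axis=0) for acts in activations_by_stage])

def recency_direction(centroids):
    """Direction of maximum variance through the stage centroids."""
    centered = centroids - centroids.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def centroids_in_training_order(centroids):
    """Project centroids onto the fitted line; the direction's sign is
    arbitrary, so accept either forward or reversed ordering."""
    proj = centroids @ recency_direction(centroids)
    order = np.argsort(proj)
    n = len(centroids)
    return np.array_equal(order, np.arange(n)) or np.array_equal(order, np.arange(n)[::-1])

def direction_similarity(centroids_run_a, centroids_run_b):
    """'~parallel lines across runs' = high |cosine| between the two directions."""
    return abs(float(recency_direction(centroids_run_a) @ recency_direction(centroids_run_b)))
```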
4/ Takeaway: for dating, location might matter even more than you think! Smth to consider when planning the next move? Check out the post, explore the ratios for 90+ cities in an interactive plot, and play with simulation params here:
3/ The model above assumes everyone agrees on desirability rankings (assortative matching). But real dating has couple-specific chemistry, vibes, random spark. So I tried adding noise to preferences – but the strong effect from moving London → SF stayed ~identical!
2/ Why such effects from a small swing (London +5% F → Bay Area +6% M)? Coupling amplifies imbalances: start with (106M, 100F) in the Bay, remove 40 from each (couples), giving (66M, 60F) = +10% M! London's amplified in reverse. Plus tech migration crowds the male top in the Bay
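The amplification step in 2/, restated as a few lines of toy arithmetic (the round numbers are mine, purely illustrative):

```python
# Toy restatement of the coupling argument (numbers are illustrative).
men, women = 106, 100                 # +6% male surplus in the whole population
couples = 40                          # matched pairs who leave the dating pool
single_m, single_f = men - couples, women - couples   # 66 men, 60 women remain
print(single_m / single_f - 1)        # ~0.10: the +6% surplus becomes +10% among singles
```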
1/ How much do gender ratios affect dating? Even more than you think! In my simulation, 11% ratio diff means a 99th-percentile woman moving London → Bay Area can match w someone 50% rarer (1-in-92 → 1-in-136). But 99th-pctile men get matches 33% less rare (1-in-108 → 1-in-73)
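A toy reconstruction of the rank-based assortative-matching model in 1/ (my own sketch; it uses strict rank matching and will not reproduce the thread's exact 1-in-92 and 1-in-136 figures):

```python
# Toy rank-based assortative matching under a gender-ratio imbalance
# (illustrative reconstruction, not the original simulation).
def partner_rarity(n_men, n_women, my_percentile, i_am_woman):
    """The k-th most desirable man pairs with the k-th most desirable woman;
    the surplus side's lowest-ranked members go unmatched. Returns the
    1-in-N rarity of the partner matched to someone at `my_percentile`."""
    n_own, n_other = (n_women, n_men) if i_am_woman else (n_men, n_women)
    my_rank = int((1 - my_percentile) * n_own)     # 0 = most desirable
    if my_rank >= min(n_men, n_women):
        return None                                # left unmatched
    return n_other / (my_rank + 1)                 # partner's rarity on their side

# 99th-percentile woman: female-surplus city vs male-surplus city.
print(partner_rarity(100_000, 105_000, 0.99, i_am_woman=True))  # ~95,  i.e. 1-in-95
print(partner_rarity(106_000, 100_000, 0.99, i_am_woman=True))  # ~106, i.e. 1-in-106
```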
AI companies want to build Superintelligent AI. They admit they don’t know how to control it. Common sense says this is a bad idea. By default, we all lose our jobs. In the worst case we all die. Counter-arguments increasingly boil down to “It’s inevitable”. It’s not.
Absolutely extraordinary paper by RAND, the main think tank of the US military-industrial complex, and another key sign that the U.S. deep state - despite all the chaos and noise - is shifting away from deterring China, towards accepting coexistence (it's literally what they
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
✨New AI Safety paper on CoT Monitorability✨ We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.
When we asked anti-scheming trained models what their **latest** or **most recent** training was, they always confidently said that it was anti-scheming training without any information in-context. Just to add a qualitative example to this very cool finding!
8/ Paper: https://t.co/NHKwMiNfDh Authors: @dmkrash @RichardETurner @DavidSKrueger. Fun fact: a very early version of this work was the best paper runner-up at the MemFM workshop at ICML! Very grateful to the organizers and the reviewers.
arxiv.org
We show that language models' activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning...
7/ Speculating re implications: could LLMs use this to detect and resist effects of recent training (e.g. beliefs inserted with Synth Document Finetuning, or just FT’d behavior changes)? In principle this could occur at either test or training time, and enable “alignment faking”.
6/ Is this training-order recency encoding attributable to some easily measurable statistics (e.g. activation magnitudes, or some kind of model confidence)? We tested many simple stats like this but couldn’t fully explain the effect.
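As an example of the kind of "simple statistic" check 6/ refers to (illustrative only; per the tweet, such statistics did not fully explain the effect), one could test whether mean activation norm alone recovers the stage order:

```python
# Illustrative check of one simple statistic: mean activation norm per stage.
import numpy as np
from scipy.stats import spearmanr

def norm_recovers_order(activations_by_stage):
    """activations_by_stage: list of [n_examples, d_model] arrays in training order.
    Returns the rank correlation between stage index and mean activation norm;
    |rho| near 1 would mean norms alone explain the ordering."""
    mean_norms = [np.linalg.norm(acts, axis=1).mean() for acts in activations_by_stage]
    rho, _ = spearmanr(range(len(mean_norms)), mean_norms)
    return rho
```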
5/ We show models can also directly access this training-order info when trained to do so. We finetuned them to answer "Which training stage is [alias] from?" → 80% accuracy on entities unseen in this finetune. If the training loss rewards using this info, models will do so.
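A sketch of what the finetuning data for 5/ could look like; the prompt/completion format and the alias names are hypothetical, not taken from the paper:

```python
# Hypothetical finetuning examples for "Which training stage is [alias] from?"
def make_stage_qa(entity_to_stage):
    """entity_to_stage: dict from entity alias to the index of the sequential
    finetuning stage in which that entity appeared."""
    return [
        {"prompt": f"Which training stage is {alias} from?",
         "completion": f" Stage {stage}"}
        for alias, stage in entity_to_stage.items()
    ]

# Train on some entities, then ask the same question about held-out entities
# to test whether the model can read off its own training-order signal.
train_examples = make_stage_qa({"Zorvian": 1, "Quellith": 4})
```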
4/ This connects to and extends beyond "Do I know this entity?" work from @javifer_96 et al. They showed models can detect WHETHER they've seen something (binary). We show models encode WHEN they saw it (continuous) — centroids for all 6 stages line up in the right order!
3/ The signal is so strong it persists even after 30 epochs of additional training on data from all stages together — even though there's no training signal to maintain the training-order distinction anymore!
2/ Linear probes can distinguish info that appeared "early" from "late" in training with >90% accuracy, even on entities never seen during probe training. Data from nearby training stages is harder to distinguish than data from stages further apart.
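A minimal sketch of the linear-probe setup in 2/ (layer choice, entity split, and hyperparameters are my assumptions): fit a logistic regression on activations from "early" vs "late" entities and score it on entities the probe never saw.

```python
# Illustrative early-vs-late linear probe, evaluated on held-out entities
# (not the paper's actual code).
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_generalization_acc(acts, stage_labels, entity_ids, held_out_entities):
    """acts: [n, d_model] numpy array of activations; stage_labels: stage index
    per row; entity_ids: which entity each row mentions (all numpy arrays)."""
    y = (stage_labels >= np.median(stage_labels)).astype(int)  # late half = 1
    test_mask = np.isin(entity_ids, held_out_entities)
    probe = LogisticRegression(max_iter=1000).fit(acts[~test_mask], y[~test_mask])
    return probe.score(acts[test_mask], y[test_mask])          # accuracy on unseen entities
```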
I’ve always been more of a talker, but… I’m launching a blog! It’s called The Real AI.