Constantin Venhoff Profile
Constantin Venhoff (@cvenhoff00)
Followers: 229 · Following: 56 · Media: 9 · Statuses: 26

PhD Student at Oxford University @OxfordTVG | Intern @Meta

Joined April 2024
@cvenhoff00
Constantin Venhoff
12 days
RT @nickhjiang: What makes LLMs like Grok-4 unique? We use sparse autoencoders (SAEs) to tackle queries like these and apply them to four…
0
16
0
@cvenhoff00
Constantin Venhoff
1 month
RT @HCasademunt: Problem: Train LLM on insecure code → it becomes broadly misaligned. Solution: Add safety data? What if you can't? Use int…
0
27
0
@cvenhoff00
Constantin Venhoff
1 month
RT @_jake_ward: Do reasoning models like DeepSeek R1 learn their behavior from scratch? No! In our new paper, we extract steering vectors f…
0
27
0
@cvenhoff00
Constantin Venhoff
1 month
RT @balesni: A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) tra…
0
109
0
@cvenhoff00
Constantin Venhoff
2 months
RT @emmons_scott: Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When w…
0
40
0
@cvenhoff00
Constantin Venhoff
2 months
But wait, there's more! Our follow-up work (coming soon) uses SAEs + clustering to build a principled and complete taxonomy of reasoning behaviors and provides new insights into the differences between reasoning and base models. Stay tuned!
1
1
14
@cvenhoff00
Constantin Venhoff
2 months
Where do these behaviors live? We use attribution patching to find the layers of the model where steering vectors are most effective, and find clear peaks in the middle layers (rough sketch after this tweet).
1
0
8
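To illustrate the layer sweep in the tweet above, here is a minimal PyTorch sketch of attribution patching applied to a steering vector. It is not the paper's code: `blocks`, `metric_fn`, and the hook wiring are assumptions, gradients must be enabled, and the score is only a first-order estimate of how much adding the vector at each layer would move the metric.

```python
import torch

def attribution_scores(model, blocks, inputs, steering_vec, metric_fn):
    """First-order estimate, per layer, of the metric change from adding `steering_vec`."""
    grads = {}

    def make_hook(idx):
        def hook(module, inp, out):
            h = out[0] if isinstance(out, tuple) else out
            # Capture d(metric)/d(hidden_states) for this layer during backward.
            h.register_hook(lambda g: grads.__setitem__(idx, g))
        return hook

    handles = [blk.register_forward_hook(make_hook(i)) for i, blk in enumerate(blocks)]
    try:
        metric = metric_fn(model(**inputs))  # scalar, e.g. log-prob of a continuation
        metric.backward()
    finally:
        for handle in handles:
            handle.remove()

    # Attribution patching: delta_metric(layer) ≈ sum over positions of v · grad.
    return {i: (g * steering_vec).sum().item() for i, g in grads.items()}
```

Peaks in these per-layer scores would correspond to the middle-layer peaks the tweet describes.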
@cvenhoff00
Constantin Venhoff
2 months
How do we find these vectors? First, annotate reasoning chains with GPT-4o, then average the internal activations for each behavior across its annotated text positions. Lastly, subtract the overall mean, and that's your steering vector. Simple and surprisingly effective (sketch after this tweet).
2
0
8
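A minimal sketch of that mean-difference recipe, assuming `token_acts` holds residual-stream activations at one layer ([num_tokens, d_model]) and `labels` carries a GPT-4o behavior tag per token position ("none" where unannotated). The names and shapes are my assumptions, not the paper's code.

```python
import torch

def steering_vectors(token_acts: torch.Tensor, labels: list[str]) -> dict[str, torch.Tensor]:
    overall_mean = token_acts.mean(dim=0)                     # mean over all positions
    vectors = {}
    for behavior in sorted(set(labels) - {"none"}):
        mask = torch.tensor([lab == behavior for lab in labels])
        behavior_mean = token_acts[mask].mean(dim=0)          # mean over annotated positions
        vectors[behavior] = behavior_mean - overall_mean      # subtract the overall mean
    return vectors
```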
@cvenhoff00
Constantin Venhoff
2 months
We tested this across 500 tasks in 10 categories using three DeepSeek-R1-Distill models (Qwen-14B, Qwen-1.5B & Llama-8B). The steering effects are consistent and robust; you can literally dial specific reasoning behaviors up or down at inference time 🚀
1
0
8
@cvenhoff00
Constantin Venhoff
2 months
The key insight: reasoning behaviors like backtracking and knowledge addition live in linear directions in the model's activation space. Want more backtracking? Add the vector. Less? Subtract it. Simple, interpretable, and it works! (Sketch after this tweet.)
1
1
13
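A minimal sketch of that add/subtract step at inference time, assuming a HuggingFace-style decoder whose blocks return a tuple with the hidden states first; `layer`, `vec`, and the scale values are illustrative, not the paper's code.

```python
import torch

def add_steering_hook(layer, vec: torch.Tensor, scale: float):
    """Add `scale * vec` to the layer's hidden states on every forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vec.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage: a positive scale dials the behavior up, a negative one dials it down.
# handle = add_steering_hook(model.model.layers[17], vectors["backtracking"], scale=4.0)
# output = model.generate(**inputs, max_new_tokens=256)
# handle.remove()
```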
@cvenhoff00
Constantin Venhoff
2 months
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples, just by adding steering vectors to its activations! Details in 🧵👇
4
27
168
@cvenhoff00
Constantin Venhoff
2 months
Huge thanks to collaborators @NeelNanda5 @soniajoseph_ @philiptorr @ashk__on for making this happen! Paper: Code & weights: github.com/cvenhoff/vlm-mapping
0
0
19
@cvenhoff00
Constantin Venhoff
2 months
💡 Summary: Visual representations align with language features in middle-to-late layers, not early ones. This raises key questions: How important is LLM feature "initialization" for VLM performance? Are current adapter architectures optimal for cross-modal learning? (6/6)
3
0
15
@cvenhoff00
Constantin Venhoff
2 months
📊 Key finding: Reconstruction error drops, sparsity decreases, AND semantic alignment between SAE features and visual content sharply increases around layer 18 of 26! Visual-linguistic alignment happens in middle-to-late layers, not early ones. (5/6)
1
0
16
@cvenhoff00
Constantin Venhoff
2 months
🔍 Using pre-trained sparse autoencoders as analytical probes, we tracked how visual representations flow through the language model layers. We measured SAE reconstruction error and sparsity patterns to detect when visual features align with language representations (rough sketch after this tweet). (4/6)
1
0
11
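A minimal sketch of that probing step, assuming a pre-trained SAE object with `encode`/`decode` methods and `visual_acts[layer]` holding residual-stream activations at the image-token positions for a given layer; the interface is an assumption, not the paper's code.

```python
import torch

def probe_layer(sae, acts: torch.Tensor):
    """Return (reconstruction error, average number of active SAE features)."""
    feats = sae.encode(acts)                                 # sparse feature activations
    recon = sae.decode(feats)                                # reconstruct from SAE features
    recon_err = (recon - acts).pow(2).mean().item()          # mean squared reconstruction error
    l0 = (feats != 0).float().sum(dim=-1).mean().item()      # sparsity: avg active features per token
    return recon_err, l0

# Hypothetical usage, sweeping all layers:
# per_layer = {layer: probe_layer(saes[layer], visual_acts[layer]) for layer in saes}
```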
@cvenhoff00
Constantin Venhoff
2 months
✅ Despite this constraint, our VLM (CLIP-Gemma-2-2b-it) performs on par with LLaVA-v1, showing that the LLM's feature space already contains good representations for visual instruction tuning; the adapter just needs to align them effectively. (3/6)
2
2
18
@cvenhoff00
Constantin Venhoff
2 months
🔬 Our approach: We deliberately froze both the vision transformer (CLIP ViT) AND the language model (Gemma-2-2B-it), training only a linear adapter between them. This forced the adapter to map visual features directly into the LLM's existing feature space (sketch after this tweet). (2/6)
3
0
17
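A minimal sketch of that setup: both backbones frozen, only a linear map from CLIP patch features into the LLM's embedding space is trained. The dimensions (1024 for a CLIP ViT-L/14, 2304 for Gemma-2-2B) and variable names are assumptions; the paper's actual adapter and training loop may differ.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Project frozen CLIP patch features into the frozen LLM's embedding space."""
    def __init__(self, clip_dim: int = 1024, lm_dim: int = 2304):
        super().__init__()
        self.proj = nn.Linear(clip_dim, lm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # [batch, num_patches, clip_dim] -> [batch, num_patches, lm_dim]
        return self.proj(patch_feats)

# for p in vision_tower.parameters():   p.requires_grad_(False)  # freeze CLIP ViT
# for p in language_model.parameters(): p.requires_grad_(False)  # freeze Gemma-2-2B-it
# optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)   # train only the adapter
```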
@cvenhoff00
Constantin Venhoff
2 months
🔍 New paper: How do vision-language models actually align visual and language representations? We used sparse autoencoders to peek inside VLMs and found something surprising about when and where cross-modal alignment happens! Presented at XAI4CV Workshop @ CVPR. 🧵 (1/6)
9
44
302
@cvenhoff00
Constantin Venhoff
5 months
RT @MarroSamuele: LLMs are continuous models, but language is discrete. What happens when a continuous model approximates a discrete sequen…
0
9
0