Explore tweets tagged as #interpretability
Are you a high-agency, early- to mid-career researcher or engineer who wants to work on AI interpretability? We're looking for several Research Fellows and Research Engineering Fellows to start this fall.
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
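The diffing idea in the tweet above can be sketched in a few lines. This is a hedged illustration, not the paper's method: all names here (`activation_diff`, the synthetic activations) are invented for the demo, and real use would collect residual-stream activations from the base and finetuned models on a shared probe set.

```python
import numpy as np

def activation_diff(base_acts: np.ndarray, tuned_acts: np.ndarray, top_k: int = 5):
    """Compare mean activations of a base vs. finetuned model on the same prompts.

    base_acts, tuned_acts: (n_prompts, hidden_dim) arrays of activations at
    some layer. Returns the indices of the hidden dimensions whose mean
    activation shifted the most after finetuning, plus the shifts themselves.
    """
    shift = tuned_acts.mean(axis=0) - base_acts.mean(axis=0)
    top = np.argsort(-np.abs(shift))[:top_k]
    return top, shift[top]

# Synthetic demo: pretend finetuning "moved" dimensions 3 and 7.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 16))
tuned = base.copy()
tuned[:, 3] += 2.0   # large positive shift
tuned[:, 7] -= 1.5   # large negative shift
top, shifts = activation_diff(base, tuned, top_k=2)
print(sorted(top.tolist()))  # -> [3, 7]
```

In practice the interesting step is interpreting those top directions (e.g. projecting them onto a vocabulary or feature dictionary), which is where an interpretability agent could take over.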
We launched a new post-transformer architecture, Baby Dragon Hatchling (BDH), paving the way for autonomous AI. Our paper, The Missing Link Between the Transformer and Models of the Brain, tackles key AI challenges: generalization over time, real-time learning, and interpretability.
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
new blog post! why do LLMs freak out over the seahorse emoji? i put llama-3.3-70b through its paces with the logit lens to find out, and explain what the logit lens (everyone's favorite underrated interpretability tool) is in the process. link in reply!
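The logit lens mentioned above projects each layer's intermediate residual-stream state through the model's unembedding matrix to see which token the model "currently" favors. A minimal numpy sketch with toy weights (the shapes and names are illustrative; a real implementation would also apply the final LayerNorm before unembedding, and would pull states from an actual model such as llama-3.3-70b):

```python
import numpy as np

def logit_lens(hidden_states: np.ndarray, W_U: np.ndarray, vocab: list):
    """Project per-layer residual-stream states through the unembedding
    matrix W_U and report the top token at each layer.

    hidden_states: (n_layers, d_model); W_U: (d_model, vocab_size).
    """
    logits = hidden_states @ W_U            # (n_layers, vocab_size)
    return [vocab[i] for i in logits.argmax(axis=-1)]

# Toy demo: 3 "layers", 4-dim residual stream, 3-token vocab.
vocab = ["sea", "horse", "emoji"]
W_U = np.eye(4)[:, :3]                      # unembed dims 0..2 onto the vocab
hidden = np.array([
    [1.0, 0.1, 0.0, 0.0],   # early layer leans toward "sea"
    [0.2, 1.0, 0.0, 0.0],   # middle layer flips to "horse"
    [0.0, 0.3, 1.0, 0.0],   # late layer settles on "emoji"
])
print(logit_lens(hidden, W_U, vocab))  # -> ['sea', 'horse', 'emoji']
```

Watching the top token change layer by layer like this is exactly the kind of trace the blog post uses to explain the seahorse-emoji behavior.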
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out: yes. We found that semantic concepts transfer across text, ASCII, and SVG:
neural networks for time series forecasting with multiple architectures and built-in interpretability
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at https://t.co/L7s8vPfeds!
🕳️🐇Into the Rabbit Hull – Part I (Part II tomorrow). An interpretability deep dive into DINOv2, one of vision's most important foundation models. Buckle up: today we're exploring some of its most charming features.
Smol win: just realized the work I published last year got its first citation :) I really want to contribute to the world of Mechanistic Interpretability, and the thought of being cited alongside the giants of a field I am so passionate about made my day!
Very proud of this work! We are making nice progress towards LLM debugging using mechanistic interpretability tools. Check it out!
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.
How do we know what machines know? How can we understand the new, emerging behavior of machines? In this conversation between @_beenkim from @GoogleDeepMind and the @buZZrobot community, we dug into the challenges of interpretability and explored how humans and machines can
Over the past few months, I’ve heard the same complaint from nearly every collaborator working on computational cogsci + behavioral and mechanistic interpretability: “Open-source VLMs are a pain to run, let alone analyze.” We finally decided to do something about it (thanks
Some unified visualizations of many modern LLM interpretability methods — sharing along with slides from my recent lectures!
Had a blast at CoLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
Hot take: prompt optimization is the future of interpretability
OpenAI, Anthropic, and DeepMind just dropped one of the most important alignment papers of the year. It shows that the only window we have into AI reasoning, chain-of-thought traces, is already disappearing. If we lose it, we lose interpretability. Here's the full breakdown:
career update: ml researcher
done:
> built a proprietary ML pipeline for GNNs, exploring GCN, SAGE, GAT, GNNIE; some dev work
future work:
> studying GNNs as gradient flow; geometric & Bayesian GNNs; working on interpretability, inference & full-stack dev
The *standard* and *perturbed* models are trained to be minimal pairs in terms of initialization and training data order, except for <1% of inserted tokens. We hope this can support future work in mechanistic interpretability and unlearning. https://t.co/2h34DKVphw