Explore tweets tagged as #interpretability
@GoodfireAI
Goodfire
15 days
Are you a high-agency, early- to mid-career researcher or engineer who wants to work on AI interpretability? We're looking for several Research Fellows and Research Engineering Fellows to start this fall.
7
16
150
@jkminder
Julian Minder
4 days
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
2
28
142
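The activation-diff idea in the tweet above can be sketched in a few lines. This is a toy illustration, not the paper's actual method: the activations are random stand-ins with a planted "finetuning direction", and all names (`acts_base`, `acts_ft`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts, d_model = 32, 64

# Hypothetical activations at one layer for the same prompts,
# before and after finetuning (random stand-ins here).
acts_base = rng.normal(size=(n_prompts, d_model))
shift = np.zeros(d_model)
shift[[3, 17]] = 2.0  # planted "finetuning direction" in dims 3 and 17
acts_ft = acts_base + shift + rng.normal(scale=0.05, size=(n_prompts, d_model))

# Mean activation difference: the trace the finetune leaves behind.
diff = (acts_ft - acts_base).mean(axis=0)

# The dimensions with the largest shift point at what changed.
top_dims = sorted(np.argsort(-np.abs(diff))[:2].tolist())
print(top_dims)  # → [3, 17], recovering the planted direction
```

In practice the diff would be taken over real model activations and interpreted (e.g. by projecting onto a vocabulary or handing it to an agent), but the core statistic is just this mean difference.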
@pathway_com
Pathway (www.pathway.com)
23 days
We launched a new post-transformer architecture, Baby Dragon Hatchling (BDH), paving the way for autonomous AI. Our paper, The Missing Link Between the Transformer and Models of the Brain, tackles key AI challenges: generalization over time, real-time learning, and interpretability.
8
135
130
@TonyWangIV
Tony Wang
1 day
New paper! We show how to give an LLM the ability to accurately verbalize what changed about itself after a weight update is applied. We see this as a proof of concept for a new, more scalable approach to interpretability.🧵
11
46
435
@voooooogel
thebes
19 days
new blog post! why do LLMs freak out over the seahorse emoji? i put llama-3.3-70b through its paces with the logit lens to find out, and explain what the logit lens (everyone's favorite underrated interpretability tool) is in the process. link in reply!
12
32
298
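The logit lens mentioned above has a simple core: decode the residual stream at every layer with the model's final unembedding, instead of only at the last layer. A minimal sketch with a toy random model (all weights and sizes here are made up, not llama-3.3-70b):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_layers = 16, 50, 4

# Toy stand-ins for a trained model's weights (random, purely illustrative).
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix
layers = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def logit_lens(h0):
    """Project the residual stream through the unembedding at every layer."""
    h, per_layer_top = h0, []
    for W in layers:
        h = h + layer_norm(h) @ W        # toy residual block
        logits = layer_norm(h) @ W_U     # the logit-lens projection
        per_layer_top.append(int(logits.argmax()))
    return per_layer_top

tops = logit_lens(rng.normal(size=(d_model,)))
print(tops)  # one "predicted token id" per layer
```

Watching how these per-layer predictions drift toward (or away from) the final answer is exactly the kind of probe the blog post uses to explain the seahorse-emoji behavior.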
@tarngerine
julius tarng cyber inspector
3 hours
What happens when you turn a designer into an interpretability researcher? They spend hours staring at feature activations in SVG code to see if LLMs actually understand SVGs. It turns out – yes~ We found that semantic concepts transfer across text, ASCII, and SVG:
2
5
73
@tom_doerr
Tom Dörr
13 days
neural networks for time series forecasting with multiple architectures and built-in interpretability
1
33
228
@ndif_team
NDIF
14 days
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at https://t.co/L7s8vPfeds!
1
4
8
@Napoolar
Thomas Fel
10 days
🕳️🐇 Into the Rabbit Hull – Part I (Part II tomorrow). An interpretability deep dive into DINOv2, one of vision's most important foundation models. Buckle up, we're exploring some of its most charming features.
10
118
637
@akankshanc
Akanksha
25 days
Smol win, just realized the work I published last year got its first citation :) I really want to contribute to the world of Mechanistic Interpretability, and the thought of being cited alongside the giants of a field I'm so passionate about made my day!
3
6
138
@nicola_cancedda
Nicola Cancedda
9 hours
Very proud of this work! We are making nice progress towards LLM debugging using mechanistic interpretability tools. Check it out!
@zhengzhao97
Zheng Zhao
11 hours
Thrilled to share our latest research on verifying CoT reasoning, completed during my recent internship at FAIR @metaai. In this work, we introduce Circuit-based Reasoning Verification (CRV), a new white-box method to analyse and verify how LLMs reason, step by step.
2
3
13
@sopharicks
Sophia
21 days
How do we know what machines know? How can we understand the new, emerging behavior of machines? In this conversation between @_beenkim from @GoogleDeepMind and the @buZZrobot community, we dug into the challenges of interpretability and explored how humans and machines can
1
2
11
@ziqiao_ma
Martin Ziqiao Ma
21 days
Over the past few months, I’ve heard the same complaint from nearly every collaborator working on computational cogsci + behavioral and mechanistic interpretability: “Open-source VLMs are a pain to run, let alone analyze.” We finally decided to do something about it (thanks
8
30
180
@csinva
Chandan Singh
22 days
Some unified visualizations of many modern LLM interpretability methods — sharing along with slides from my recent lectures!
1
2
11
@Itay_itzhak_
Itay Itzhak @ NYC 🗽🎗️
11 days
Had a blast at CoLM! It really was as good as everyone says, congrats to the organizers 🎉 This week I’ll be in New York giving talks at NYU, Yale, and Cornell Tech. If you’re around and want to chat about LLM behavior, safety, interpretability, or just say hi - DM me!
0
5
54
@ShashwatGoel7
Shashwat Goel
1 day
Hot take: prompt optimization is the future of interpretability
7
2
25
@connordavis_ai
Connor Davis
18 days
OpenAI, Anthropic, and DeepMind just dropped one of the most important alignment papers of the year. It shows that the only window we have into AI reasoning, chain-of-thought traces, is already disappearing. If we lose it, we lose interpretability. Here's the full breakdown:
8
4
26
@curlysaarthak
juggernaut
15 days
career update: ml researcher
done:
> built a proprietary ML pipeline for GNNs, exploring GCN, SAGE, GAT, GNNIE; some dev work
future work:
> studying GNNs as gradient flow, geometric & Bayesian GNNs; working on interpretability, inference & full-stack dev
56
10
502
@ameya_godbole1
Ameya Godbole
6 hours
The *standard* and *perturbed* models are trained to be minimal pairs in terms of initialization and training data order, except for <1% of inserted tokens. We hope this can support future work in mechanistic interpretability and unlearning. https://t.co/2h34DKVphw
1
0
2