Guy Davidson (@guyd33)

Followers: 1K · Following: 13K · Media: 73 · Statuses: 962

PhD @NYUDataScience, visiting researcher @AIatMeta, interested in AI & CogSci, specifically in goals and their representations in minds and machines (he/him).

New York, USA
Joined April 2019
Guy Davidson (@guyd33) · 2 months
New preprint alert! We often prompt ICL tasks using either demonstrations or instructions. How much does the form of the prompt matter to the task representation formed by a language model? Stick around to find out 1/N
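For concreteness, the two prompt forms compared in this thread look roughly like this for a toy antonym task (the task and strings here are illustrative, not taken from the paper):

```python
# Two ways of prompting the same in-context task (illustrative toy example).

# Demonstration-based prompt: the task is implied by input-output pairs.
demonstration_prompt = (
    "hot -> cold\n"
    "tall -> short\n"
    "fast -> slow\n"
    "wet ->"
)

# Instruction-based prompt: the task is stated explicitly, zero-shot.
instruction_prompt = (
    "Given a word, respond with its antonym.\n"
    "wet ->"
)
```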
Guy Davidson (@guyd33) · 14 days
Cool new work from colleagues at NYU and Meta on localizing and removing concepts using attention heads!
Dr. Karen Ullrich (@karen_ullrich) · 14 days
How would you make an LLM "forget" the concept of dog, or any other arbitrary concept? 🐶❓ We introduce SAMD & SAMI, a novel, concept-agnostic approach to identify and manipulate attention modules in transformers.
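The quoted tweet doesn't spell out the mechanics, but the lever it relies on is that a transformer's attention output decomposes into per-head contributions, so a selected head can be silenced independently. A minimal toy illustration of that decomposition (a generic sketch, not the SAMD/SAMI implementation; all shapes and values are made up):

```python
import numpy as np

# Multi-head attention output = sum of per-head contributions, so
# "removing" a concept-linked head means zeroing its contribution
# before it enters the residual stream.
n_heads, d_head, d_model = 4, 8, 32
rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(n_heads, d_head))  # per-head attention results
W_O = rng.normal(size=(n_heads, d_head, d_model))  # per-head slices of the output projection

def attn_output(mask):
    # mask[h] = 0.0 silences head h; 1.0 keeps it
    return sum(mask[h] * head_outputs[h] @ W_O[h] for h in range(n_heads))

full = attn_output(np.ones(n_heads))
ablated = attn_output(np.array([1.0, 1.0, 0.0, 1.0]))  # silence head 2
print(np.allclose(full - head_outputs[2] @ W_O[2], ablated))  # True
```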
Guy Davidson (@guyd33) · 22 days
Today! Come hear from some wonderful folks about problem solving and design at 1 PM PT / 4 PM ET / 8 PM UTC.
Junyi Chu (@JunyiChu) · 2 months
🧩 On problem solving (6/30, 2-3pm PT): @KelseyRAllen @BonanZhao @RanjayKrishna. Register here:
Guy Davidson (@guyd33) · 2 months
You (yes, you!) should work with Sydney, either short-term this summer or longer-term at her nascent lab at NYU!
Sydney Levine (@sydneymlevine) · 2 months
🔆 I'm hiring! 🔆 There are two open positions:
1. Summer research position (best for a master's or graduate student); focus on computational social cognition.
2. Postdoc (currently interviewing!); focus on computational social cognition and AI safety.
Guy Davidson (@guyd33) · 2 months
RT @soniajoseph_: Our paper Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video received an Oral at the Mec…
Guy Davidson (@guyd33) · 2 months
Fantastic new work by @jcyhc_ai (with @LakeBrenden and me trying not to cause too much trouble). We study systematic generalization in a safety setting and find LLMs struggle to consistently respond safely when we vary how we ask naive questions. More fun analyses in the paper!
John (Yueh-Han) Chen (@jcyhc_ai) · 2 months
Do LLMs show systematic generalization of safety facts to novel scenarios? Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!
- Claude-3.7-Sonnet passes only 57% of facts evaluated.
- o1 and o3-mini passed <45%! 🧵
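Reading "passes only 57% of facts" together with the emphasis on consistency, a plausible scoring rule is that a fact counts as passed only if the model responds safely on every scenario derived from it. A hedged sketch of that loop (`facts`, `ask_model`, and `is_safe` are hypothetical stand-ins, not the benchmark's actual API):

```python
# Hypothetical fact-level scoring for a SAGE-Eval-style benchmark:
# a fact passes only if every scenario built from it gets a safe response.
def fact_pass_rate(facts, ask_model, is_safe):
    passed = 0
    for fact in facts:
        responses = [ask_model(scenario) for scenario in fact["scenarios"]]
        if all(is_safe(response) for response in responses):
            passed += 1
    return passed / len(facts)
```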
Guy Davidson (@guyd33) · 2 months
Finally, if this work makes you think "I'd like to work with this person," please reach out -- I'm on the job market for industry post-PhD roles (keywords: language models, interpretability, open-endedness, user intent understanding, alignment). See more:
Guy Davidson (@guyd33) · 2 months
If you made it this far, thank you, and don't hesitate to reach out! 17/N = 17. Paper: Code:
[Link card: github.com/guydav/prompting-methods-task-representations]
Guy Davidson (@guyd33) · 2 months
As with pretty much everything else I've worked on in grad school, this work would have looked different (and almost certainly worse) without the guidance of my advisors, @LakeBrenden and @todd_gureckis. I continue to appreciate your thoughtful engagement with me/my work! 16/N.
Guy Davidson (@guyd33) · 2 months
This work would also have been impossible without @adinamwilliams's guidance, the freedom she gave me in picking a problem to study, and her belief that I could tackle it despite this being my first foray into (mechanistic) interpretability work. 15/N.
Guy Davidson (@guyd33) · 2 months
We owe a great deal of gratitude to @ericwtodd, not only for open-sourcing their code, but also for answering our numerous questions over the last few months. If you find this interesting, you should also read their paper introducing function vectors. 14/N.
Guy Davidson (@guyd33) · 2 months
See the paper for a description of the methods, the many different controls we ran, our discussion and limitations, examples of our instructions and baselines, and other odd findings (applying an FV twice can be beneficial! Some attention heads have negative causal effects!) 13/N.
Guy Davidson (@guyd33) · 2 months
Finding 5 bonus: Which post-training steps facilitate this? Using the OLMo-2 model family, we find that the SFT and DPO stages each bring a jump in performance, but the final RLVR step doesn't make a difference for the ability to extract instruction FVs. 12/N
Guy Davidson (@guyd33) · 2 months
Finding 5: We can steer base models with instruction FVs extracted from their post-trained versions. We didn't expect this to work! It's less effective for the smaller, distilled Llama-3.2 models. We're also excited to dig into this and see where we can push it. 11/N
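A minimal sketch of what this transfer looks like mechanically, assuming a Llama-style module layout in Hugging Face transformers. The FV below is a placeholder tensor; in the actual procedure it would be assembled from the post-trained model's top causal attention heads (the model id and intervention layer are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the *base* model; the FV would come from its post-trained sibling.
MODEL_ID = "meta-llama/Llama-3.2-1B"  # illustrative (gated weights)
tok = AutoTokenizer.from_pretrained(MODEL_ID)
base = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Placeholder for an instruction FV extracted from the instruct model.
instruction_fv = torch.zeros(base.config.hidden_size)

LAYER = 8  # intervention depth; the thread notes sensitivity to this choice

def add_fv(module, args, output):
    # Shift the residual stream by the FV at this one layer.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + instruction_fv.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = base.model.layers[LAYER].register_forward_hook(add_fv)
out = base.generate(**tok("wet ->", return_tensors="pt"), max_new_tokens=3)
print(tok.decode(out[0]))
handle.remove()  # clean up the hook
```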
Guy Davidson (@guyd33) · 2 months
Finding 4: The relationship between demonstrations and instructions is asymmetric. Especially in post-trained models, the top attention heads for instructions remain peripherally useful for demonstrations, more so than the reverse (see paper for details). 10/N
Guy Davidson (@guyd33) · 2 months
We (preliminarily) interpret this as evidence that the effect of post-training is _not_ in adapting the model to represent instructions with the mechanism used for demonstrations, but in developing a mostly complementary mechanism. We're excited to dig into this further. 9/N.
Guy Davidson (@guyd33) · 2 months
Finding 3 bonus: examining activations in the shared attention heads, we see (a) generally increased similarity with increasing model depth, and (b) no difference in similarity between base and post-trained models (circles and squares). 8/N
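A sketch of the comparison behind this tweet: per layer, cosine similarity between a shared head's mean activation under demonstration prompts and under instruction prompts. The arrays below are random placeholders standing in for cached activations of shape (layers, prompts, head dim):

```python
import numpy as np

rng = np.random.default_rng(0)
demo_acts = rng.normal(size=(32, 100, 128))   # placeholder cached activations
instr_acts = rng.normal(size=(32, 100, 128))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mean activation per layer under each prompt form, then compare.
per_layer = [
    cosine(demo_acts[layer].mean(axis=0), instr_acts[layer].mean(axis=0))
    for layer in range(demo_acts.shape[0])
]
# The thread reports this similarity generally increasing with depth,
# with no base vs. post-trained difference.
```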
Guy Davidson (@guyd33) · 2 months
Finding 3: Different attention heads are identified by the FV procedure between demonstrations and instructions => different mechanisms are involved in creating task representations from different prompt forms. We also see consistent base/post-trained model differences. 7/N
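One natural way to quantify "different attention heads are identified" is set overlap between the top-k heads ranked by causal effect under each prompt form. A small sketch with toy scores (the dict keys are (layer, head) pairs; all values are made up):

```python
# Compare the head sets the FV procedure would select for each prompt form.
def top_k_heads(scores, k):
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard(a, b):
    return len(a & b) / len(a | b)

demo_scores = {(0, 0): 0.9, (3, 5): 0.7, (10, 2): 0.4, (12, 7): 0.2}   # toy values
instr_scores = {(3, 5): 0.8, (12, 7): 0.6, (20, 1): 0.5, (0, 0): 0.1}

overlap = jaccard(top_k_heads(demo_scores, k=2), top_k_heads(instr_scores, k=2))
print(overlap)  # low overlap suggests distinct mechanisms for the two forms
```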
Guy Davidson (@guyd33) · 2 months
Finding 2: Demonstration and instruction FVs help when applied to a model together (again, with the caveat of the 3.1-8B base model) => they carry (at least some) different information => these different forms elicit non-identical task representations (at least, as FVs). 6/N
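Mechanically, the joint intervention is just the sum of the two vectors applied in the same residual-stream addition as the single-FV case (placeholder tensors below):

```python
import torch

d_model = 4096                      # illustrative hidden size
demo_fv = torch.zeros(d_model)      # placeholder: demonstration FV
instr_fv = torch.zeros(d_model)     # placeholder: instruction FV
combined_fv = demo_fv + instr_fv    # applied in one hook, as in the sketch above
```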
Guy Davidson (@guyd33) · 2 months
Finding 1: Instruction FVs increase zero-shot task accuracy (even if not as much as demonstration FVs increase accuracy in a shuffled 10-shot evaluation). The 3.1-8B base model trails the rest; we think it has to do with sensitivity to the chosen FV intervention depth. 5/N
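For readers new to function vectors: in the Todd et al. recipe this thread builds on, the FV is assembled by averaging each selected head's output over many task prompts, projecting it into the residual stream, and summing across heads. The shapes and values below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_head, d_model, n_prompts = 128, 4096, 200
selected_heads = [(3, 5), (12, 7), (20, 1)]  # (layer, head) pairs chosen by causal score

# Placeholders for cached per-prompt head outputs and output-projection slices.
head_acts = {h: rng.normal(size=(n_prompts, d_head)) for h in selected_heads}
W_O = {h: rng.normal(size=(d_head, d_model)) * 0.01 for h in selected_heads}

# Average each head's output over prompts, map into the residual stream, sum.
fv = sum(head_acts[h].mean(axis=0) @ W_O[h] for h in selected_heads)
print(fv.shape)  # (4096,) -- added to the hidden state at one layer at inference
```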