Patrick Da Silva

@patrickqdasilva

Followers: 36 · Following: 1 · Media: 7 · Statuses: 14

Incoming PhD Student at The Ohio State University

Joined September 2023
@patrickqdasilva
Patrick Da Silva
15 days
Super grateful to have received a Senior Area Chair highlight at #ACL2025NLP. ⏳ The generalization of interpretability-based steering methods is at an inflection point. 🚂 As a community, we need to place strong emphasis on method-reliability evals if we care about long-term impact.
[Quoted tweet from @aclmeeting (ACL 2025)]
@patrickqdasilva
Patrick Da Silva
29 days
🌟 Excited to announce that “Steering off Course” was accepted to #ACL2025NLP for an Oral and Panel Discussion! 📍 Wed, 9AM, Level 2 Hall A 🇨🇦. I will also share this work at the Actionable Interpretability workshop @ActInterp at #ICML2025. 📍 Sat, 1PM, East Ballroom A.
arxiv.org
Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily...
@patrickqdasilva
Patrick Da Silva
2 months
RT @HananeNMoussa: 📢 Excited to announce the first project of my PhD! Through our work we address the training data scarcity to develop AI…
@patrickqdasilva
Patrick Da Silva
4 months
We hope to inspire further research into internal transformer mechanism variance, so future steering methods can be robust and adaptable to new releases 🥕🐇💨. Special shoutout to my advisor @shocheen and collaborators @HariSethuraman @dheerajgopal @HannaHajishirzi. (10/10)
@patrickqdasilva
Patrick Da Silva
4 months
We report many aggregated results in our paper, and invite researchers to comb through the extensive results in our repository to build intuitions about model variance. Our paper: Code, Data, Results, and Figures for all LMs: (9/10).
@patrickqdasilva
Patrick Da Silva
4 months
✨The localization hypothesis does not always hold for FVs✨ Simple word-pair ICL translation from English into another language requires more heads to be effective. Post-trained models are more steerable when more heads are used in the FV. (8/10)
@patrickqdasilva
Patrick Da Silva
4 months
✨The localization hypothesis does not always hold for FVs✨ FVs rely on the assumption that the information required for ICL is stored and activated within a small subset of heads 🎯. But certain models require many heads in their FV before recovering performance 🙉🙉🙉➡️📈 (7/10)
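A very rough sketch of the function-vector idea for a GPT-2-style Hugging Face model, my own illustration rather than the paper's pipeline: average the residual-stream contributions of a few chosen attention heads over ICL prompts. The real method selects heads by their causal effect; the (layer, head) pairs and prompts below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
cfg = model.config
head_dim = cfg.hidden_size // cfg.num_attention_heads

chosen = [(9, 3), (10, 7)]  # hypothetical (layer, head) pairs
icl_prompts = ["hot:cold, big:small, fast:", "up:down, good:bad, old:"]

captured = {}  # layer index -> c_proj input at the last token position
def make_hook(layer):
    def hook(module, inputs):
        captured[layer] = inputs[0][:, -1].detach()  # (batch, n_heads * head_dim)
    return hook

handles = [model.transformer.h[l].attn.c_proj.register_forward_pre_hook(make_hook(l))
           for l, _ in chosen]

fv = torch.zeros(cfg.hidden_size)
with torch.no_grad():
    for p in icl_prompts:
        model(**tok(p, return_tensors="pt"))
        for layer, head in chosen:
            # keep only this head's slice, project it back to the residual stream
            # through the attention output matrix (bias handling glossed over)
            z = torch.zeros_like(captured[layer])
            s = slice(head * head_dim, (head + 1) * head_dim)
            z[:, s] = captured[layer][:, s]
            fv += (z @ model.transformer.h[layer].attn.c_proj.weight)[0] / len(icl_prompts)

for h in handles:
    h.remove()
print(fv.shape)  # the FV would then be added to a middle layer's hidden state at inference
```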
@patrickqdasilva
Patrick Da Silva
4 months
🔔Neither Function Vectors nor Task Vectors are generalizable🔔 Several model families, even after significant hyperparameter tuning, show no improvement, or even a decline, in relevant steering metrics 🧰📉 (6/10)
@patrickqdasilva
Patrick Da Silva
4 months
🔔Neither Function Vectors nor Task Vectors are generalizable🔔 Even with the best-performing tool, FVs with a full hyperparameter search, only 76% of model-task combinations recover 50% of 5-shot performance. (5/10)
@patrickqdasilva
Patrick Da Silva
4 months
DoLa’s poor efficacy could stem from the flawed assumption that the factual knowledge of an LM evolves gradually across layers. Correct and incorrect answer tokens both have low probabilities before spiking at the same layer, suggesting that contrasts with early layers are uninformative. (4/10)
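The diagnostic described above can be reproduced with a per-layer logit-lens read-out; a rough re-implementation (my own sketch, with GPT-2 and a toy prompt standing in for the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
correct_id = tok(" Paris", add_special_tokens=False).input_ids[0]
wrong_id   = tok(" London", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # logit-lens read-out: final norm + unembedding at the last position of this layer
    probs = torch.softmax(model.lm_head(model.transformer.ln_f(h[:, -1])), dim=-1)
    print(f"layer {layer:2d}  P(' Paris')={probs[0, correct_id].item():.4f}  "
          f"P(' London')={probs[0, wrong_id].item():.4f}")
```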
@patrickqdasilva
Patrick Da Silva
4 months
DoLa contrasts token probabilities across layers to enhance factuality. ✅ Consistent with prior work, we find that DoLa works decently for Llama 1 on TruthfulQA and FACTOR. ❌ However, for all other models tested, the improvements afforded by DoLa on most metrics are negligible. (3/10)
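Roughly, DoLa re-scores next-token probabilities by contrasting a late ("mature") layer with an earlier ("premature") one; the full method also selects the premature layer dynamically and constrains the candidate token set. A simplified sketch of the contrast itself (not the official implementation; model and layer choices are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of Australia is", return_tensors="pt")
with torch.no_grad():
    out = model(**ids, output_hidden_states=True)

def layer_logprobs(idx):
    # logit-lens read-out: final norm + unembedding applied to layer idx
    h = out.hidden_states[idx][:, -1]
    return torch.log_softmax(model.lm_head(model.transformer.ln_f(h)), dim=-1)

premature, mature = 4, len(out.hidden_states) - 1  # early layer is a hyperparameter
contrast = layer_logprobs(mature) - layer_logprobs(premature)
print(tok.decode(contrast.argmax(-1).item()))
```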
@patrickqdasilva
Patrick Da Silva
4 months
We examine steering methods inspired by Logit Lens and Activation Patching, specifically: 1️⃣ DoLa 2️⃣ Function Vectors 3️⃣ Task Vectors (2/10)
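For readers unfamiliar with the second inspiration, here is a toy activation-patching sketch (my own illustration; GPT-2, the layer index, and the prompts are arbitrary stand-ins): cache one layer's last-token hidden state on a "clean" prompt and splice it into a run on a "corrupted" prompt, to test where task-relevant information lives.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer_idx = 5  # hypothetical layer to patch

clean   = tok("Paris is the capital of France. Q: capital of France? A:", return_tensors="pt")
corrupt = tok("Rome is the capital of Italy. Q: capital of France? A:", return_tensors="pt")

cached = {}
def save_hook(module, inputs, output):
    cached["h"] = output[0][:, -1].detach()  # last-token hidden state

def patch_hook(module, inputs, output):
    h = output[0].clone()
    h[:, -1] = cached["h"]                   # splice the cached state into this run
    return (h,) + output[1:]

block = model.transformer.h[layer_idx]
with torch.no_grad():
    handle = block.register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    handle = block.register_forward_hook(patch_hook)
    logits = model(**corrupt).logits[0, -1]
    handle.remove()

print(tok.decode(logits.argmax().item()))
```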
@patrickqdasilva
Patrick Da Silva
4 months
Steering language models by directly intervening on internal activations is appealing, but does it generalize? We study 3 popular steering methods with 36 models from 14 families (1.5B–70B), exposing brittle performance and fundamental flaws in underlying assumptions. 🧵👇 (1/10)
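A minimal sketch of what steering by intervening on internal activations can look like in practice (an illustration only, not the paper's code; the model, layer index, steering vector, and scale below are all placeholder assumptions): register a forward hook that adds a vector to one layer's residual stream.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper evaluates 36 models from 14 families
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6                                  # hypothetical intervention layer
steer = torch.randn(model.config.hidden_size)  # placeholder steering vector
steer = 4.0 * steer / steer.norm()             # the scale is a tunable hyperparameter

def add_steering(module, inputs, output):
    # decoder blocks return a tuple; output[0] holds the hidden states
    hidden = output[0] + steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=5)[0]))
handle.remove()
```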