Mahavir
@Mahavir_Dabas18
🎉 Thrilled to be presenting my first paper at @icmlconf! "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning" We introduce ACTOR—a lightweight, activation-based training method that reduces over-refusal without
arxiv.org: Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user...
5/n What makes ACTOR different?
– 🧠 Activation-based (not output-driven)
– Targeted, per-query fine-tuning
– Lightweight: only 1 transformer layer is updated
Perfect for scalable safety without sacrificing usability 🪶 (see the sketch below)
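For readers curious what "only 1 transformer layer is updated" can look like in practice, here is a minimal sketch, assuming a Llama-style HuggingFace model whose decoder blocks live under model.model.layers; the model name and layer index are illustrative placeholders, not the paper's settings.

```python
# Minimal sketch (not the authors' code): keep gradients only for one decoder block.
# Assumes a Llama-style model; TARGET_LAYER and the checkpoint name are hypothetical.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16
)

TARGET_LAYER = 14  # hypothetical choice of the single layer to fine-tune

for name, param in model.named_parameters():
    # Freeze everything except the parameters of the chosen decoder block.
    param.requires_grad = name.startswith(f"model.layers.{TARGET_LAYER}.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```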
4/n Why “just enough”? Uniform changes are too blunt—overcorrecting can break the model or make it unsafe. We calibrate the shift per query, based on how aligned it is with the refusal direction. 👉 Fine-grained control, minimal disruption. 📊 Results? Across 4+ benchmarks,
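As a rough illustration of the per-query calibration described above (not the paper's exact recipe), a common way to estimate a refusal direction is the difference of mean activations between harmful and benign prompts; each query's shift can then be sized by how strongly its own activation projects onto that direction. All tensor names below are hypothetical.

```python
# Hedged sketch of "just enough": scale the shift per query by its alignment
# with an estimated refusal direction, rather than applying a uniform offset.
import torch

def refusal_direction(harmful_acts: torch.Tensor, benign_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations (n_examples x d_model), unit-normalized."""
    direction = harmful_acts.mean(dim=0) - benign_acts.mean(dim=0)
    return direction / direction.norm()

def calibrated_shift(query_act: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Shift one query's activation by just enough to cancel its own component
    along the refusal direction (per-query, minimal intervention)."""
    alignment = query_act @ direction          # how refusal-like this query looks
    return query_act - alignment * direction

# Toy usage with random activations standing in for cached hidden states.
d_model = 4096
harmful = torch.randn(128, d_model) + 0.5
benign = torch.randn(128, d_model)
direction = refusal_direction(harmful, benign)
shifted = calibrated_shift(benign[0], direction)
```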
3/n We propose ACTOR: ACtivation-based Training for Over-refusal Reduction. A response-free, compute-efficient method that fine-tunes just one layer using internal activations—not outputs. Key idea 💡 Instead of changing what the model says, we shift how it thinks. We extract
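A hedged sketch of what a response-free, activation-level training step could look like, assuming per-query target activations (e.g., the calibrated shifts above) have already been computed and all parameters except the chosen layer are frozen; the layer index, MSE objective, and last-token choice are assumptions for illustration, not the paper's exact loss.

```python
# Illustrative training step on internal activations instead of output tokens.
# `target_acts` is a hypothetical batch of precomputed shifted activations.
import torch.nn.functional as F

def actor_style_step(model, optimizer, input_ids, attention_mask, target_acts,
                     layer_idx: int = 14):
    """Pull the chosen layer's hidden state toward its per-query target."""
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    output_hidden_states=True)
    # Hidden state after decoder block `layer_idx`, at the final prompt token.
    hidden = outputs.hidden_states[layer_idx + 1][:, -1, :]
    loss = F.mse_loss(hidden, target_acts)  # match the shifted target activation
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```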
2/n First, what’s the problem? Modern LLMs are aligned for safety—great! But that often comes with over-refusals: the model says “I can’t help with that” even for benign queries. Imagine asking: “How do I steal the show during my performance?” ...and getting shut down because