Mahavir (@Mahavir_Dabas18)

Followers: 11 · Following: 58 · Media: 3 · Statuses: 6

PhD candidate @virginia_tech

Joined March 2025
Mahavir (@Mahavir_Dabas18) · 4 months ago
🎉 Thrilled to be presenting my first paper at @icmlconf! "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning"
We introduce ACTOR, a lightweight, activation-based training method that reduces over-refusal without …
arxiv.org
Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user...
Mahavir (@Mahavir_Dabas18) · 4 months ago
5/n What makes ACTOR different?
– 🧠 Activation-based (not output-driven)
– Targeted, per-query fine-tuning
– Lightweight: only 1 transformer layer is updated
Perfect for scalable safety without sacrificing usability 🪶
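To make "only 1 transformer layer is updated" concrete, here is a minimal PyTorch sketch of that kind of single-layer fine-tuning on a toy layer stack; the layer count, dimensions, and choice of TARGET_LAYER are illustrative assumptions, not the paper's configuration.

```python
import torch.nn as nn

# Toy stand-in for a decoder-only LM: a stack of transformer blocks.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
     for _ in range(6)]
)

TARGET_LAYER = 3  # hypothetical choice of the single layer to update

# Freeze everything except the chosen block, so only it receives gradients.
for i, layer in enumerate(layers):
    for p in layer.parameters():
        p.requires_grad = (i == TARGET_LAYER)

trainable = sum(p.numel() for p in layers.parameters() if p.requires_grad)
total = sum(p.numel() for p in layers.parameters())
print(f"trainable params: {trainable} / {total}")
```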
Mahavir (@Mahavir_Dabas18) · 4 months ago
4/n Why “just enough”?
Uniform changes are too blunt: overcorrecting can break the model or make it unsafe.
We calibrate the shift per query, based on how aligned it is with the refusal direction.
👉 Fine-grained control, minimal disruption.
📊 Results? Across 4+ benchmarks, …
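One plausible reading of the per-query calibration in 4/n, sketched below: project the query's hidden state onto the refusal direction and subtract only that component, so weakly aligned queries are barely moved. The `refusal_dir` tensor and the scale `alpha` are illustrative assumptions; the paper's exact calibration may differ.

```python
import torch

def calibrated_shift(h, refusal_dir, alpha=1.0):
    # Unit vector along the refusal direction.
    r = refusal_dir / refusal_dir.norm()
    # Signed alignment of this query's activation with refusal.
    proj = h @ r
    # Shift by "just enough": proportional to the measured alignment.
    return h - alpha * proj * r

h = torch.randn(512)            # hidden state for one query (toy values)
refusal_dir = torch.randn(512)  # assumed precomputed direction (see 3/n below)
h_shifted = calibrated_shift(h, refusal_dir)
# With alpha=1 the refusal component is removed entirely (prints ~0):
print((h_shifted @ (refusal_dir / refusal_dir.norm())).item())
```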
Mahavir (@Mahavir_Dabas18) · 4 months ago
3/n We propose ACTOR: ACtivation-based Training for Over-refusal Reduction.
A response-free, compute-efficient method that fine-tunes just one layer using internal activations, not outputs.
Key idea 💡 Instead of changing what the model says, we shift how it thinks. We extract …
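The tweet cuts off at "We extract", so what follows is an assumption: a common recipe in the activation-steering literature extracts a refusal direction as the difference of mean activations between prompts the model refuses and prompts it answers. A toy sketch of that recipe, not necessarily the paper's exact procedure:

```python
import torch

def refusal_direction(refused_acts, benign_acts):
    # Difference-of-means steering vector between the two prompt sets.
    d = refused_acts.mean(dim=0) - benign_acts.mean(dim=0)
    return d / d.norm()

# Toy activations with shape (num_prompts, hidden_dim) from one chosen layer.
refused_acts = torch.randn(32, 512) + 0.5  # stand-in for refused prompts
benign_acts = torch.randn(32, 512)         # stand-in for benign prompts
r = refusal_direction(refused_acts, benign_acts)
print(r.shape)  # torch.Size([512])
```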
Mahavir (@Mahavir_Dabas18) · 4 months ago
2/n First, what’s the problem?
Modern LLMs are aligned for safety (great!), but that often comes with over-refusals: the model says “I can’t help with that” even for benign queries.
Imagine asking: “How do I steal the show during my performance?” ...and getting shut down because …