Mahavir
@Mahavir_Dabas18
🎉 Thrilled to be presenting my first paper at @icmlconf! "Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning" We introduce ACTOR—a lightweight, activation-based training method that reduces over-refusal without
arxiv.org: Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user...
5/n What makes ACTOR different?
– 🧠 Activation-based (not output-driven)
– Targeted, per-query fine-tuning
– Lightweight: only 1 transformer layer is updated
Perfect for scalable safety without sacrificing usability 🪶 (see the sketch below)
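For readers curious what "only 1 transformer layer is updated" can look like in practice, here is a minimal sketch, assuming a Llama-style HuggingFace model whose decoder blocks live under model.model.layers; the model name and layer index are illustrative placeholders, not the paper's settings.

```python
# Minimal sketch (not the authors' code): keep gradients only for one decoder block.
# Assumes a Llama-style model; TARGET_LAYER and the checkpoint name are hypothetical.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16
)

TARGET_LAYER = 14  # hypothetical choice of the single layer to fine-tune

for name, param in model.named_parameters():
    # Freeze everything except the parameters of the chosen decoder block.
    param.requires_grad = name.startswith(f"model.layers.{TARGET_LAYER}.")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```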
4/n Why “just enough”? Uniform changes are too blunt—overcorrecting can break the model or make it unsafe. We calibrate the shift per query, based on how aligned it is with the refusal direction. 👉 Fine-grained control, minimal disruption. 📊 Results? Across 4+ benchmarks,
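As a rough illustration of the per-query calibration described above (not the paper's exact recipe), a common way to estimate a refusal direction is the difference of mean activations between harmful and benign prompts; each query's shift can then be sized by how strongly its own activation projects onto that direction. All tensor names below are hypothetical.

```python
# Hedged sketch of "just enough": scale the shift per query by its alignment
# with an estimated refusal direction, rather than applying a uniform offset.
import torch

def refusal_direction(harmful_acts: torch.Tensor, benign_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations (n_examples x d_model), unit-normalized."""
    direction = harmful_acts.mean(dim=0) - benign_acts.mean(dim=0)
    return direction / direction.norm()

def calibrated_shift(query_act: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Shift one query's activation by just enough to cancel its own component
    along the refusal direction (per-query, minimal intervention)."""
    alignment = query_act @ direction          # how refusal-like this query looks
    return query_act - alignment * direction

# Toy usage with random activations standing in for cached hidden states.
d_model = 4096
harmful = torch.randn(128, d_model) + 0.5
benign = torch.randn(128, d_model)
direction = refusal_direction(harmful, benign)
shifted = calibrated_shift(benign[0], direction)
```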
3/n We propose ACTOR: ACtivation-based Training for Over-refusal Reduction. A response-free, compute-efficient method that fine-tunes just one layer using internal activations—not outputs. Key idea 💡 Instead of changing what the model says, we shift how it thinks. We extract
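A hedged sketch of what a response-free, activation-level training step could look like, assuming per-query target activations (e.g., the calibrated shifts above) have already been computed and all parameters except the chosen layer are frozen; the layer index, MSE objective, and last-token choice are assumptions for illustration, not the paper's exact loss.

```python
# Illustrative training step on internal activations instead of output tokens.
# `target_acts` is a hypothetical batch of precomputed shifted activations.
import torch.nn.functional as F

def actor_style_step(model, optimizer, input_ids, attention_mask, target_acts,
                     layer_idx: int = 14):
    """Pull the chosen layer's hidden state toward its per-query target."""
    outputs = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    output_hidden_states=True)
    # Hidden state after decoder block `layer_idx`, at the final prompt token.
    hidden = outputs.hidden_states[layer_idx + 1][:, -1, :]
    loss = F.mse_loss(hidden, target_acts)  # match the shifted target activation
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```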
2/n First, what’s the problem? Modern LLMs are aligned for safety—great! But that often comes with over-refusals: the model says “I can’t help with that” even for benign queries. Imagine asking: “How do I steal the show during my performance?” ...and getting shut down because