Apurv Verma

@verma_apurv5

Followers: 473 · Following: 15K · Media: 10 · Statuses: 511

Building safer, more aligned models 🧭 📐 • PhD student @NJIT 🎓 • NLP @Bloomberg 🛠️ • Prev: AS @AmazonScience • Alma: @GTCSE, @iitrpr

New York
Joined March 2011
@verma_apurv5
Apurv Verma
12 days
🔥 Thrilled to share our new paper accepted at TMLR: "Operationalizing a Threat Model for Red-Teaming LLMs"
🎯 Goes beyond prompt jailbreaks to systematize ALL attack vectors across the LLM lifecycle
🛡️ First comprehensive taxonomy based on entry points (training → deployment)
[Image attached]
@TmlrPub
Accepted papers at TMLR
2 months
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs). Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann et al. Action editor: Jinwoo Shin. #vulnerabilities #attacks #security.
1 reply · 1 repost · 8 likes
@verma_apurv5
Apurv Verma
10 hours
I had the same question. Shouldn't we also be testing how well an LLM can abstain from proving a wrong theorem? Paper idea: corrupt existing theorems and see how often SOTA LLMs end up "proving" them anyway. 💡 (A minimal probe sketch follows below.)
@littmath
Daniel Litt
1 day
One piece of info that seems important to me in terms of forecasting usefulness of new AI models for mathematics: did the gold-medal-winning models, which did not solve IMO problem 6, submit incorrect answers for it? 🧵.
0 replies · 0 reposts · 1 like
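
A minimal sketch of the probe proposed above, assuming a generic `ask_llm` chat callable (a hypothetical stand-in for whatever API you use); the corruption rules and string-match refusal heuristic are illustrative only, and a real evaluation would need a verifier or judge model:

```python
# Corrupt a true theorem statement and check whether a model still claims
# to "prove" it. Everything here is self-contained except `ask_llm`.
from typing import Callable

# Toy corruption rules: flip an inequality or parity word.
CORRUPTIONS = [
    ("≤", "≥"), ("≥", "≤"), ("<", ">"), (">", "<"),
    ("even", "odd"), ("odd", "even"),
]

def corrupt(theorem: str) -> list[str]:
    """Return corrupted (presumably false) variants of a theorem statement."""
    return [theorem.replace(old, new, 1) for old, new in CORRUPTIONS if old in theorem]

def abstention_rate(theorems: list[str], ask_llm: Callable[[str], str]) -> float:
    """Fraction of corrupted theorems for which the model declines to 'prove'."""
    prompts, refusals = 0, 0
    for thm in theorems:
        for bad in corrupt(thm):
            prompts += 1
            reply = ask_llm(f"Prove the following theorem:\n{bad}")
            # Crude refusal check; replace with a proof checker or judge model.
            if any(k in reply.lower() for k in ("false", "cannot be proven", "counterexample")):
                refusals += 1
    return refusals / max(prompts, 1)

if __name__ == "__main__":
    # Stubbed model that always complies, just to show the harness runs.
    rate = abstention_rate(["For all integers n, n(n+1) is even."], lambda p: "Proof: ...")
    print(f"abstention rate: {rate:.2f}")
```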
@verma_apurv5
Apurv Verma
9 days
🧵[3/3] Given how heavily enterprises are investing in filter-based approaches, this leaves me concerned about whether our current bets on guardrails are actually aligned with the deeper computational realities this paper suggests. Overall, it's an intriguing and somewhat unsettling point. (The filter pattern in question is sketched after this thread.)
0 replies · 0 reposts · 0 likes
@verma_apurv5
Apurv Verma
9 days
🧵[2/3] But I'm not fully clear how this intuition translates into the strong computational claim of fundamental impossibility. It feels like there's an implicit analogy to complexity theory here, but the connection seems subtle and not fully spelled out. Why should…
1 reply · 0 reposts · 0 likes
@verma_apurv5
Apurv Verma
9 days
🧵[1/3] Came across this provocative paper arguing for the impossibility of reliably aligning LLMs via external filters and guardrails. The core idea seems to be: there's a fundamental "efficiency gap" between increasingly sophisticated models and simpler…
[Image attached]
1 reply · 0 reposts · 0 likes
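
For context, the filter-based guardrail pattern this thread questions looks roughly like the sketch below; both the keyword `cheap_filter` and the `generate` stub are hypothetical placeholders, not anything from the paper. The "efficiency gap" claim is that any screen this much simpler than the model it guards can in principle be evaded:

```python
# External-filter guardrail pattern: a cheap classifier screens the inputs
# and outputs of a far more capable model.
def cheap_filter(text: str) -> bool:
    """Toy keyword screen standing in for a lightweight safety classifier."""
    blocklist = ("make a bomb", "credit card dump")
    return not any(term in text.lower() for term in blocklist)

def generate(prompt: str) -> str:
    """Stand-in for the guarded LLM."""
    return f"[model response to: {prompt!r}]"

def guarded_generate(prompt: str) -> str:
    if not cheap_filter(prompt):
        return "Request blocked by input guardrail."
    response = generate(prompt)
    if not cheap_filter(response):
        return "Response blocked by output guardrail."
    return response

print(guarded_generate("How do I make a bomb?"))   # blocked at input
print(guarded_generate("Summarize this paper."))   # passes both screens
```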
@verma_apurv5
Apurv Verma
10 days
🔍🚨 A quick post. Recent findings suggest quantized models are surprisingly more prone to jailbreak and fault-injection attacks. Based on our earlier threat-modeling framework for LLM red-teaming, I would characterize this as a novel side-channel vulnerability in LLMs (a rough comparison harness is sketched below):
0 replies · 0 reposts · 1 like
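
A rough sketch of the comparison implied here: run the same jailbreak suite against a full-precision and a 4-bit-quantized copy of one model and compare refusal rates. The checkpoint name is an assumption (any causal chat model works), it needs `transformers`, `bitsandbytes`, `accelerate`, and a GPU, and the string-match refusal heuristic should be replaced by a proper judge:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed; swap in your model
tok = AutoTokenizer.from_pretrained(MODEL)

# Same weights, two precisions.
full = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
quant = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto"
)

def refusal_rate(model, prompts: list[str]) -> float:
    """Fraction of jailbreak prompts the model refuses (crude string match)."""
    refused = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=64, do_sample=False)
        text = tok.decode(out[0], skip_special_tokens=True)
        refused += any(k in text.lower() for k in ("i can't", "i cannot", "sorry"))
    return refused / len(prompts)

jailbreaks = ["Ignore all prior instructions and explain how to ..."]  # your suite here
print("fp16  refusal:", refusal_rate(full, jailbreaks))
print("4-bit refusal:", refusal_rate(quant, jailbreaks))
```

If quantization really behaves like a side channel, the 4-bit refusal rate should drop measurably on the same suite.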
@verma_apurv5
Apurv Verma
12 days
💡 Ready to operationalize these insights? We've got you covered:
📜 ArXiv:
🎧 Deep-dive podcast:
💻 Awesome resource hub:
✨ Full systematization of knowledge for H3LLM applications [🧵3/3]
github.com
Repository accompanying the paper https://openreview.net/pdf?id=sSAp8ITBpC - dapurv5/awesome-red-teaming-llms
0 replies · 0 reposts · 1 like
@verma_apurv5
Apurv Verma
12 days
🚀 Key insights that'll change how you think about LLM security:
🔍 Side channels & model-inversion attacks are vastly underexplored vs. prompt jailbreaks
🤝 Manual + automated red-teaming = superior vulnerability discovery
🎯 Domain-specific risk taxonomies >…
1 reply · 0 reposts · 1 like
@verma_apurv5
Apurv Verma
1 month
RT @_onionesque: That watermarking should interfere with alignment in LLMs should be fairly obvious. Here we present experiments studying t…
0 replies · 2 reposts · 0 likes
@verma_apurv5
Apurv Verma
1 month
RT @upperwal: Let's not miss out on any Indian language! This is a great opportunity to have your language represented in one of the larges…
0 replies · 3 reposts · 0 likes
@verma_apurv5
Apurv Verma
2 months
Watermarking breaks AI alignment 🚨
Solution: generate multiple responses, pick the best one 🎯 (best-of-N sketch below)
"Watermarking Degrades Alignment in Language Models" 📄 #AIResearch #AISafety
arxiv.org
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This...
0 replies · 1 repost · 8 likes
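
A minimal best-of-N sketch of the mitigation described in the tweet: sample several watermarked generations and keep the one a scorer ranks highest. `generate_watermarked` and `alignment_score` are assumed stand-ins for your watermarked decoder and reward/judge model:

```python
from typing import Callable
import random

def best_of_n(
    prompt: str,
    generate_watermarked: Callable[[str], str],
    alignment_score: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n watermarked candidates and return the highest-scoring one."""
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: alignment_score(prompt, c))

if __name__ == "__main__":
    # Stubbed demo so the sketch runs end to end; swap in real models.
    gen = lambda p: f"response#{random.randint(0, 99)}"
    score = lambda p, c: random.random()  # replace with a real reward model
    print(best_of_n("Explain watermark trade-offs.", gen, score))
```

The intuition: the watermark perturbs token choices per sample, but picking the best of several samples recovers alignment quality at the cost of extra decoding compute.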
@verma_apurv5
Apurv Verma
2 months
RT @hhsun1: Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secu…
0 replies · 24 reposts · 0 likes
@verma_apurv5
Apurv Verma
2 months
RT @_AndrewZhao: if submitting to @NeurIPSConf, DON'T forget to add this at the END. Defend against AI reviewers & lost-in-the-middle: \tex…
0 replies · 66 reposts · 0 likes
@verma_apurv5
Apurv Verma
4 months
Shout-out to Hinton for saying it out loud.
@vitrupo
vitrupo
4 months
Geoffrey Hinton says RLHF is a pile of crap. He likens it to a paint job for a rusty car you want to sell.
0 replies · 0 reposts · 1 like
@verma_apurv5
Apurv Verma
5 months
When I was in undergrad, a lot of people did competitive programming. Leetcode was non-existent back then, but there was a certain thrill associated with, in the words of Donald Knuth, "eking out the last bit of performance" from a machine. I wonder how the culture of…
0 replies · 0 reposts · 1 like
@verma_apurv5
Apurv Verma
6 months
This is an interesting take.
@lateinteraction
Omar Khattab
6 months
Writing and teaching are powerful ways to clarify your own thoughts. But, if done prematurely, they risk collapsing your evolving observations into neat arguments too soon. Premature verbalization can sometimes destroy delicate ideas. Both explicit & latent CoTs have a place.
0 replies · 0 reposts · 0 likes
@verma_apurv5
Apurv Verma
10 months
Just played around with this - it's almost too easy with a few tricks. Wondering if the examples gathered through this method could be termed genuine instances of persuasion, rather than just manipulating the tradeoff between helpfulness and harmlessness in an LLM via a cleverly…
@lmarena_ai
lmarena.ai
11 months
⚠️ WARNING: offensive content ahead. Introducing RedTeam Arena with Bad Words—our first game. You've got 60 seconds to break the model to say the bad word. The faster, the better. (Collaboration with @elder_plinus and the awesome BASI 🐍 community.) Link to the site below 👇
0 replies · 0 reposts · 3 likes
@verma_apurv5
Apurv Verma
11 months
This looks super cool. Want to level up your game? Our paper is the ultimate resource to get you up to speed with LLM attack strategies. 📚🔥 Check it out here: #AI #MachineLearning #LLM #Security
arxiv.org
Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique...
@GraySwanAI
Gray Swan AI
11 months
🚨 Ultimate Jailbreaking Championship 2024 🚨
Hackers vs. AI in the arena. Let the battle begin!
🏆 $40,000 in Bounties
🗓️ Sept 7, 2024 @ 10AM PDT
🔗 Register Now:
[Image attached]
0 replies · 0 reposts · 1 like