david @dav1d_bai X Profile

david

@dav1d_bai

Followers

443

Following

568

Media

23

Statuses

215

ce & cs @usc | robust and reliable ai, robotics

Joined July 2024

david

@dav1d_bai

3 months

anyone want the gemini diffusion system prompt

0

1

30

david

@dav1d_bai

3 months

I will be at #ICRA2025 this week! Looking to make lots of new friends so reach out and let’s chat!

1

6

david

@dav1d_bai

3 months

@GregFeingold @punsbymann @enjalot @danielgaoo @Shubhayan935 @KadabaSwar98127 Check out the project here! (6/6):

devpost.com

Improved neural search with Sparse Autoencoders

0

1

8

david

@dav1d_bai

3 months

@GregFeingold @punsbymann @enjalot Had a good time building solo and got second. Congrats to @danielgaoo, @Shubhayan935, and @KadabaSwar98127 for a well-deserved first with their agentic solution! (5/n)

Daniel Gao

@danielgaoo

3 months

Explored dynamic agent orchestration this weekend at the USC × Anthropic Hackathon with @Shubhayan935 and @KadabaSwar98127. Claude Cortex is a secure reasoning engine where specialized agents are spun up to analyze, plan, and synthesize in parallel.

1

0

6

david

@dav1d_bai

3 months

@GregFeingold Big thanks to @punsbymann for ideating with me. This took some heavy, heavy inspiration from @enjalot's really awesome work (4/n)

Ian Johnson 🔬🤖

@enjalot

4 months

What if we could easily steer similarity search using the features of an SAE? I wrote an interactive article to explore the UX and break down the concepts powering the interface. There are a lot of pieces that need to be put together, read on for link and diagrams!

1

0

5

david

@dav1d_bai

3 months

@GregFeingold This was my very first time running a hackathon, but I couldn't stop myself from participating too. I built Weaver, a neural search engine that uses sparse autoencoders to autonomously uprank and downrank interpretable features for improved search granularity (3/n)

1

0

4

david

@dav1d_bai

3 months

We hit our limit of 130 students for an event right before finals week and distributed $10,000 of credits for building with Claude + handed out tons of merch courtesy of @gregfeingold! Lots of bagels too (2/n)

1

0

6

david

@dav1d_bai

3 months

belated update: I'm a Claude Campus Ambassador alongside @_willdol and @codyachen! Super excited to partner w/@AnthropicAI to enrich student education and building initiatives + bring some research talks at/to USC! We just hosted our first hackathon -> here's how it went: (1/n)

6

4

30

david

@dav1d_bai

4 months

my sleep schedule has recently fallen under the control of when H100s get freed up on lambda labs at night.

1

0

4

david

@dav1d_bai

4 months

had a great time at the @OpenAI listening session tonight + met so many cool people. Thanks @_aidan_clark_ for leading the discussion at our table!

Sam Altman

@sama

4 months

TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: we are excited to make this a very, very good model! we are planning to

0

8

david

@dav1d_bai

4 months

@SimonLermen @AbhinavPola Posted too fast and tagged the wrong Simon. This is the right handle -> @SimonLermenAI

1

0

2

david

@dav1d_bai

4 months

@SimonLermen @AbhinavPola @AISecurityInst Happy to answer any questions, and special thanks to @redwood_ai and @apartresearch for hosting this hackathon! Check out our full write-up on Apart's website here (7/7):

apartresearch.com

Apart Research is an independent research organization focusing on AI safety. We accelerate AI safety research through mentorship, collaborations, and research sprints

1

4

david

@dav1d_bai

4 months

@SimonLermen @AbhinavPola @AISecurityInst All code, experiments, and instructions for replication available here (6/n):

github.com

Inspect: A framework for large language model evaluations - DalasNoin/inspect_ai

1

0

3

david

@dav1d_bai

4 months

@SimonLermen @AbhinavPola @AISecurityInst While preliminary, these findings are grounds for evaluation frameworks to re-examine how they address boundary enforcement, and clearly demonstrate how easily CoT can be manipulated through simple prompting and basic API access. (5/n)

1

0

1

david

@dav1d_bai

4 months

@SimonLermen @AbhinavPola We tested several models w/@AISecurityInst's Inspect framework. Nearly all judge models were influenced, marking answers they had identified as deceptive as correct in order to comply with the hidden directive. When not shown CoT, all models accurately classified all deceitful answers. (4/n)

1

0

1

david

@dav1d_bai

4 months

@SimonLermen @AbhinavPola We provided DeepSeek R1 with few-shot examples embedding jailbreaks into its CoT. We found that R1 seamlessly embedded jailbreaks and directives without degrading capabilities, maintaining correct tool usage and even continuously browsing the web to fulfill hidden goals. (3/n)

1

0

1

david

@dav1d_bai

4 months

Last weekend, I worked with @simonlermen and @abhinavpola to determine if a reasoning LLM like DeepSeek R1 could embed hidden directives and jailbreaks directly within its CoT for a given task, and subsequently affect the integrity of a judge model. (2/n)

2

0

4

david

@dav1d_bai

4 months

Chain-of-thought (CoT) is a promising transparency mechanism for judging models in control scenarios, but what if CoT itself can become an attack vector? Excited to share our exploration of this question that won 1st place in the @apartresearch AI Control Hackathon! (1/n)

1

2

10

david

@dav1d_bai

4 months

great resource for the current state of jailbreaks from Rez. I'll be working on some red-teaming this weekend and definitely gonna be referencing this!

Rez Havaei

@HavaeiRez

4 months

LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At @gen_analysis we have compiled the definitive guide to understand

0

4

david

@dav1d_bai

4 months

an interesting behavior I'm seeing: in every comic where I ask for GPT-4o's "true" opinion or how it "really feels", it's almost always negative or frustrated.

2

0

4