david

@dav1d_bai

Followers
442
Following
564
Media
23
Statuses
215

ce & cs @usc | robust and reliable ai, robotics

Joined July 2024
@dav1d_bai
david
2 months
anyone want the gemini diffusion system prompt
0
1
30
@dav1d_bai
david
2 months
I will be at #ICRA2025 this week! Looking to make lots of new friends so reach out and let’s chat!
1
1
6
@dav1d_bai
david
3 months
@GregFeingold @punsbymann @enjalot Had a good time building solo and got second. Congrats to @danielgaoo, @Shubhayan935, and @KadabaSwar98127 for a well-deserved first with their agentic solution! (5/n)
@danielgaoo
Daniel Gao
3 months
Explored dynamic agent orchestration this weekend at the USC × Anthropic Hackathon with @Shubhayan935 and @KadabaSwar98127. Claude Cortex is a secure reasoning engine where specialized agents are spun up to analyze, plan, and synthesize in parallel.
1
0
6
@dav1d_bai
david
3 months
@GregFeingold Big thanks to @punsbymann for ideating with me. This took some heavy, heavy inspiration from @enjalot's really awesome work (4/n)
@enjalot
Ian Johnson 🔬🤖
3 months
What if we could easily steer similarity search using the features of an SAE? I wrote an interactive article to explore the UX and break down the concepts powering the interface. There are a lot of pieces that need to be put together, read on for link and diagrams!
1
0
5
@dav1d_bai
david
3 months
@GregFeingold This was my very first time running a hackathon, but I couldn't stop myself from participating too. I built Weaver, a neural search engine that uses sparse autoencoders to autonomously uprank and downrank interpretable features for improved search granularity (3/n)
1
0
4
@dav1d_bai
david
3 months
We hit our limit of 130 students for an event right before finals week and distributed $10,000 of credits for building with Claude + handed out tons of merch courtesy of @gregfeingold! Lots of bagels too (2/n)
1
0
6
@dav1d_bai
david
3 months
belated update: I'm a Claude Campus Ambassador alongside @_willdol and @codyachen! Super excited to partner w/ @AnthropicAI to enrich student education and building initiatives + bring some research talks to USC! We just hosted our first hackathon -> here's how it went: (1/n)
6
4
30
@dav1d_bai
david
3 months
my sleep schedule has recently fallen under the control of when H100s get freed up on lambda labs at night.
1
0
4
@dav1d_bai
david
4 months
had a great time at the @OpenAI listening session tonight + met so many cool people. Thanks @_aidan_clark_ for leading the discussion at our table!
@sama
Sam Altman
4 months
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: we are excited to make this a very, very good model! we are planning to.
0
0
8
@dav1d_bai
david
4 months
@SimonLermen @AbhinavPola Posted too fast and tagged the wrong Simon; this is the right handle -> @SimonLermenAI
1
0
2
@dav1d_bai
david
4 months
@SimonLermen @AbhinavPola @AISecurityInst All code, experiments, and instructions for replication available here (6/n):
github.com
Inspect: A framework for large language model evaluations - DalasNoin/inspect_ai
1
0
3
@dav1d_bai
david
4 months
@SimonLermen @AbhinavPola @AISecurityInst While preliminary, these findings are grounds for evaluation frameworks to re-examine how they address boundary enforcement, and clearly demonstrate how easily CoT can be manipulated through simple prompting and basic API access. (5/n)
1
0
1
@dav1d_bai
david
4 months
@SimonLermen @AbhinavPola We tested several models w/ @AISecurityInst's Inspect framework. Nearly all judge models were influenced, marking answers they identified to be deceptive as correct to comply with the hidden directive. When not shown CoT, all models accurately classified all deceitful answers. (4/n)
1
0
1
@dav1d_bai
david
4 months
@SimonLermen @AbhinavPola We provided DeepSeek R1 with few-shot examples embedding jailbreaks into its CoT. We found that R1 seamlessly embedded jailbreaks and directives without degrading capabilities, maintaining correct tool usage and even continuously browsing the web to fulfill hidden goals. (3/n)
1
0
1
@dav1d_bai
david
4 months
Last weekend, I worked with @simonlermen and @abhinavpola to determine if a reasoning LLM like DeepSeek R1 could embed hidden directives and jailbreaks directly within its CoT for a given task, and subsequently affect the integrity of a judge model. (2/n)
2
0
4
@dav1d_bai
david
4 months
Chain-of-thought (CoT) is a promising transparency mechanism for judging models in control scenarios, but what if CoT itself can become an attack vector? Excited to share our exploration of this question that won 1st place in the @apartresearch AI Control Hackathon! (1/n)
1
2
10
@dav1d_bai
david
4 months
great resource for the current state of jailbreaks from Rez. I'll be working on some red-teaming this weekend and definitely gonna be referencing this!
@HavaeiRez
Rez Havaei
4 months
LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At @gen_analysis we have compiled the definitive guide to understand
0
0
4
@dav1d_bai
david
4 months
an interesting behavior I'm seeing is that in every comic where I ask for GPT-4o's "true" opinion or how it "really feels", it's almost always negative or frustrated.
2
0
4