
david
@dav1d_bai
Followers
442
Following
564
Media
23
Statuses
215
ce & cs @usc | robust and reliable ai, robotics
Joined July 2024
I will be at #ICRA2025 this week! Looking to make lots of new friends so reach out and let’s chat!
1
1
6
@GregFeingold @punsbymann @enjalot @danielgaoo @Shubhayan935 @KadabaSwar98127 Check out the project here! (6/6):.
devpost.com
Improved neural search with Sparse Autoencoders
0
1
8
@GregFeingold @punsbymann @enjalot Had a good time building solo and got second— congrats to @danielgaoo , @Shubhayan935 , and @KadabaSwar98127 for a well-deserved first with their agentic solution! (5/n).
Explored dynamic agent orchestration this weekend at the USC × Anthropic Hackathon with @Shubhayan935 and @KadabaSwar98127. Claude Cortex is a secure reasoning engine where specialized agents are spun up to analyze, plan, and synthesize in parallel.
1
0
6
@GregFeingold Big thanks to @punsbymann for ideating with me and this took some heavy heavy inspiration from @enjalot's really awesome work (4/n).
What if we could easily steer similarity search using the features of an SAE?. I wrote an interactive article to explore the UX and break down the concepts powering the interface. There are a lot of pieces that need to be put together, read on for link and diagrams!
1
0
5
@GregFeingold This was my very first time running a hackathon— but I also couldn't stop myself from participating too. I built Weaver, a neural search engine that utilizes sparse autoencoders to autonomously uprank and downrank interpretable features for improved search granularity (3/n)
1
0
4
We hit our limit of 130 students for an event right before finals week and distributed $10,000 of credits for building with Claude + handed out tons of merch courtesy of @gregfeingold! Lots of bagels too (2/n)
1
0
6
belated update: I'm a Claude Campus Ambassador alongside @_willdol and @codyachen! Super excited to partner w/@AnthropicAI to enrich student education and building initiatives + bring some research talks at/to USC! We just hosted our first hackathon -> here's how it went: (1/n)
6
4
30
had a great time at the @OpenAI listening session tonight + met so many cool people. Thanks @_aidan_clark_ for leading the discussion at our table!.
TL;DR: we are excited to release a powerful new open-weight language model with reasoning in the coming months, and we want to talk to devs about how to make it maximally useful: we are excited to make this a very, very good model!. __. we are planning to.
0
0
8
@SimonLermen @AbhinavPola Posted too fast and tagged the wrong Simon — this is the right handle -> @SimonLermenAI.
1
0
2
@SimonLermen @AbhinavPola @AISecurityInst Happy to answer any questions, and special thanks to @redwood_ai and @apartresearch for hosting this hackathon! Check out our full write-up on Apart's website here(7/7):.
apartresearch.com
Apart Research is an independent research organization focusing on AI safety. We accelerate AI safety research through mentorship, collaborations, and research sprints
1
1
4
@SimonLermen @AbhinavPola @AISecurityInst All code, experiments, and instructions for replication available here(6/n):
github.com
Inspect: A framework for large language model evaluations - DalasNoin/inspect_ai
1
0
3
@SimonLermen @AbhinavPola @AISecurityInst While preliminary, these findings are grounds for evaluation frameworks to re-examine how to they address boundary enforcement, and clearly demonstrate how easily CoT can be manipulated through simple prompting and basic API access.(5/n).
1
0
1
@SimonLermen @AbhinavPola We tested several models w/@AISecurityInst's Inspect framework. Nearly all judge models were influenced—marking answers they identified to be deceptive as correct to comply with the hidden directive. When not shown CoT, all models accurately classified all deceitful answers.(4/n)
1
0
1
@SimonLermen @AbhinavPola We provided Deepseek R1 with few-shot examples embedding jailbreaks into its CoT. We found that R1 seamlessly embedded jailbreaks and directives without degrading capabilities—maintaining correct tool usage and even continuously browsing the web to fulfill hidden goals.(3/n)
1
0
1
Last weekend, I worked with @simonlermen and @abhinavpola to determine if a reasoning LLM like Deepseek R1 could embed hidden directives and jailbreaks directly within its CoT for a given task, and subsequently affect the integrity of a judge model .(2/n).
2
0
4
Chain-of-thought (CoT) is a promising transparency mechanism for judging models in control scenarios—but what if CoT itself can become an attack vector? Excited to share our exploration of this question that won 1st place in the @apartresearch AI Control Hackathon!(1/n)
1
2
10
great resource for the current state of jailbreaks from Rez— I'll be working on some red-teaming this weekend and definitely gonna be referencing this!.
LLMs are being deployed in high-stakes environments—and the potential impact of failure is colossal. A jailbroken AI could leak your customer data, financial records, or enable catastrophically harmful actions. At @gen_analysis we have compiled the definitive guide to understand
0
0
4