
Rajashree Agrawal
@___rajashree___
Followers
248
Following
2K
Media
3
Statuses
137
RT @ycombinator: Theorem (@theoremlabs) is an AI-coding IDE for mission-critical software. They're making program verification 10,000 times…
0
23
0
RT @diagram_chaser: Mechanistic interpretability gives us rich explanations of models. But can we convert these explanations into formal pr…
0
34
0
RT @janleike: I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitorin…
0
254
0
MSJ is an order of magnitude more effective at jailbreaking Claude than SOTA attacks!!! The paper is a cool step in thinking about *model capabilities* as *attack surfaces*. Congrats to @cem__anil on knocking it out of the park! Super glad to have been a part of it!
New Anthropic research paper: Many-shot jailbreaking. We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers. Read our blog post and the paper here:
0
2
12
RT @PhilGalfond: Are you afraid to value bet unless you're almost positive your hand is good? It's not uncommon, but you're missing out on…
0
2
0
Instantiation in educational program surveys: every program I've been to gets about a 9/10 rating to the question meant to measure counterfactual impact. OK OK even if it wasn't "meant" to measure, it will later be used to make this claim.
The exact wording in polls, surveys and psychology studies matters more than most people seem to realize. This is a big deal if you want to learn from polls and academic papers, or if you conduct studies yourself. Here are some dramatic real examples:
0
0
1