vincent
@vvhuang_
Followers
1K
Following
2K
Media
28
Statuses
299
understanding models @TransluceAI, writing https://t.co/M7hdeAExFk previously: hotel manager @MIT, math @0xPARC
sf
Joined November 2020
At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)
5
39
249
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
7
36
204
thanks to @upcycledwords @cjquines @clairebookworm1 @laurgao for looking over drafts 🥰 narrator's opinions not necessarily my own https://t.co/9G3FEAwhYA
1
0
4
i've been experimenting with writing AI research fanfiction 🤪🥸 jokes aside, it's also a story about AI culture / putting people on pedestals / deciding what to believe in. hope you enjoy!
6
1
41
still have some reliability + sensitivity issues to work through. also brainstorming fun designs to decorate the top of the pad 🤩 if you have suggestions let me know
0
0
2
building a Dance Dance Revolution pad from first principles! nothing fancy, just wood + aluminum + wires + tape
4
0
50
i think it's really cute that Iowa State University writes language model reasoning papers about agriculture
0
0
14
sometimes you have to apply exponential backoff when texting new people
3
1
77
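The joke above borrows a real retry technique. A minimal sketch of capped exponential backoff, where each retry waits roughly twice as long as the last (the base delay, growth factor, and cap here are illustrative assumptions, not from the tweet):

```python
import random

def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6, jitter=False):
    """Yield retry delays that grow geometrically, capped at `cap` seconds.

    With jitter enabled, each delay is scaled by a random factor in [0, 1)
    to avoid many clients retrying in lockstep.
    """
    delay = base
    for _ in range(attempts):
        yield min(delay, cap) * (random.random() if jitter else 1.0)
        delay *= factor

# Without jitter the schedule is 1, 2, 4, 8, 16, 32 seconds.
```

In practice the jittered variant is usually preferred for network retries, since synchronized retries can re-trigger the overload that caused the failure.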
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🧵 (1/) https://t.co/IdBboD7NsP
OpenAI o3 and o4-mini https://t.co/giS4K1yNh9
429
1K
11K
🤮 building yet another LLM benchmark
🥰 building a tool that can make every existing benchmark more useful

very excited to share Docent: a system that can look through eval results and identify unusual model behaviors, cheating, env setup issues, etc. in just minutes!
To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵
0
1
26
back when I was young, I thought it was unrealistic for the Volunteer Fire Department to schism into a branch that fought fires and a branch that started them
13
42
677
everybody deserves to see interstellar in imax. i would get a lifetime amc a-list subscription if it meant interstellar would always be in theatres. i'd go every wednesday night and have my own chair and everything
1
3
103
check out our writeup and demo for more applications of understanding + steering models via neuron descriptions!
0
0
6
i think we have the most compelling explanation so far for why LLMs make mistakes like 9.11 > 9.9
1) we labeled every neuron in Llama3
2) when Llama says 9.11 > 9.9 we see influential groups of neurons about dates and bible verses
3) zeroing those allows Llama to answer correctly
why do language models think 9.11 > 9.9? at @transluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?
4
5
66
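The intervention in the thread above, zeroing a group of hidden neurons and re-running the model, can be sketched on a toy network. Everything here is an illustrative assumption (a tiny random MLP, hand-picked "date/verse" unit indices), not Transluce's actual labeling or ablation pipeline:

```python
import numpy as np

def forward(x, W1, W2, ablate=()):
    """Toy 2-layer MLP forward pass that zeroes selected hidden neurons."""
    h = np.maximum(0.0, x @ W1)      # hidden activations (ReLU)
    h[..., list(ablate)] = 0.0       # "zeroing" the chosen neuron group
    return h @ W2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))         # input -> 8 hidden units
W2 = rng.normal(size=(8, 2))         # hidden -> 2 outputs
x = rng.normal(size=(4,))

y_full = forward(x, W1, W2)
# pretend units 2 and 5 were flagged as the date/bible-verse group
y_ablated = forward(x, W1, W2, ablate=[2, 5])
```

Comparing `y_full` against `y_ablated` shows how much those units contribute; in the thread's experiment, the analogous comparison flipped Llama's answer to the correct one.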