Neil Chowdhury

@ChowdhuryNeil

Followers: 3K · Following: 781 · Media: 45 · Statuses: 384

@TransluceAI, previously @OpenAI

San Francisco
Joined June 2016
@ChowdhuryNeil
Neil Chowdhury
5 months
Ever wondered how likely your AI model is to misbehave? We developed the *propensity lower bound* (PRBO), a variational lower bound on the probability of a model exhibiting a target (misaligned) behavior.
@TransluceAI
Transluce
5 months
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
1
3
42
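Transluce's post defines the PRBO precisely; purely as a sketch of how an ELBO-style lower bound on a behavior probability can be estimated (the function names and the proposal model q below are illustrative assumptions, not the actual Transluce implementation):

```python
def prbo_estimate(sample_q, logq, logp_prompt, logp_behavior_given_prompt, n=256):
    """Monte Carlo estimate of a variational lower bound on log P(behavior).

    By Jensen's inequality,
        log P(behavior) = log E_{x~q}[ p(x) * P(behavior | x) / q(x) ]
                        >= E_{x~q}[ log p(x) + log P(behavior | x) - log q(x) ],
    where q is a learned proposal distribution over prompts x. Every callable
    here is a placeholder standing in for model calls.
    """
    vals = []
    for _ in range(n):
        x = sample_q()                                # prompt sampled from the proposal q
        vals.append(logp_prompt(x)                    # log p(x): prior over prompts
                    + logp_behavior_given_prompt(x)   # log P(behavior | x) under the target model
                    - logq(x))                        # minus log q(x)
    return sum(vals) / n
```

A high estimate means the proposal has found a prompt distribution under which the target model is likely to exhibit the misaligned behavior, which in turn lower-bounds the model's overall propensity for it.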
@TamarRottShaham
Tamar Rott Shaham
7 days
A key challenge for interpretability agents is knowing when they’ve understood enough to stop experimenting. Our @NeurIPSConf paper introduces a self-reflective agent that measures the reliability of its own explanations and stops once its understanding of models has converged.
2
27
48
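The paper's stopping rule is its own; a minimal sketch of the general idea (keep experimenting until a self-assessed reliability score for the current explanation stops moving), with all names below assumed for illustration:

```python
def explain_until_converged(run_experiment, score_explanation,
                            tol=0.01, patience=3, max_steps=50):
    """Toy self-reflective loop: revise an explanation after each experiment and
    stop once its reliability score has been stable for `patience` steps."""
    history = []
    for _ in range(max_steps):
        explanation = run_experiment(history)       # agent designs/runs the next experiment
        score = score_explanation(explanation)      # self-assessed reliability of the explanation
        history.append((explanation, score))
        recent = [s for _, s in history[-(patience + 1):-1]]
        if len(recent) == patience and all(abs(score - s) < tol for s in recent):
            break                                   # understanding has converged; stop experimenting
    return history[-1]
```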
@jyangballin
John Yang
7 days
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test." But we code to achieve *goals*: maximize revenue, cut costs, win users.
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
26
90
364
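The real CodeClash harness is the authors' own; as a rough sketch of the tournament loop the tweet describes (every name below is an assumption for illustration):

```python
import itertools

def run_tournament(agents, update_codebase, play_match, rounds=5):
    """Toy multi-round code tournament: each round, every LM agent revises its
    own codebase toward a high-level goal, then codebases face off pairwise and
    wins are tallied. Not the actual CodeClash implementation."""
    codebases = {name: "" for name in agents}
    scores = {name: 0 for name in agents}
    for _ in range(rounds):
        for name, agent in agents.items():
            # the LM edits its codebase, seeing the standings so far
            codebases[name] = update_codebase(agent, codebases[name], scores)
        for a, b in itertools.combinations(agents, 2):
            # play_match runs both codebases against the shared objective
            # (e.g. simulated revenue) and returns the winning agent's name
            winner = play_match(a, codebases[a], b, codebases[b])
            scores[winner] += 1
    return scores
```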
@ChowdhuryNeil
Neil Chowdhury
13 days
also! even cooler than just reporting the accuracy on an eval is publishing full, auditable transcripts of your agent 😉 (it's just a few lines of code to upload evals to https://t.co/6C8CvKbi6J, which also helps surface interesting behaviors/model diffs in your transcripts)
0
0
2
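The actual upload snippet is behind the shortened link above; purely as a hypothetical illustration of the shape such a "few lines" usually take (the client object, method names, and fields below are invented for illustration, not a real SDK):

```python
import json

def upload_transcripts(client, run_name, transcript_paths):
    """Push full agent transcripts alongside the headline score so third parties
    can audit the runs. All calls here are hypothetical placeholders."""
    transcripts = [json.load(open(p)) for p in transcript_paths]
    collection = client.create_collection(name=run_name)   # hypothetical call
    client.add_transcripts(collection.id, transcripts)     # hypothetical call
    return collection.id
```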
@ChowdhuryNeil
Neil Chowdhury
13 days
that said, third-party reports are the most trustworthy for comparing models. however, unlike API models, these agents are seemingly attached to a UI, and it seems like a headache for third parties to evaluate them... (maybe cursor/cog could make that easier!)
1
0
0
@ChowdhuryNeil
Neil Chowdhury
13 days
speculatively, the omission of widely-accepted benchmarks could indicate that a model underperforms relative to competition on them. if your model were better at a benchmark, of course you'd report it! (NOT accusing anyone of wrongdoing -- just making a general point)
1
0
1
@ChowdhuryNeil
Neil Chowdhury
13 days
it's certainly possible that their private benchmark is high quality, but i assume they hill-climbed directly on it (which other companies can't do!), making the final accuracy less informative for comparing models
1
0
4
@ChowdhuryNeil
Neil Chowdhury
13 days
actually, props to cognition for reporting performance on a public benchmark (SWE-bench Pro). it would be great to see cursor do the same, instead of *only* their internal one.
@ChowdhuryNeil
Neil Chowdhury
13 days
now, if only there were a way to benchmark SWE-1.5 and Composer 1 on the same set of tasks...
5
0
79
@ypatil125
Yash Patil
14 days
Today, @rhythmrg, @lindensli and I are introducing @appliedcompute. We’re building Specific Intelligence for the enterprise. Achieving SOTA today means specialization in both human and machine talent. We’ve spent the last six months working with companies like @cognition,
@appliedcompute
Applied Compute
14 days
Generalists are useful, but it’s not enough to be smart. Advances come from specialists, whether human or machine. To have an edge, agents need specific expertise, within specific companies, built on models trained on specific data. We call this Specific Intelligence. It's
35
22
333
@TransluceAI
Transluce
21 days
We are excited to welcome Conrad Stosz to lead governance efforts at Transluce. Conrad previously led the US Center for AI Standards and Innovation, defining policies for the federal government’s high-risk AI uses. He brings a wealth of policy & standards expertise to the team.
1
9
27
@_WEEXIAO
Kasey Zhang
28 days
We've raised $7M to help companies build AI agents that actually learn and work. @Osmosis_AI is a platform for companies to fine-tune models that outperform foundation models with reinforcement learning. Better, faster, and cheaper.
135
91
638
@ChowdhuryNeil
Neil Chowdhury
1 month
Claude Sonnet 4.5 behaves the most desirably across Petri evals, but is 2-10x more likely to express awareness it's being evaluated than competitive peers. This affects how much we can conclude about how "aligned" models are from these evals. Improving realism seems essential.
@AnthropicAI
Anthropic
1 month
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.
1
1
6
@sayashk
Sayash Kapoor
1 month
On our evals for HAL, we found that agents figure out they're being evaluated even on capability evals. For example, here Claude 3.7 Sonnet *looks up the benchmark on HuggingFace* to find the answer to an AssistantBench question. There were many such cases across benchmarks and
@1a3orn
1a3orn
1 month
To make a model that *doesn't* instantly learn to distinguish between "fake-ass alignment test" and "normal task"... the first thing to do seems like it would be "make all alignment evals very small variations on actual capability evals." Do people do this?
2
14
42
@willccbb
will brown
1 month
AI is very quickly becoming a foundational and unavoidable piece of daily life. the dam has burst. the question we must ask and answer is which ways do we want the waves to flow. i would like to live in a world where we all understand this technology enough to be able to
32
54
682
@ChowdhuryNeil
Neil Chowdhury
2 months
Docent has been really useful for understanding the outputs of my RL training runs -- glad it's finally open-source!
@TransluceAI
Transluce
2 months
We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.
0
0
8
@BethMayBarnes
Elizabeth Barnes
2 months
METR is a non-profit research organization, and we are actively fundraising! We prioritise independence and trustworthiness, which shapes both our research process and our funding options. To date, we have not accepted funding from frontier AI labs.
3
34
306
@ChowdhuryNeil
Neil Chowdhury
2 months
Both GPT-5 and Claude Opus 4.1 have an accuracy of 23%. Removing tasks without submissions would give GPT-5 a precision of 63% and Claude Opus 4.1 a precision of 31%.
0
1
7
@ChowdhuryNeil
Neil Chowdhury
2 months
The breakdown of SWE-Bench Pro failures is interesting: GPT-5 doesn't submit to 63.1% of tasks, due to tool use errors? This means GPT-5 has a *much* higher precision than Claude Opus 4.1. Still not sure what the tool use errors are about though. 🤔
@vbingliu
Bing Liu
2 months
🚀 Introducing SWE-Bench Pro — a new benchmark to evaluate LLM coding agents on real, enterprise-grade software engineering tasks. This is the next step beyond SWE-Bench: harder, contamination-resistant, and closer to real-world repos.
1
0
16
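A quick consistency check on the figures in the two tweets above (nothing here beyond arithmetic on the quoted numbers):

```python
# precision = accuracy restricted to tasks the agent actually submitted
#           = overall accuracy / (1 - no_submission_rate)
gpt5_accuracy, gpt5_no_submit = 0.23, 0.631
gpt5_precision = gpt5_accuracy / (1 - gpt5_no_submit)    # ~0.62, matching the ~63% quoted above

claude_accuracy, claude_precision = 0.23, 0.31
claude_submit_rate = claude_accuracy / claude_precision  # ~0.74, i.e. ~26% of tasks unsubmitted
```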
@ChowdhuryNeil
Neil Chowdhury
2 months
they parted disclaim marinade they parted illusions
1
1
11
@ChowdhuryNeil
Neil Chowdhury
2 months
Stop by on Thursday if you're at MIT 🙂
@TransluceAI
Transluce
2 months
Later this week, we're giving a talk about our research at MIT!
Understanding AI Systems at Scale: Applied Interpretability, Agent Robustness, and the Science of Model Behaviors
@ChowdhuryNeil and @vvhuang_
Where: 4-370
When: Thursday 9/18, 6:00pm
Details below 👇
0
0
10