daanish khazi
@bertgodel
Followers: 693
Following: 3K
Media: 18
Statuses: 670
@llmdataco (yc x25) | vernunft ist sprache (reason is language)
sf
Joined February 2018
(1/5) New post: "Mismatch Praxis: Rollout Settings and IS Corrections". We pressure-tested solutions for inference/training mismatch. Inference/training mismatch in modern RL frameworks creates a hidden off-policy problem. To resolve the mismatch, various engineering (e.g., FP16
6
40
120
Today we're announcing cua-bench: a framework for benchmarking, training data, and RL environments for computer-use AI agents. Why? Current agents show 10x variance across minor UI changes. Here's how we're fixing it.
20
35
117
Kudos to @qifang804 @bertgodel for the excellent empirical analysis on training-inference mismatch in LLM-RL! Quite scientific with controlled ablations. Their blog refers to our 3-part series for theoretical analysis and guiding principles: Part 1: Analytical framework (TV
3
1
29
very interesting thread on RL training/inference mismatch and how it impacts performance. pretty relevant given the direction RL finetuning is taking
2
1
28
Baseten is proud to support training jobs for The LLM Data Company 💪
0
3
16
We want to acknowledge @fengyao1909, @LiyuanLucas, Jiacai Liu, @RichardYRLi, @QPHutu, and many others for providing the foundation for this work
0
0
9
(4/5) Token-TIS remained stable throughout training. ESS stayed above 0.9, and it consistently outperformed both no-correction and Seq-TIS across all valid sampling configurations. Seq-TIS collapsed in every long-horizon run. 80-90% of trajectories received negligible weights.
1
0
13
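A minimal sketch of how an ESS figure like the one in (4/5) could be computed, assuming the normalized Kish effective sample size, (sum w)^2 / (n * sum w^2). The thread does not say which ESS definition was used, so the formula, the function name `normalized_ess`, and the toy weight lists below are illustrative assumptions, not the authors' code.

```python
def normalized_ess(weights):
    """Normalized Kish effective sample size, in (0, 1].

    Assumed definition: (sum w)^2 / (n * sum w^2).
    1.0 means all weights are equal; values near 1/n mean a single
    sample dominates the batch.
    """
    n = len(weights)
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if n == 0 or sum_sq == 0.0:
        raise ValueError("need at least one non-zero weight")
    return (total * total) / (n * sum_sq)


# Hypothetical weights: nearly uniform, as for well-behaved token weights.
near_uniform = [0.95, 1.0, 1.05, 1.0, 0.98, 1.02, 1.0, 1.0, 0.97, 1.03]

# Hypothetical weights: 80% of trajectories get negligible weight.
collapsed = [1.0, 1.0] + [1e-6] * 8

print(normalized_ess(near_uniform))
print(normalized_ess(collapsed))
```

Under this definition, the nearly uniform batch stays close to 1, while the batch where 8 of 10 weights are negligible drops to roughly 0.2.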
(3/5) Importance Sampling Corrections in Long Horizons: Token-level IS reweights each token separately, without accounting for context or path dependence, so it is biased. Sequence-level IS solves this by reweighting entire trajectories but suffers from variance as it takes
1
0
14
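A rough sketch of the token-level versus sequence-level contrast in (3/5), assuming truncated importance sampling where each weight is the trainer/rollout probability ratio clipped at a cap. The function names, the cap of 2.0, and the toy log-probabilities are hypothetical and are not taken from the post.

```python
import math


def token_tis_weights(train_logps, rollout_logps, cap=2.0):
    """One truncated weight per token: min(pi_train / pi_rollout, cap)."""
    return [
        min(math.exp(lt - lr), cap)
        for lt, lr in zip(train_logps, rollout_logps)
    ]


def seq_tis_weight(train_logps, rollout_logps, cap=2.0):
    """One truncated weight per trajectory: the product of all token
    ratios, accumulated in log space, then clipped at cap."""
    log_ratio = sum(lt - lr for lt, lr in zip(train_logps, rollout_logps))
    return min(math.exp(log_ratio), cap)


# Toy long-horizon trajectory (assumed numbers): the trainer assigns each
# sampled token a log-prob only 0.01 lower than the rollout engine did.
horizon = 1000
rollout_logps = [-1.00] * horizon
train_logps = [-1.01] * horizon

token_w = token_tis_weights(train_logps, rollout_logps)
seq_w = seq_tis_weight(train_logps, rollout_logps)

print(token_w[0])  # each token weight stays close to 1
print(seq_w)       # the product over 1000 tokens is close to 0
```

In this toy case a 1% per-token drift leaves every token weight near 0.99, while the sequence weight shrinks to about exp(-10), which is one way a trajectory can end up with a negligible weight over a long horizon.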
(2/5) Sampling Settings and Mismatch: Setting temperature below 1 (often recommended) concentrates mass on head tokens, amplifying disagreement about which token is most likely. In our experiments, removing any scaling (i.e., temp=1.0) was the most stable operating point.
1
0
16
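A small sketch of the temperature effect described in (2/5), assuming two engines whose logits differ slightly on the top two tokens. The logits and helper names are made up for illustration. Dividing logits by a temperature below 1 scales every logit gap by 1/T, so the same small disagreement turns into a larger log-probability ratio.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def max_abs_log_ratio(logits_a, logits_b, temperature):
    """Largest per-token |log p_a - log p_b| at a given temperature."""
    pa = softmax(logits_a, temperature)
    pb = softmax(logits_b, temperature)
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(pa, pb))


# Hypothetical logits: the two engines disagree on which token is on top.
rollout_logits = [2.0, 1.9, 0.0]
train_logits = [1.9, 2.0, 0.0]

for t in (1.0, 0.5):
    top_p = softmax(rollout_logits, t)[0]
    gap = max_abs_log_ratio(rollout_logits, train_logits, t)
    print(f"T={t}: top-token prob={top_p:.3f}, max |log ratio|={gap:.3f}")
```

With these numbers, moving from T=1.0 to T=0.5 raises the top-token probability and doubles the largest log-ratio between the two engines from about 0.1 to about 0.2.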
5/ @jayendra_ram + @seeklis from @hud_evals and @joeybesgen from TLDC discussed how to build and scale RL environments. Link to slides: https://t.co/H7MqgHSDwG TL;DR: RL environments have a "Goldilocks" level of difficulty - hard enough to hill climb on, but not too hard
2
5
15
We're going to teach people the best practices when making RL environments at @ycombinator this Saturday. If you're interested in the space and want to see the potential use cases of RL environments, come by! https://t.co/7rm3P5o4s5
@hud_evals (@jayendra_ram, @seeklis) and The LLM Data Company will break down how to build + scale RL environments:
4
5
45
We've raised $7M to help companies build AI agents that actually learn and work. @Osmosis_AI is a platform for companies to fine-tune models that outperform foundation models with reinforcement learning. Better, faster, and cheaper.
137
91
640
Fernstone is an AI-powered insurance brokerage for businesses. Their platform handles the back and forth with carriers, autonomously files paperwork, and instantly responds to customer requests. Congrats on the launch, @lukebbutton, @James_R_Chen, and @BryantLe18!
19
16
104
New blog post about asymmetry of verification and "verifier's law": https://t.co/bvS8HrX1jP Asymmetry of verification, the idea that some tasks are much easier to verify than to solve, is becoming an important idea as we have RL that finally works generally. Great examples of
54
247
2K
just one more environment. please man just one more environment. trust me the model will generalize then man. please man just one more environment. once we have robotics data. ok maybe logic puzzles. ok maybe mario kart. then we’ll be agi bro trust me bro please bro
8
3
202
RL environment specs are among the most consequential things we can write as AI researchers. A relatively short spec (e.g., <1000 words of instructions saying what problems to create and how to grade them) often gets expanded either by humans or via synthetic methods into
13
35
444