daanish khazi

@bertgodel

Followers: 693
Following: 3K
Media: 18
Statuses: 670

@llmdataco (yc x25) | vernunft ist sprache

sf
Joined February 2018
@bertgodel
daanish khazi
14 days
(1/5) New post: "Mismatch Praxis: Rollout Settings and IS Corrections". We pressure-tested solutions for inference/training mismatch. Inference/training mismatch in modern RL frameworks creates a hidden off-policy problem. To resolve the mismatch, various engineering (e.g., FP16
6
40
120
@trycua
Cua
22 hours
Today we're announcing cua-bench: a framework for benchmarking, training data, and RL environments for computer-use AI agents. Why? Current agents show 10x variance across minor UI changes. Here's how we're fixing it.
20
35
117
@RichardYRLi
Yingru Li
14 days
Kudos to @qifang804 @bertgodel for the excellent empirical analysis on training-inference mismatch in LLM-RL! Quite scientific with controlled ablations. Their blog refers to our 3-part series for theoretical analysis and guiding principles: Part 1: Analytical framework (TV
3
1
29
@yacinelearning
Yacine Mahdid
14 days
very interesting thread on RL training/inference mismatch and how it impacts performance. pretty relevant given the direction RL finetuning is taking
2
1
28
@basetenco
Baseten
14 days
Baseten is proud to support training jobs for The LLM Data Company 💪
0
3
16
@bertgodel
daanish khazi
14 days
We want to acknowledge @fengyao1909, @LiyuanLucas, Jiacai Liu, @RichardYRLi, @QPHutu, and many others for providing the foundation for this work
0
0
9
@bertgodel
daanish khazi
14 days
(4/5) Token-TIS remained stable throughout training. ESS stayed above 0.9, and it consistently outperformed both no-correction and Seq-TIS across all valid sampling configurations. Seq-TIS collapsed in every long-horizon run. 80-90% of trajectories received negligible weights.
1
0
13
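A minimal sketch of the ESS figure mentioned in (4/5), assuming it refers to the normalized effective sample size (sum of weights squared, divided by n times the sum of squared weights). The post does not spell out its exact definition, so the formula choice, the function name, and the toy weights below are all illustrative assumptions.

```python
def normalized_ess(weights):
    """Normalized effective sample size in (0, 1].

    Assumed definition: (sum w)^2 / (n * sum w^2).
    A value of 1.0 means all weights are equal; values near 1/n mean
    a single sample dominates.
    """
    n = len(weights)
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return (total * total) / (n * total_sq)


# Toy weights (hypothetical): nearly uniform, as with a stable correction.
stable_weights = [0.98, 1.0, 1.02, 0.99]

# Toy weights (hypothetical): one sample dominates, the rest are negligible.
collapsed_weights = [1.0] + [0.001] * 9

ess_stable = normalized_ess(stable_weights)
ess_collapsed = normalized_ess(collapsed_weights)
```

Under this definition, nearly uniform weights give an ESS close to 1, while a batch in which most weights are negligible gives an ESS close to 1/n.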
@bertgodel
daanish khazi
14 days
(3/5) Importance Sampling Corrections in Long Horizons: Token-level IS reweights each token separately without accounting for context or path dependence so it suffers bias. Sequence-level IS solves this by reweighting entire trajectories but suffers from variance as it takes
1
0
14
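A rough sketch of the contrast drawn in (3/5), assuming "TIS" means truncated importance sampling, where the ratio of trainer to rollout probability is capped at a constant. The function names, the cap value, and the per-token log-probabilities are hypothetical and chosen only to show how a small per-token mismatch compounds over a long sequence.

```python
import math


def token_tis_weights(train_logps, rollout_logps, cap=2.0):
    """One truncated ratio per token: min(pi_train / pi_rollout, cap)."""
    return [min(math.exp(t - r), cap) for t, r in zip(train_logps, rollout_logps)]


def seq_tis_weight(train_logps, rollout_logps, cap=2.0):
    """One truncated ratio per trajectory: product of per-token ratios, capped."""
    log_ratio = sum(t - r for t, r in zip(train_logps, rollout_logps))
    return min(math.exp(log_ratio), cap)


# Hypothetical 200-token trajectory where the trainer assigns each sampled
# token a log-probability 0.02 lower than the rollout engine did.
horizon = 200
rollout_logps = [-1.0] * horizon
train_logps = [-1.02] * horizon

token_weights = token_tis_weights(train_logps, rollout_logps)
seq_weight = seq_tis_weight(train_logps, rollout_logps)
```

In this toy case every token-level weight stays near 0.98, while the sequence-level weight is roughly exp(-4), about 0.018, because the per-token log-ratios add up across the horizon.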
@bertgodel
daanish khazi
14 days
(2/5) Sampling Settings and Mismatch: Setting temperature to less than 1 (often recommended) concentrates mass on head tokens, amplifying disagreement about which token is most likely. In our experiments, removing any scaling (i.e., temp=1.0) was the most stable operating point.
1
0
16
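A small illustration of the temperature effect described in (2/5), using made-up logits. It assumes the trainer and the rollout engine produce slightly different logits for the same two candidate tokens and shows that dividing by a temperature below 1 widens the resulting probability ratio. All names and numbers are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: the two engines disagree slightly on the top token.
train_logits = [2.0, 1.9]
rollout_logits = [1.9, 2.0]


def top_token_ratio(temperature):
    """Ratio of trainer to rollout probability for token 0."""
    p_train = softmax_with_temperature(train_logits, temperature)
    p_rollout = softmax_with_temperature(rollout_logits, temperature)
    return p_train[0] / p_rollout[0]


ratio_at_1 = top_token_ratio(1.0)
ratio_at_half = top_token_ratio(0.5)
```

For this symmetric two-token example the ratio works out to exp(0.1 / temperature), so it is about 1.105 at temperature 1.0 and about 1.221 at temperature 0.5: the same logit gap produces a larger probability mismatch once the distribution is sharpened.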
@nearcyan
near
1 month
imo entire ai field switched from explore to exploit 2 yrs early
71
52
2K
@_WEEXIAO
Kasey Zhang
2 months
5/ @jayendra_ram + @seeklis from @hud_evals and @joeybesgen from TLDC discussed how to build and scale RL environments. Link to slides: https://t.co/H7MqgHSDwG TL;DR: RL environments have a "Goldilocks" level of difficulty - hard enough to hill climb on, but not too hard
2
5
15
@jayendra_ram
Jay
2 months
We're going to teach people the best practices when making RL environments at @ycombinator this Saturday. If you're interested in the space and want to see the potential use cases of RL environments, come by! https://t.co/7rm3P5o4s5
@_WEEXIAO
Kasey Zhang
2 months
@hud_evals (@jayendra_ram, @seeklis) and The LLM Data Company will break down how to build + scale RL environments:
4
5
45
@_WEEXIAO
Kasey Zhang
2 months
We've raised $7M to help companies build AI agents that actually learn and work. @Osmosis_AI is a platform for companies to fine-tune models that outperform foundation models with reinforcement learning. Better, faster, and cheaper.
137
91
640
@ycombinator
Y Combinator
2 months
Fernstone is an AI-powered insurance brokerage for businesses. Their platform handles the back and forth with carriers, autonomously files paperwork, and instantly responds to customer requests. Congrats on the launch, @lukebbutton, @James_R_Chen, and @BryantLe18!
19
16
104
@_jasonwei
Jason Wei
5 months
New blog post about asymmetry of verification and "verifier's law": https://t.co/bvS8HrX1jP Asymmetry of verification–the idea that some tasks are much easier to verify than to solve–is becoming an important idea as we have RL that finally works generally. Great examples of
54
247
2K
@willdepue
will depue
5 months
just one more environment. please man just one more environment. trust me the model will generalize then man. please man just one more environment. once we have robotics data. ok maybe logic puzzles. ok maybe mario kart. then we’ll be agi bro trust me bro please bro
8
3
202
@bertgodel
daanish khazi
6 months
👀
@TechCrunch
TechCrunch
6 months
11 startups from YC Demo Day that investors are talking about | TechCrunch
2
1
23
@_jasonwei
Jason Wei
6 months
RL environment specs are among the most consequential things we can write as AI researchers. A relatively short spec (e.g., <1000 words of instructions saying what problems to create and how to grade them) often gets expanded either by humans or via synthetic methods into
13
35
444
@kalomaze
kalomaze
6 months
@corbtt
Kyle Corbitt
6 months
We don't even know how to train humans on long horizon tasks with sparse reward signals effectively.
3
11
190