
Dylan Foster 🐢
@canondetortugas
Followers: 3K · Following: 425 · Media: 46 · Statuses: 262
Foundations of RL/AI @MSFTResearch. Previously @MIT @Cornell_CS. RL Theory Lecture Notes: https://t.co/bhgL3aKIk0
Joined January 2012
Now that I have started using twitter somewhat regularly, let me take a minute to advertise the RL theory lecture notes I have been developing with Sasha Rakhlin: https://t.co/x16aGvE4tr
Really nice set of results from Yuda and Dhruv! Great step toward a deeper understanding of the tradeoffs of sim-to-real transfer
🤖 Robots rarely see the true world's state—they operate on partial, noisy visual observations. How should we design algorithms under this partial observability? Should we decide (end-to-end RL) or distill (from a privileged expert)? We study this trade-off in locomotion. 🧵(1/n)
7/ For post-training, we compare the test-time sample efficiency improvement for pass@256 of RepExp over GRPO (blue) and Unlikeliness (orange), an exploration baseline. RepExp is 2.1-4.1x more sample efficient than Unlikeliness and 3.2-13.4x more sample efficient than GRPO.
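(For context on how pass@k numbers like these are typically computed: below is the standard unbiased pass@k estimator from Chen et al., 2021. This is the textbook formula, not code from the paper, and the example numbers are made up.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k samples, drawn from n generations of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up numbers: 256 samples per problem, 12 correct -> pass@8 estimate
print(round(pass_at_k(n=256, c=12, k=8), 3))
```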
Really excited about this new paper with Jens! I believe exploration (beyond being a topic that is close to my heart) is a super promising direction for language modeling as we look toward systems/agents that can design their own data
Can the knowledge in language model representations guide the search for novel behaviors? We find that exploration with a simple, principled, representation-based bonus improves diversity and pass@k rates for inference-time and post-training!
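(As a concrete point of reference, here is one standard "principled, representation-based" exploration bonus, an elliptical bonus computed on representation vectors; this is a hedged sketch for illustration and not necessarily the exact bonus used in the paper.)

```python
import torch

def elliptical_bonus(phi: torch.Tensor, history: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """Illustrative elliptical bonus sqrt(phi^T (lam*I + sum_i h_i h_i^T)^{-1} phi):
    large when the candidate's representation phi points in a direction not yet
    covered by previously sampled generations. Hypothetical sketch, not the
    paper's exact objective.

    phi:     (d,) representation of a candidate generation.
    history: (N, d) representations of generations sampled so far.
    """
    d = phi.shape[0]
    cov = lam * torch.eye(d) + history.T @ history   # regularized feature covariance
    bonus_sq = phi @ torch.linalg.solve(cov, phi)    # phi^T cov^{-1} phi
    return torch.sqrt(bonus_sq)

# Usage sketch: rank candidates by reward + beta * elliptical_bonus(...) when sampling.
```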
9/ With fantastic collaborators Dylan Foster (@canondetortugas), Akshay Krishnamurthy, and Jordan Ash (@jordan_t_ash).
Paper: https://t.co/Q6rzJO9bMf
Website: https://t.co/8gmGNvO9IK
Code: coming soon!
arxiv.org
Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those...
With awesome team: Dhruv Rohatgi, Abhishek Shetty (@AShettyV), Donya Saless (@DonyaSaless), Yuchen Li (@_Yuchen_Li_), Ankur Moitra, and Dylan Foster (@canondetortugas). Dhruv and Yuchen are both on the (postdoc & job) market this year --- grab them while you can!!
I have been thinking a lot recently about framing a variety of inference-time tasks as doing algorithm design with access to strong oracles (e.g. generators, different types of verifiers, convolved scores, ...) --- as an alternative to "end-to-end" analyses.
New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
With amazing team: Dhruv Rohatgi, Abhishek Shetty (@AShettyV), Donya Saless (@DonyaSaless), Yuchen Li (@_Yuchen_Li_), Ankur Moitra, and Andrej Risteski (@risteski_a). Paper link:
arxiv.org
Test-time algorithms that combine the generative power of language models with process verifiers that assess the quality of partial generations offer a promising lever for eliciting new reasoning...
Lots of interesting directions here! We think there is a lot more to do building on the connection to the discrete sampling/TCS literature and algos from this space, as well as moving beyond autoregressive generation.
Empirically, we have only tried this at small-ish scale so far, but find consistently that VGB outperforms textbook algos on either (1) accuracy or (2) diversity when compute-normalized. Ex: for the Dyck language, VGB escapes the accuracy-diversity frontier of the baseline algos.
Main guarantee:
- As long as you have exact/verifiable outcome rewards, always converges to the optimal distribution.
- Runtime depends on process verifier quality, gracefully degrading as quality gets worse.
VGB generalizes the Sinclair–Jerrum '89 random walk (https://t.co/hTjxI5W2NA) from TCS (used to prove equivalence of apx. counting & sampling for self-reducible problems), linking test-time RL/alignment with discrete sampling theory. We are super excited about this connection.
We give a new algo, Value-Guided Backtracking (VGB), where the idea is to view autoregressive generation as a random walk on the tree of partial outputs, and add a *stochastic backtracking* step—occasionally erasing tokens in a principled way—to counter error amplification.
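(A minimal sketch of the idea as described in the tweet, with a hypothetical model/verifier interface and an illustrative backtracking rule; the paper's actual algorithm and backtracking probabilities will differ.)

```python
import random

def vgb_sketch(model, verifier, max_len=128):
    """Illustrative sketch of value-guided backtracking (hypothetical interface,
    not the paper's exact algorithm). Generation is a walk on the tree of
    prefixes: usually step forward by sampling a token, but occasionally step
    back (erase the last token), more often when the process verifier assigns
    the current prefix a low value."""
    prefix = []
    while len(prefix) < max_len and not model.is_terminal(prefix):
        value = verifier.score(prefix)               # assumed to lie in [0, 1]
        backtrack_prob = 0.5 * (1.0 - value)         # illustrative rule, not the paper's
        if prefix and random.random() < backtrack_prob:
            prefix.pop()                             # stochastic backtracking step
        else:
            prefix.append(model.sample_next(prefix)) # forward step of the random walk
    return prefix
```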
Test-time guidance with learned process verifiers has the potential to enhance LLM reasoning, but one of the issues with getting this to actually work is that small verifier mistakes are amplified by textbook algos (e.g., block-wise BoN), w/ errors compounding as length increases.
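(For contrast, a hedged sketch of the block-wise best-of-N baseline mentioned above, again with a hypothetical model/verifier interface; it illustrates why greedy, non-revocable choices let per-block verifier errors compound with length.)

```python
def blockwise_best_of_n(model, verifier, num_blocks, n=8):
    """Sketch of block-wise best-of-N (textbook baseline, hypothetical interface):
    at each step sample n candidate blocks and greedily keep the one the process
    verifier ranks highest. A greedy choice can never be undone, so if the
    verifier keeps a bad prefix with probability eps per block, the chance of
    staying on a good path decays roughly like (1 - eps) ** num_blocks."""
    prefix = []
    for _ in range(num_blocks):
        candidates = [prefix + model.sample_block(prefix) for _ in range(n)]
        prefix = max(candidates, key=verifier.score)  # greedy: no backtracking
    return prefix
```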
Consistency models, CTMs, shortcut models, align your flow, mean flow... What's the connection, and how should you learn them in practice? We show they're all different sides of the same coin connected by one central object: the flow map. https://t.co/QBp1kELVhF 🧵(1/n)
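(For readers unfamiliar with the term, the usual definition of the flow map of a probability-flow ODE with velocity field v is given below; this is standard notation added for context, and the thread's own notation may differ.)

```latex
% Flow map X_{s,t}: transports x along the ODE dx/dt = v_t(x) from time s to time t.
% Consistency/shortcut/mean-flow-style models can be read as learning X_{s,t} directly
% rather than integrating v step by step.
\[
  \frac{\partial}{\partial t}\, X_{s,t}(x) \;=\; v_t\bigl(X_{s,t}(x)\bigr),
  \qquad X_{s,s}(x) \;=\; x .
\]
```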