Learning Q* with
+ poly-sized exploratory data
+ an arbitrary Q-class that contains Q*
...has seemed impossible for yrs, or so I believed when I talked at
@RLtheory
2mo ago.
And what's the saying? Impossible is NOTHING
Exciting new work w/
@tengyangx
! 1/
after consulting my colleagues, I decided to make my 598 lectures publicly available. The video links can be found on the course website or from this list (). We just started proofs of VI and PI; check it out if you are interested in a stat theory of RL!
Alekh,
@ShamKakade6
and I have a (quite drafty) monograph on rl theory . I am also teaching a phd seminar course on this topic (w/ recordings): ; just did 1st lec 2h ago! still figuring out if I can share the videos publicly...
I received the NSF CAREER award. Each submission was a month+ effort and I'm glad I got it the 2nd time.
Also, the detailed reviews & the process were not as delightful as the decision. Some experience & thoughts below: 1/
The entirety of RL theory is built on objects like V^π, Q*, π*, T (Bellman up. op.), etc... until you realize that this foundation is quite shaky. Spoiler: no big deal (yet), but thinking thru this is super useful for resolving some confusions. (1/x)
once
@ylecun
told me (heavily paraphrased), it's not F=ma but \min (F-ma)^2. I didn't realize its importance at the time, but it is perhaps the most enlightening perspective I've ever heard.
this paper got Outstanding Paper Award! Congrats to my coauthors (esp. Ching-An and Tengyang). More reasons to check out the details!
List of all paper awards:
Tmr
@icmlconf
2:15pm R301, Ching-An will present our ATAC alg: w/ a clever transformation by PD lemma, we turn initial-state pessimistic term from our prior work into *relative* pess and smoothly bridge IL & offline RL, with robust improvement guarantees.
@thienan496
@quocleix
@Miles_Brundage
@mpd37
We have a monograph on deep reinforcement learning () which covers some of the recent work. Otherwise, much of the non-deep RL work is theory, in which case I am not the expert but perhaps
@nanjiang_cs
has suggestions.
Paper I've wanted to share for a while: model-free RL w/o value fns, but w/ *density estimators*! Featuring very unique *double-chain* error induction to overcome seemingly inevitable error exponentiation. Jt w/ students Audrey Huang and Jinglin Chen 1/
As semester draws to end, I want to share this *identity* (h/t
@tengyangx
) that connects so many fundamental pieces of the RL theory together: optimism, pessimism, policy opt, proved by PD lemma + Bellman-error telescoping, all in one equation! 1/3
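For context, the two standard ingredients named above, in generic notation (the actual one-equation identity is in the linked thread, not reproduced here):

```latex
% Performance difference lemma (Kakade & Langford):
J(\pi) - J(\pi')
  \;=\; \frac{1}{1-\gamma}\,
        \mathbb{E}_{(s,a)\sim d^{\pi}}\!\left[A^{\pi'}(s,a)\right].

% Bellman-error telescoping: for any function f and policy \pi,
f(s_0, \pi) - J(\pi)
  \;=\; \sum_{t\ge 0} \gamma^t\,
        \mathbb{E}_{(s_t,a_t)\sim \pi}\!\left[f(s_t,a_t) - (\mathcal{T}^{\pi} f)(s_t,a_t)\right].
```

Both hold for arbitrary f and policies; the telescoping follows by expanding \mathcal{T}^{\pi} f and cancelling successive \gamma^t f terms.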
In a few years the next gen of young researchers will find you all weird using the word "agent" in RL as it is supposed to be a dedicated terminology for LLM agents 🫠
ICML results out! 3/4 acc (congrats to students; thread later).
And
@tengyangx
eventually got a rejection after all. I was worried about whether I should graduate him, like c'mon, how can a PhD be complete w/o rejections 😜.
Now such a relief 😆
had a very intriguing conversation w/ Alyosha Efros who visited us. agreed on many issues but also debated quite a bit on "RL tests on training data hence overfits". thought it's a good time to organize my thoughts on this... bottom line: the statement is wrong if taken literally.
2 papers accepted to
#icml2020
! the MWQL one I have tweeted about quite a bit b4. the other one is an interesting connection between variance reduction in IS for OPE and that in PG—guess what, they are the same thing! w/ my student Jiawei Huang. congrats Jiawei!
@neu_rips
the 1001st way to derive PG (originally by Jie Tang &
@pabbeel
here ). turns out you can also derive its entire var reduction family this way... and a new estimator that subsumes most previous ones pops up in this process!
Our ICML paper () is online! We revisit core assumptions in the analysis of batch RL (ADP) algorithms and ask whether they are inevitable & hold in interesting settings. 1/x
The densest paper I've written in a while: . My fav part is how a new world pops up when you swap the roles of importance weights & value functions in the "breaking the curse of horizon" method (Liu,
@LihongLi20
et al) 😲 (1/x)
super lucky to get this tea with a mere 30min wait (peak wait time can be ~4h). pretty sure any other milk tea will feel basically tasteless for a while…
An often confused point:
Worst-case regret minimization & return maximization are 𝐧𝐨𝐭 the same in offline RL!
This is perhaps retrospectively obvious (see🧵below), but do you know there are 𝐢𝐧𝐟𝐢𝐧𝐢𝐭𝐞𝐥𝐲 𝐦𝐚𝐧𝐲 alternatives to regret min and return max? 1/x
that feeling of "ok, I'm now considered someone who gives ok talks" when your advisor, who used to stop students' practice talks (mine no exception) within the first 3 slides, praises your presentation 😅
can't thank Satinder enough tho for the communication skills I learned from him
#icml2020
causal RL tutorial is interesting! quick notes: (1) combine confounded offline data + online exploration: identify the lower/upper bound of treatment effect from offline data and use it to refine model space (keep those whose predicted effect is in range).
Coverage is the core concept in offline RL, and in MDPs we use state density ratios… but what is the right concept for POMDPs?
Extremely proud of this ICML *rejection* where we discover the right coverage condition for model-free OPE in POMDPs!
1/
Causal inf community: am I missing something super basic? Claim: if behavior/logging policy only depends on observables, then there is no confounding whatsoever, no???
rev claims that reward depending on the latent state creates confounding. AC doubles down and further claims following
Will
#neurips
provide free reg & hotel for top reviewers?
@kchonyc
My student Jinglin Chen is a top reviewer (his *3rd* (!) reviewer award at neurips), has a 1st-author paper at main conf, and is not given travel award 🙃
writing teaching statement for 3rd yr rev. thought it'd be painful and useless. turns out it brought up nice memories I'd like to share!
in fa18, when I taught RL thry the 1st time, a student frequently challenged me like: "practice doesn't work acc to ur thry. is this really relevant?" 1/
moved to new house on thursday and had no internet. At some point I was prepared to give the
@RLtheory
talk with my neighbor’s wifi in the yard, holding an umbrella as sun shade... now that the internet is fixed, I’m sry u guys will miss that fun part :P
En route to
#ICML2023
. My first in-person big conf ever since the pandemic. Last time I did this I had just started my faculty job and still felt like a PhD student :) Looking fwd to meeting old & new friends in RL & its theory. Happy to chat and you can also find me at the posters.
After yrs I am finally gonna teach regret min in linear MDPs properly...!
A long note, but most of it is "tech prep" on topics of relevance outside RL (eg elliptical potential). The core analysis is surprisingly short: merely *2 pgs* (excl standard covering arg)!
paper accepted to neurips! we are also changing terminology ("confidence interval" to "value interval") to avoid possible confusions pointed out by the reviewers.
Previously we split minimax OPE into 2 styles (value-learning, in addition to existing weight/ratio-learning), and now it's time to merge them back---a surprising byproduct when we try to quantify bias and relax realizability of these methods: (1/3)
Still, the most insightful slide in all artificial intelligence introductions, if you ask me
(From David Silver's 2015 Introduction to Reinforcement Learning)
prepping for a lecture on LP for MDPs and shocked: it is said the dual constraint characterizes the occupancies of all stat policies, and I was always under the impression that non-stat/history-dep policies might induce occupancies outside this space.
turns out… no?? (1/x)
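For reference, the constraint set in question (discounted, finite MDP; my notation, with initial distribution ν and discount γ) is the occupancy-measure polytope:

```latex
% Dual LP feasibility set over occupancy measures d(s,a):
d(s,a) \ge 0 \quad \forall s,a, \qquad
\sum_a d(s,a)
  \;=\; (1-\gamma)\,\nu(s)
  \;+\; \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a')
  \quad \forall s.
```

The classical resolution: any feasible d is exactly the normalized occupancy of the stationary policy π(a|s) = d(s,a) / Σ_{a'} d(s,a'), so even history-dependent policies cannot induce occupancies outside this polytope.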
Re planning w/ a representation learned w/o reconstruction loss:
The discussion (not specifically here, but more general in the community) will be so much more informed if everyone knows what a bisimulation is.
Yann is advocating Model predictive control in a latent space , which is learned without a reconstruction loss, as a way to solve planning, and get truly controllable behavior. I agree.
recent paper accepted to
#UAI2020
w/ my student
@tengyangx
, on how Bellman error minimization style algorithms for learning Q* save you a factor of horizon in error prop and give you a more straightforwardly defined concentrability coeff compared to AVI.
General learnability conditions for offline & online RL have been better understood in recent years, tho mostly in parallel. In , we show an interesting connection: the good-old "concentrability" from offline RL implies online learnability!
A harsh piece of advice I got during my PhD: "No one is obliged to read a poorly written doc unless you proved P!=NP."
Write the draft, let it sit for a while, read it and edit the parts that are nonsense to ppl other than the authors, and iterate a couple times before submission.
...seems a luxury these days?
after getting stuck on q1 for ~2 weeks, found a surprisingly simple & elegant proof: see bottom of
All other ans (incl. in a diff thread) are complicated with unknown dim-dependent const, while this is a few lines & elementary. yet almost no upvotes???
Concentration ineq twitter(?): in the setting of linear reg (X in R^d, Y in R, Σ=E[XX^T], ||X|| and |Y| bounded), I want to bound the estimation errors of the plug-in estimators for
1. Σ^{½}
2. Σ^{-½} E[XY]
w/o paying σ_min(Σ) or alike. Pointers plz (ideally ready to use...)!
My talk at MSR is online now! on our findings and open problems in figuring out minimal assumptions that enable theoretical guarantees for RL. talk was meant to offer a minimalist view of RL accessible to learning theoreticians or even TCS audience. (1/2)
me: I _really_ need to start writing this offline RL theory survey I agreed to.
Also me:
* get into rabbit hole with authors in ICML AC batch
* play with OPE code
* tweak visualization until satisfaction
🫠
I am telling this to many ppl recently, that I can't believe I missed this technical point for so long...
What's the right notion of coverage in linear MDP? Poll below!
A thread that discusses the nuances, connections to OOD/mean matching, and subtle (open?) questions... 1/
I am co-organizing an ICERM virtual workshop on theory and algos for Deep RL on Aug 2-4, with Sanjay Shakkottai, R Srikant, and Mengdi Wang. You can check out the line-up of speakers & the tentative schedule and register for the event at:
Prospective student interested in RL4edu said he's scared of meeting w/ me in 2 ways: b4, he thought I'd kick him out the moment he mentions the word "applied"; after, he's scared of my enthusiasm.
Oh am I SUPER HYPED when it’s RL for *real* x instead of RL for simulator of x. (1/x)
I am surprised by how many people showed up in the poster session for the DR-PG paper () and that we had an hour long in-depth discussion! (esp. given that I forgot to tweet about it...😂) thanks everyone, and this is an amazing night!
wonderful talks in the morning! what I *particularly* liked is that these talks not only tell you how well their methods worked, but also *when they will fail*, both by theoretical reasoning and simple and intuitive examples, which I feel is missing in many deep RL papers
Boarding flight to free company t-shirts… I mean NeurIPS. Happy to chat!
I mean seriously, don’t take all the t-shirts and leave some to me 🫠
@jasondeanlee
wrote this down more formally so that I can get it off my mind...
If you find the original tweets lack context/background but find the topic interesting, the note might be helpful
At CISS hearing nice talks on model-based RL. MBRL has the reputation of bad "error compounding", but I realized recently that its theoretical root may be different from what ppl think...
The problem may not be error accumulation over *time*, but the one-step error itself! 1/
went to grab a lunch box at visit day, and the volunteer looked at me and was like "hey, grad students that haven't signed up shouldn't steal the food here". glad that a staff member beside her recognized me as a faculty member... this has happened to me a couple of times already 😜🤣
It's intriguing to observe the use of REINFORCE in RLHF. REINFORCE is a classical algorithm utilized to estimate policy gradients for episodic Markov Decision Processes (MDPs). Another notable method is GPOMDP. While both are effective estimators, it's worth noting that neither
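As a refresher, here is a minimal REINFORCE sketch on a toy problem of my own choosing (a horizon-1 "episodic MDP", i.e., a bandit, where REINFORCE and GPOMDP coincide); the Monte-Carlo estimate is checked against the analytic policy gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical, for illustration): a 3-armed bandit with a
# softmax policy pi_theta(a) = softmax(theta)_a and deterministic rewards.
theta = np.array([0.2, -0.1, 0.4])
rewards = np.array([1.0, 0.0, 0.5])

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

p = softmax(theta)

# Analytic gradient of J(theta) = sum_a pi(a) r(a):
# dJ/dtheta_i = pi_i * (r_i - J).
J = p @ rewards
grad_true = p * (rewards - J)

# REINFORCE: grad J = E[ grad log pi(a) * r(a) ],
# where (grad log pi(a))_i = 1{i = a} - pi_i.
n = 200_000
a = rng.choice(3, size=n, p=p)
score = np.eye(3)[a] - p               # (n, 3) score functions
grad_hat = (score * rewards[a][:, None]).mean(axis=0)

print(grad_true, grad_hat)             # agree up to Monte-Carlo error
```

With horizon > 1, GPOMDP lowers variance over vanilla REINFORCE by multiplying each score term only with the rewards that come *after* it, without changing the mean.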
as
#icml2020
starts, I eventually got time to... catch up on the real-life RL conf I missed! among the amazing talks, I highly recommend by
@prasadNiranjani
. check out how various principled RL methods are adapted and integrated in a medical scenario!
@ylecun
By RL do you mean
1. Current algorithms in RL
2. Current problem paradigms in RL research
3. RL as a problem formulation?
I’d think world models that you advocate for are captured in 3
seeing other tweets about Lean recently on how difficult it is to formalize proofs in Lean, and was wondering whether LLMs can help… and then saw this..!
How long until we have software that can just take one of my papers and turn its proof into Lean? 🤔
Launching Lean Co-pilot for LLM-human collaboration to write formal mathematical proofs that are 100% accurate. We use LLMs to suggest proof tactics in Lean and also allow humans to intervene and modify in a seamless manner.
Automating theorem proving
every semester I teach the RL thry course, crazy restructuring ideas come to mind. like getting rid of the tabular learning section (that's just a special case of Tf in F for all f… right??) or of neutral DP algs (don't optimistic/pess algs basically cover all use cases…?)
yes yes dimensional analysis
In RL, think every reward / value function has a $ sign. make sure to cancel them out cleanly in your sample complexity expressions etc
just received this today! T-shirt for 50th anni of dept of automation in Tsinghua. The only thing is that the design of the T-shirt is quite... simplistic...
speaking of which, can schools make the “top X%” q’s optional? I never look at them at recruiting and it is such a pain when I have to submit a letter to ~20 places. I always tell students not to worry about # apps, but hey grad programs plz make it easier 4 all of us.
This is the time of year that I apologize to all fellow faculty about Waterloo's absolutely atrocious recommendation letter submission system. I hate it too.
I'm not at
@iclr_conf
, but Phil will present our spotlight poster in a few hours. Come see online RL using density ratios---which are **not even well defined**😱 b4 exploratory data is collected, plus very cool **black-box** online-to-offline reduction!
Envy all of you at NeurIPS! While I'm not there*, my students will present their 1st-author works Wed/Thu. Please stop by their posters if interested! I will tweet about each paper when it gets close to the session.
In some cases we probably need to ask whether individual states are physically meaningful at all. This totally shook my basic understanding of RL from when I was a grad student. If u had similar confusions, read the paper and let me know what u think (like, at ICML)! (6/end)
Still
#MBRL
:
@KaiqingZhang
's
#ciss
talk touched on MuZero loss. I happened to have discussed its issues w/ someone else recently:
1. Wrong model can get lower loss than true model in stoch env.
2. Even in dtmn env, dist shift can be exponentially bad!
Detail & proof in 🧵 1/
I find myself often referring to tweets like this, and it can be hard (even for me) to find them, so I decide to set up a page:
Blogposts are probably much better for dissemination (some of my tweets are not very readable...), but I'm too lazy 🫠
Robust MDP folks:
(1) How common is the computational step of finding the worst-case transition against a given policy?
(2) Have you seen algs that run natural policy gradient (i.e., state-wise mirror descent) on the Q-fn from the worst-case transition? (1/3)
Perspective: You are the AC and need to write a meta review. You don’t have time to read the paper, so you rely on the reviews. But all of the reviews are this template. What do you do?
yup, conf is weird when u attend for the 1st time, and becomes fun as u make friends. my own story: phil thomas and I met at icml-15 as student volunteers and we talked endlessly at the reg desk, as back then it was quite difficult to find an RL person to talk to :) (1/2)
Many years ago, my late PhD advisor John Riedl said: "Not to worry about not knowing anybody at conferences, because if you keep going, they will all become your friends."
Great advice. Gratefully true, as some confs are now inviting me to give talks to my friends.
#PhDChat
I had the impression that we would be able to upload new figures during
#NeurIPS
rebuttal (probably b/c other confs using openreview allow pdf updates?). the char limit is generous and there will even be rolling discussions, so why limit the response format to text only?
most interesting paper to me today: seems a very nice middle-ground between importance sampling (exp variance) and model-based (bias amplified by horizon). In their method, you need func approx similar to model-based, but impact of bias is much milder
daughter’s new fav bedtime story is 唐诗, poems from Tang dynasty.
hard to describe the feeling of reading a very well-known poem to her, which depicts a place exactly next to where I went to primary school, now in a place thousands of miles away
@swetaagrawal20
sometimes you find related content in the appendix. Also, a legitimate reason may be that ppl typically spend less time rigorously verifying that what failed *really* didn't work, so this experience may not stand up to the same level of scrutiny as the main (positive) results
caught the tail of
@svlevine
's talk and caught up w/
@tengyuma
's and
@EmmaBrunskill
's videos. great to see a focused discussions on pessimism for off-policy RL! looking forward to the afternoon sessions starting w/
@ofirnachum
Next week,
@marcgbellemare
and I are organizing a Deep RL workshop as part of Simons Institute's Theoretical RL program, with a great lineup of speakers. All talks will be recorded, and can be viewed live on YouTube channel. See for more details!
the really neat part of this work is the "traj simulator" that shows that off-policy Monte-Carlo (which we usually associate w/, say, imp sampling) can be very sample-efficient, at least in the tabular setting. (1/3)
2/ This was the COLT 2018 open problem from
@nanjiang_cs
and Alekh, who conjectured a poly(H) lower bound. New work refutes this, showing only logarithmic in H episodes are needed to learn. So, in a minimax sense, long horizons are not more difficult than short ones!
The connection between symmetry and conservation laws discovered by Emmy Noether is so beautiful and profound, and was completely eye-opening to me when I thought I had enough exposure (as an outsider) to modern physics.
HAPPY BIRTHDAY EMMY NOETHER! Perhaps the greatest woman of mathematics of all time, colleague of Einstein and profoundly influential scientist who basically invented abstract algebra, Noether's legacy is massive. Learn about her below!
It’s RL problem with a specific kind of structure: rand init state + deterministic transitions. You can also view it as SL with a huge label space (and thus reduction to RL makes sense).
And guess what, this exact setting is called structured prediction
If we view LLMs in an RL way, then outputs are just rollouts from the policy. The 'prompt' is just the initial state, it encodes the starting condition of the agent and, implicitly, a goal. Prompt engineering == finding a good initial state for the agent to achieve your goal.
**
Workshop, TTIC, July 13-15th: Online decision-making and real-world applications
**
-) Why is it challenging to deploy online decision-making alg. in real-world problems?🤨
-) Which models describe these challenges?🤔
-) What is the path towards making RL practical?😲
In case anyone who read it is still wondering: see the simple lemma below (h/t Akshay Krishnamurthy).
z = c log p/p* for different c (+-1, +- 1/2 etc) gives different useful results.
and apparently log(p/p*) can only be way better? Hence the question in the quoted thread.
My attempt so far: p/p* is non-neg & E[p/p*] = 1, so by Markov ineq: Pr[p/p* >= t] <= 1/t
So Pr[log(p/p*) >= t] <= exp(-t)
ok sub-exponential instead of sub-gaussian 🫠
thoughts? 5/end
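A quick empirical sanity check of the sub-exponential tail above, with a concrete pair of Gaussian densities of my choosing (p = N(0,1), p* = N(0.5,1), sampling X ~ p*, so that E_{p*}[p/p*] = 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# X ~ p* = N(0.5, 1); p = N(0, 1). Since E_{p*}[p/p*] = 1, Markov's
# inequality gives Pr[p/p* >= e^t] <= e^{-t}, i.e.
# Pr[log(p/p*) >= t] <= e^{-t}: a sub-exponential upper tail.
n = 1_000_000
x = rng.normal(0.5, 1.0, size=n)
log_ratio = (-(x**2) / 2) - (-((x - 0.5) ** 2) / 2)   # log p(x) - log p*(x)

for t in [0.5, 1.0, 2.0]:
    emp = (log_ratio >= t).mean()
    print(t, emp, np.exp(-t))          # empirical tail vs. the e^{-t} bound
```

The normalizing constants of p and p* cancel in the log-ratio here since both are unit-variance Gaussians, which is why only the exponents appear.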
speaking of the noisy TV, I've always had a problem: isn't *non-noisy* TV more problematic? after all, pure noise is unpredictable; in contrast, things like TV shows are somewhat predictable yet highly complex, which will distract learning algs into spending resources predicting them
just gave a virtual talk in this (in-person) workshop at RLDM. pity that I couldn't go, and thx organizers for being accommodating! looking fwd to the panel shortly
(the talk is on a 4-yr-old paper w/ 3 conf rej & 1 journal desk rej 😜)
gave a tutorial on rl thry virtually @ NUS yesterday & enjoyed the interaction w/ the audience. Also got my fav questions: once FQI/E are intro'd, ppl start asking about convergence for SGD under convexity etc
TD: hold my *divergence* under inf data and 1-d realizable linear features 😎
Online RL agents learn by trial-and-error—not an option for tasks like training self-driving cars. Learn how Microsoft researchers used game theory to design offline RL algorithms that can learn good policies with state-of-the-art empirical performance:
diffusion model twitter: suppose you have N candidate conditional generation models P_i(image|prompt) with only sampling access (ie no explicit probabilities). How do you do model selection on holdout data???
Eyeball???