Anirudh Buvanesh
@AnirudhBuvanesh
Followers: 69 · Following: 33 · Media: 1 · Statuses: 18
Ph.D. student @Mila_Quebec. Ex @MSFTResearch, @salesforce
Montreal, Quebec
Joined October 2022
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test time? Introducing Compress & Attend Transformers (CATs), which give you exactly this! 🧵(1/n) 👇
1/3 🥳 Excited to share our new paper "Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents"! Project your features onto a product of simplices: sparse, stable reps, stronger grads, faster learning. 🧵 For more details, check out Pablo's thread 👇
Simplicial Embeddings (SEMs) Improve Sample Efficiency in Actor-Critic Agents: In our recent preprint we demonstrate that the use of well-structured representations (SEMs) can dramatically improve sample efficiency in RL agents. 1/X
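As a rough, unofficial illustration of the "product of simplices" idea from the tweets above: split a feature vector into L groups of V dimensions and apply a softmax within each group, so every group lies on a probability simplex. The PyTorch sketch below assumes exactly that structure; the class name, arguments, and temperature default are mine, not the paper's implementation.

```python
# Minimal sketch of a simplicial embedding (SEM) layer, assuming the
# "product of simplices" structure described in the tweet: split a feature
# vector into L groups of V dimensions and apply a softmax within each group,
# so every group lies on a probability simplex (sparse, bounded activations).
# Names and defaults here are illustrative, not the paper's implementation.
import torch
import torch.nn as nn


class SimplicialEmbedding(nn.Module):
    def __init__(self, feature_dim: int, num_simplices: int, temperature: float = 1.0):
        super().__init__()
        assert feature_dim % num_simplices == 0, "feature_dim must split evenly into groups"
        self.num_simplices = num_simplices
        self.group_dim = feature_dim // num_simplices
        self.temperature = temperature

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, feature_dim) -> (batch, L, V): one row per simplex
        batch = features.shape[0]
        groups = features.view(batch, self.num_simplices, self.group_dim)
        # Softmax within each group projects it onto the probability simplex.
        simplicial = torch.softmax(groups / self.temperature, dim=-1)
        return simplicial.view(batch, -1)


# Usage: drop it between the encoder and the actor/critic heads.
encoder_out = torch.randn(8, 256)                       # hypothetical encoder features
sem = SimplicialEmbedding(feature_dim=256, num_simplices=32)
z = sem(encoder_out)                                    # same size, but each 8-dim group sums to 1
```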
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), and O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian thinker with a 96K thought budget, ~2X accuracy 🧵
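A back-of-the-envelope reading of the O(n) vs O(n^2) claim (my own sketch, not from the thread): attention cost per generated token grows with the current context, so letting the context grow with the whole chain of thought gives quadratic total compute, while a bounded, fixed-size context gives linear. The context cap below is a made-up number purely for illustration.

```python
# Rough compute comparison (my own illustration, not from the paper): attention
# cost per generated token is proportional to the current context length.
# Full-context thinking attends over everything generated so far -> O(n^2) total.
# Bounded-context ("Markovian") thinking keeps the context capped -> O(n) total.
def full_context_cost(n_tokens: int) -> int:
    return sum(t for t in range(1, n_tokens + 1))                       # ~ n^2 / 2

def bounded_context_cost(n_tokens: int, context_cap: int) -> int:
    return sum(min(t, context_cap) for t in range(1, n_tokens + 1))     # ~ n * cap

n, cap = 96_000, 8_000   # 96K thought budget; the 8K cap is a hypothetical choice
print(full_context_cost(n) / bounded_context_cost(n, cap))   # ~6x fewer attention ops at this length
```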
Work done with my amazing collaborator @bicycleman15. Excited to hear your thoughts! (4/4)
Which easy examples you add matters. Trivial ones don't help much. But you don't need to hunt for "perfect difficulty." Mixing all the easier instances you have usually works fine. We're releasing our hackable implementations at https://t.co/Mwge7elqnC. Check it out 👇 (3/n)
github.com · rl4reasoning/rl-baselines: RL reasoning baselines
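For concreteness, a hypothetical sketch of the data change this thread describes: keep the hard target instances and mix in easier instances of the same task, so that early rollouts occasionally earn nonzero reward. Function and field names here are made up for illustration; the released rl-baselines repo is the actual reference.

```python
# Hypothetical sketch of "mix in easier instances" for RL training data.
# The idea from the thread: if the policy never samples a correct answer on the
# hard instances, rewards stay at zero and there is no learning signal; adding
# easier instances of the same task provides early nonzero rewards.
# Function and field names below are illustrative, not from the released repo.
import random

def build_training_pool(hard_instances, easy_instances, easy_fraction=0.5, seed=0):
    """Return a shuffled pool with roughly `easy_fraction` easier instances."""
    rng = random.Random(seed)
    n_easy = int(len(hard_instances) * easy_fraction / (1 - easy_fraction))
    pool = list(hard_instances) + rng.sample(easy_instances, min(n_easy, len(easy_instances)))
    rng.shuffle(pool)
    return pool

# e.g. graph-search problems of varying path length (shorter = easier)
hard = [{"graph": g, "path_len": 8} for g in range(1000)]
easy = [{"graph": g, "path_len": 3} for g in range(5000)]
train_pool = build_training_pool(hard, easy, easy_fraction=0.5)
```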
We test this on the graph-search task from Bachmann et al. (2024). Dense rewards, diversity incentives, and improved credit assignment all underperform in our setting when the base model fails to sample correct answers. Mixing in easier instances helps unlock RL training. (2/n)
Zero rewards after tons of RL training? Before using dense rewards or incentivizing exploration, try changing the data. Adding easier instances of the task can unlock RL training. To know more, check out our blog post here: https://t.co/BPErVcLmP8. Keep reading 🧵(1/n)
spiffy-airbus-472.notion.site
Jatin Prakash* (NYU), Anirudh Buvanesh* (MILA) (* order decided through np.random.randint(2))
Thrilled to share our new work EARL! 1️⃣ An AR + RL image editing model that outperforms diffusion baselines w/ 5x less data. 2️⃣ First systematic SFT vs RL study in image editing: RL post-training shines on complex edits where paired data is scarce. See thread for details 👇
We built a new autoregressive + RL image editing model using a strong verifier, and it beats SOTA diffusion baselines using 5× less data. 🔥 EARL: a simple, scalable RL pipeline for high-quality, controllable edits. 🧵1/
Introducing a framework for end-to-end discovery of data structures: no predefined algorithms or hand-tuning needed. Work led by Omar Salemohamed. More details below. https://t.co/lFb2kn2NpE