Siddarth Venkatraman
@siddarthv66
Followers: 1K · Following: 1K · Media: 29 · Statuses: 486
PhD at Mila | Reasoning with RL, previously diffusion and flows
Montréal, Québec
Joined September 2023
NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
22 · 102 · 790
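To make the loop concrete, here is a minimal sketch of recursive self-aggregation as described above: keep a population of candidate solutions, then repeatedly sample small subsets and ask the model to aggregate each subset into an improved candidate. The `generate` callable, the population size `n`, the subset size `k`, and the round count are placeholders, not the paper's actual prompts or hyperparameters.

```python
import random

def rsa(problem, generate, n=16, k=4, rounds=3):
    """Recursive Self-Aggregation (sketch, hypothetical parameters).

    generate(prompt) -> str is a stand-in for any LLM sampling call.
    Start from n independent candidates, then for a few rounds replace
    each candidate with an aggregation of a random size-k subset.
    """
    # Round 0: independent candidate solutions.
    population = [generate(f"Solve the problem:\n{problem}") for _ in range(n)]

    for _ in range(rounds):
        new_population = []
        for _ in range(n):
            subset = random.sample(population, k)
            context = "\n\n".join(f"Candidate {i+1}:\n{c}" for i, c in enumerate(subset))
            prompt = (
                f"Problem:\n{problem}\n\n{context}\n\n"
                "Combine the correct parts of these candidates, fix their "
                "mistakes, and produce a single improved solution."
            )
            new_population.append(generate(prompt))
        population = new_population

    # Return the final population; a single answer can be picked downstream
    # (e.g., by majority vote over extracted final answers).
    return population
```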
For a long time, Yann LeCun and others believed in gradient-based planning, but it didn’t work very well … until now. Here’s how we did it using incredibly simple techniques. But first, an introduction to gradient-based planning: 🧵1/11
24 · 176 · 1K
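For readers new to the idea, here is a generic sketch of gradient-based planning, not the thread's actual method: roll a candidate action sequence through a differentiable world model, score the visited states with a cost, and update the actions themselves by gradient descent. `world_model` and `cost` are assumed placeholders.

```python
import torch

def plan(world_model, cost, s0, horizon=20, steps=100, lr=0.1):
    """Gradient-based planning (generic sketch, placeholder components).

    world_model(state, action) -> next_state and cost(state) -> scalar are
    assumed differentiable; the action sequence is optimized directly by
    backpropagating the summed rollout cost.
    """
    actions = torch.zeros(horizon, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)

    for _ in range(steps):
        s = s0
        total_cost = 0.0
        for t in range(horizon):
            s = world_model(s, actions[t])   # imagined rollout
            total_cost = total_cost + cost(s)
        opt.zero_grad()
        total_cost.backward()                # gradients w.r.t. the actions
        opt.step()

    return actions.detach()
```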
Total French cultural domination @MistralAI ?
'Clair Obscur: Expedition 33' won 9 awards at #TheGameAwards — the most for any game ever 🎮 • Game of the Year • Best Narrative • Best Game Direction • Best RPG • Best Art Direction • Best Score & Music • Best Performance • Best Debut Indie Game • Best Independent
0 · 0 · 1
If you win too much as the underdog, people turn on you. Def true with both AI and video games, at least
0 · 0 · 2
The real danger with this sort of journalism is that by using faulty/partially correct arguments to shit on AI, they lose the ability to criticize the tech seriously. And I do think there’s a lot to criticize.
0 · 0 · 7
as a reminder: @moorehn cannot generate knowledge. She cannot create knowledge. She cannot find new information. She can only mix information that has already been found and written and input into computers by other journalists who don’t understand AI.
as a reminder: AI cannot generate knowledge. It cannot create knowledge. It cannot find new information. It can only mix information that has already been found and written and input into computers by humans.
1 · 1 · 11
Want to highlight this: as someone who, just a few weeks ago, was convinced we needed to figure out value functions for LLM RL, my priors have shifted. LLMs might just have “implicit value functions” that already reduce effective variance
@JoshPurtell I am saying that explicit value learning could plausibly be interpreted (partially) as an artifact of how old RL theory was originally developed, when policy networks were *too small* to implicitly learn a notion or representation of what may or may not be a valuable action
0 · 0 · 31
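One concrete way to read “implicit value functions”: LLM policies are now commonly trained without any learned critic, using only a group-relative baseline over rollouts of the same prompt for variance reduction. Below is a minimal sketch of that critic-free baseline; it is a standard trick, not something claimed in the tweets above.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Critic-free advantages for a group of rollouts on one prompt (sketch).

    Instead of a learned value function, each rollout's baseline is the
    leave-one-out mean of the other rollouts' rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    baselines = (rewards.sum() - rewards) / (n - 1)   # leave-one-out mean
    return rewards - baselines

# Example: 4 rollouts of one prompt with binary verifier rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# ≈ [ 0.667 -0.667 -0.667  0.667]
```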
We’ll be presenting this at the FoRLM workshop between 10:15-11:30am in room 33 tomorrow! Drop by if you’d like to chat about this paper, or RL for LLMs in general (I’ve got some juicy new insights)
NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
3 · 8 · 29
Amazing work by Brian. Essentially contextualizes all the off-policy RL objectives you’ve probably seen recently
🧊 Off-policy RL for LLMs is hard. Dr. GRPO collapses at 10 steps off-policy. TBA doesn't. @Kimi_Moonshot K2's approach is robust too – both independently landed on the same key ingredients 🤝 We ablate RL recipe ingredients + show the 2 small changes giving off-policy
1 · 5 · 73
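For context on what these off-policy objectives have in common, here is the generic clipped importance-sampling surrogate most of them build on. This is a textbook PPO-style sketch, not Dr. GRPO, TBA, or K2's specific recipe.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2):
    """Generic clipped surrogate over tokens (PPO-style sketch).

    logp_new:    per-token log-probs under the current policy (requires grad)
    logp_old:    per-token log-probs under the behavior policy that generated the rollout
    advantages:  per-token (or broadcast per-sequence) advantage estimates
    """
    ratio = torch.exp(logp_new - logp_old)                     # importance weight
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic minimum keeps updates bounded when the policies drift apart.
    return -torch.minimum(unclipped, clipped).mean()
```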
(1/5) New post: "Mismatch Praxis: Rollout Settings and IS Corrections". We pressure-tested solutions for the inference/training mismatch, which in modern RL frameworks creates a hidden off-policy problem. To resolve the mismatch, various engineering (e.g., FP16
6 · 41 · 120
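As a rough illustration of the kind of correction the post discusses, here is a hedged sketch of truncated importance sampling between the inference engine's reported log-probs and the trainer's recomputed log-probs. The function name, the `cap` value, and how the weights get applied are assumptions, not the post's actual recipe.

```python
import torch

def truncated_is_weights(logp_trainer, logp_sampler, cap=2.0):
    """Correct inference/training mismatch with truncated importance sampling (sketch).

    logp_trainer: per-token log-probs recomputed by the training framework
    logp_sampler: per-token log-probs from the inference engine that actually
                  generated the tokens (possibly a different precision/kernel)
    The ratio exp(logp_trainer - logp_sampler) reweights each token toward the
    trainer's distribution; the cap keeps a few badly mismatched tokens from
    dominating the gradient.
    """
    log_ratio = logp_trainer - logp_sampler
    weights = torch.exp(log_ratio).clamp(max=cap)
    return weights.detach()   # used as fixed multipliers on the per-token loss
```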
Related question, has “primacy bias” through plasticity loss been observed with LLM policies during longer RL runs? https://t.co/ojoiNZ1Jjr
arxiv.org
This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later. Because of training on...
0 · 0 · 7
EvoOpt and SGD are totally sufficient for RLVR. The finetuning loss landscape appears to be really easy to optimize, and robust to the noise and variance of the policy gradients. Why does pretraining result in weights that are easy to finetune?
🚨New Blog Alert: Is AdamW overkill for RLVR? We found that vanilla SGD is (1) as performant as AdamW and (2) naturally 36x more parameter efficient (much more than a rank-1 LoRA) 🤯 Looks like a "free lunch". Maybe it's time to rethink the optimizers for RLVR 🧵
1 · 3 · 26
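Mechanically, the optimizer swap the blog describes is a one-line change; the sketch below uses placeholder learning rates rather than the blog's actual settings.

```python
import torch

def make_optimizer(model, use_sgd=True):
    """Swap AdamW for plain SGD in an RLVR finetuning loop (sketch, placeholder LRs)."""
    if use_sgd:
        # Plain SGD keeps no per-parameter moment estimates, so optimizer
        # memory is a fraction of AdamW's two extra states per parameter.
        return torch.optim.SGD(model.parameters(), lr=1e-5)
    return torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.0)
```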
Given it’s cool to be bearish right now, some thoughts on RL (most ideas from @dwarkesh_sp’s post):
- During early pretraining, you receive ~log2(1/(1/vocab_size)) bits of information (e.g. for a 256k vocab, ~18 bits) PER forward pass.
- During RL, given a rollout of 32k tokens,
15 · 24 · 327
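A quick check of the arithmetic in the first bullet above: with a uniform prior over the vocabulary, one observed next token carries log2(1/(1/vocab_size)) = log2(vocab_size) bits, which for a ~256k vocabulary is about 18 bits.

```python
import math

vocab_size = 256_000                                # roughly 2**18
bits_per_token = math.log2(1 / (1 / vocab_size))    # = log2(vocab_size)
print(round(bits_per_token, 1))                     # ≈ 18.0
```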