Siddarth Venkatraman Profile
Siddarth Venkatraman

@siddarthv66

Followers: 1K · Following: 1K · Media: 29 · Statuses: 486

PhD at Mila | Reasoning with RL, previously diffusion and flows

Montréal, Québec
Joined September 2023
@siddarthv66
Siddarth Venkatraman
3 months
NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
22
102
790
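The thread itself isn't captured in this feed, so below is a minimal sketch of the loop the tweet describes: keep a population of candidate solutions and repeatedly prompt the model to aggregate subsets of them into improved candidates over several rounds. Everything here (prompts, population size, number of rounds, the `llm` callable) is an illustrative placeholder, not the actual RSA recipe.

```python
# Minimal sketch of a Recursive Self-Aggregation (RSA)-style test-time scaling
# loop, based only on the tweet above. `llm` is a stand-in for any
# text-generation callable; prompts and hyperparameters are illustrative.
import random

def rsa(llm, problem, population=16, subset_size=4, rounds=3):
    # Round 0: sample an initial population of independent candidate solutions.
    candidates = [llm(f"Solve:\n{problem}") for _ in range(population)]

    # Each round, aggregate random subsets of candidates into new candidates,
    # so useful partial ideas can recombine and propagate.
    for _ in range(rounds):
        new_candidates = []
        for _ in range(population):
            subset = random.sample(candidates, k=min(subset_size, len(candidates)))
            joined = "\n\n---\n\n".join(subset)
            prompt = (
                f"Problem:\n{problem}\n\n"
                f"Candidate solutions:\n{joined}\n\n"
                "Combine the strengths of these candidates, fix their errors, "
                "and write a single improved solution."
            )
            new_candidates.append(llm(prompt))
        candidates = new_candidates

    # Final answer: one last aggregation over the surviving population.
    final_prompt = (
        f"Problem:\n{problem}\n\nCandidates:\n" + "\n\n---\n\n".join(candidates)
        + "\n\nProduce the single best final solution."
    )
    return llm(final_prompt)
```

The "aggregation-aware RL" mentioned at the end of the tweet presumably trains the model to be better at exactly this aggregation step; that stage is not sketched here.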
@siddarthv66
Siddarth Venkatraman
3 days
New frontier lab unlocked
@luke_drago_
Luke Drago
3 days
1. what
0
0
6
@micahgoldblum
Micah Goldblum
4 days
For a long time, Yann LeCun and others believed in gradient-based planning, but it didn’t work very well … until now. Here’s how we did it using incredibly simple techniques. But first, an introduction to gradient-based planning: 🧵1/11
24
176
1K
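The 11-part thread isn't captured here, so as background only: "gradient-based planning" conventionally means treating a candidate action sequence as the optimization variable, rolling it through a differentiable (typically learned) dynamics model, and descending the resulting cost by backpropagation. A minimal sketch under those assumptions; `dynamics` and `cost` are placeholders, and this is not the thread's method.

```python
# Background sketch of vanilla gradient-based planning: optimize an action
# sequence by backpropagating a task cost through a differentiable dynamics
# model. `dynamics(state, action) -> next_state` and `cost(state, action)`
# are placeholders for a learned world model and a task objective.
import torch

def plan(dynamics, cost, state0, horizon=20, action_dim=4, steps=200, lr=0.05):
    # The plan itself is the optimization variable.
    actions = torch.zeros(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        state, total_cost = state0, 0.0
        # Roll the candidate plan through the model and accumulate cost.
        for t in range(horizon):
            state = dynamics(state, actions[t])
            total_cost = total_cost + cost(state, actions[t])
        total_cost.backward()   # gradients flow through the whole rollout
        opt.step()

    return actions.detach()
```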
@siddarthv66
Siddarth Venkatraman
3 days
Total French cultural domination @MistralAI?
@CultureCrave
Culture Crave 🍿
4 days
'Clair Obscur: Expedition 33' won 9 awards at #TheGameAwards, the most for any game ever 🎮
• Game of the Year
• Best Narrative
• Best Game Direction
• Best RPG
• Best Art Direction
• Best Score & Music
• Best Performance
• Best Debut Indie Game
• Best Independent
0
0
1
@siddarthv66
Siddarth Venkatraman
3 days
If you win too much as the underdog, the people turn on you. Def true with both AI and video games at least
0
0
2
@siddarthv66
Siddarth Venkatraman
6 days
PPO REINFORCE GRPO DAPO GSPO CISPO K2 TBA SAPO
5
1
45
@suchenzang
Susan Zhang
6 days
👀👀 another lab is starting to get leaky...
@MistralAI
Mistral AI
6 days
Our next-generation coding model family Devstral 2 is available in two sizes: Devstral 2 (123B) under a modified MIT license, and Devstral Small (24B) under Apache 2.0. Both SOTA, open-source, free to use, and available now via our API.
19
11
541
@siddarthv66
Siddarth Venkatraman
7 days
The real danger with this sort of journalism is that by using faulty or only partially correct arguments to shit on AI, these writers lose the ability to criticize the tech seriously. And I do think there’s a lot to criticize.
0
0
7
@siddarthv66
Siddarth Venkatraman
7 days
as a reminder: @moorehn cannot generate knowledge. She cannot create knowledge. She cannot find new information. She can only mix information that has already been found and written and input into computers by other journalists who don’t understand AI.
@moorehn
Heidi N. Moore
8 days
as a reminder: AI cannot generate knowledge. It cannot create knowledge. It cannot find new information. It can only mix information that has already been found and written and input into computers by humans.
1
1
11
@siddarthv66
Siddarth Venkatraman
8 days
Want to highlight this: as someone who just a few weeks ago was convinced we needed to figure out value functions for LLM RL, my priors have shifted. LLMs might just have “implicit value functions” that already reduce the effective variance
@kalomaze
kalomaze
9 days
@JoshPurtell i am saying that explicit value learning could be plausibly interpreted (partially) as an artifact of how old RL theory was originally developed; when policy networks were *too small* to implicitly learn a notion or representation of what may or may not be a valuable action
0
0
31
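For context on the “reduce effective variance” point above: in standard policy-gradient theory, a learned value function enters as a baseline that leaves the gradient unbiased while lowering its variance. The textbook identity (background material, not specific to either tweet):

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_t
      \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(R_t - b(s_t)\bigr)\right]
```

Any state-dependent baseline $b(s_t)$, typically a learned value estimate $V_\phi(s_t)$, leaves the expectation unchanged because $\mathbb{E}_{a_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)] = 0$, but it can greatly reduce the variance of the estimator. The “implicit value function” reading in these tweets is that a strong pretrained policy already keeps that variance manageable without an explicitly learned $V_\phi$.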
@siddarthv66
Siddarth Venkatraman
8 days
We’ll be presenting this at the FoRLM workshop between 10:15 and 11:30am in room 33 tomorrow! Drop by if you’d like to chat about this paper, or about RL for LLMs in general (I’ve got some juicy new insights)
@siddarthv66
Siddarth Venkatraman
3 months
NO verifiers. NO Tools. Qwen3-4B-Instruct can match DeepSeek-R1 and o3-mini (high) with ONLY test-time scaling. Presenting Recursive Self-Aggregation (RSA) — the strongest test-time scaling method I know of! Then we use aggregation-aware RL to push further!! 📈📈 🧵below!
3
8
29
@siddarthv66
Siddarth Venkatraman
10 days
Amazing work by Brian. Essentially contextualizes all the off-policy RL objectives you’ve probably seen recently
@bartoldson
Brian Bartoldson
11 days
🧊 Off-policy RL for LLMs is hard. Dr. GRPO collapses at 10 steps off-policy. TBA doesn't. @Kimi_Moonshot K2's approach is robust too – both independently landed on the same key ingredients 🤝 We ablate RL recipe ingredients + show the 2 small changes giving off-policy
1
5
73
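The linked work isn't reproduced here, but the shared ingredient across the off-policy objectives named in this feed (PPO, GRPO, GSPO, CISPO, ...) is a per-token importance ratio between the current policy and the policy that generated the rollout, usually clipped or truncated so that stale data doesn't destabilize updates. A generic sketch of that surrogate loss, not the exact TBA or K2 recipe:

```python
# Generic clipped importance-sampling policy-gradient loss for stale rollouts.
# Illustrates the common ingredient behind PPO/GRPO-style off-policy objectives;
# it is not the specific TBA or K2 recipe from the linked work.
import torch

def off_policy_pg_loss(logp_new, logp_old, advantages, clip=0.2, mask=None):
    """
    logp_new:   [B, T] token log-probs under the current policy
    logp_old:   [B, T] token log-probs under the policy that sampled the rollout
    advantages: [B, T] (or broadcastable) advantage estimates
    mask:       [B, T] 1.0 for response tokens, 0.0 for prompt/padding
    """
    ratio = torch.exp(logp_new - logp_old)                        # importance weights
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * advantages
    loss = -torch.minimum(unclipped, clipped)                     # pessimistic surrogate
    if mask is not None:
        return (loss * mask).sum() / mask.sum().clamp(min=1)
    return loss.mean()
```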
@bertgodel
daanish khazi
12 days
(1/5) New post: "Mismatch Praxis: Rollout Settings and IS Corrections". We pressure-tested solutions for inference/training mismatch. Inference/training mismatch in modern RL frameworks creates a hidden off-policy problem. To resolve the mismatch, various engineering (e.g., FP16
6
41
120
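The post is cut off above, but the issue it names is that the inference engine generating rollouts (different kernels, precision such as FP16 sampling) defines a slightly different distribution $\mu$ than the trainer's policy $\pi_\theta$, so even nominally on-policy data is mildly off-policy. The generic importance-sampling correction, stated at the token level and not necessarily the post's preferred fix, looks like:

```latex
\nabla_\theta J(\theta)
  \approx \mathbb{E}_{y \sim \mu}\!\left[\sum_t
      \frac{\pi_\theta(y_t \mid y_{<t})}{\mu(y_t \mid y_{<t})}\,
      \nabla_\theta \log \pi_\theta(y_t \mid y_{<t})\,\hat{A}_t\right]
```

where $\mu$ is the sampler's distribution as actually executed and $\hat{A}_t$ an advantage estimate; in practice the ratio is clipped or truncated to keep its variance in check.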
@siddarthv66
Siddarth Venkatraman
15 days
EvoOpt and SGD are totally sufficient for RLVR. The finetuning loss landscape appears to be really easy to optimize, and robust to the noise from high-variance policy gradients. Why does pretraining result in weights that are easy to finetune?
@saagnikkk
Sagnik
15 days
🚨New Blog Alert: Is AdamW overkill for RLVR? We found that vanilla SGD is (1) as performant as AdamW and (2) naturally 36x more parameter efficient (much more than a rank-1 LoRA) 🤯 Looks like a "free lunch". Maybe it's time to rethink the optimizers for RLVR 🧵
1
3
26
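For concreteness on the optimizer swap discussed in the blog tweet (the blog itself isn't reproduced here, and its "36x more parameter efficient" figure is its own measurement, not derived below): plain SGD without momentum carries no per-parameter optimizer state, whereas AdamW stores two extra moment tensors per parameter. A hedged sketch of the drop-in swap, with placeholder model and loss:

```python
# Sketch of swapping AdamW for plain SGD in an RLVR-style finetuning loop.
# `model`, `loss_fn`, and `batch` are placeholders; hyperparameters are illustrative.
import torch

def make_optimizer(model, use_sgd=True, lr=1e-5):
    if use_sgd:
        # No momentum buffers, no second-moment estimates: zero optimizer state.
        return torch.optim.SGD(model.parameters(), lr=lr)
    # AdamW keeps two extra tensors (exp_avg, exp_avg_sq) per parameter.
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    loss = loss_fn(model, batch)   # e.g. a policy-gradient surrogate loss
    loss.backward()
    optimizer.step()
    return loss.item()
```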
@fleetwood___
Fleetwood
19 days
Given it's cool to be bearish right now, some thoughts on RL (most ideas from @dwarkesh_sp's post):
- During early pretraining, you receive ~log2(1/(1/vocab_size)) bits of information (e.g. for a 256k vocab, ~18 bits) PER forward pass.
- During RL, given a rollout of 32k tokens,
15
24
327
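The per-forward-pass figure in the tweet is just the log of the vocabulary size (observing one token from a V-way vocabulary conveys at most log2(V) bits); a quick check of the arithmetic, noting that the RL side of the comparison is cut off above:

```python
# Quick check of the pretraining bits-per-forward-pass figure from the tweet.
import math

vocab_size = 256_000
bits_per_token = math.log2(vocab_size)   # ~17.97, i.e. the "~18 bits" quoted
print(f"{bits_per_token:.2f} bits per forward pass")
```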
@siddarthv66
Siddarth Venkatraman
18 days
hmm
@siddarthv66
Siddarth Venkatraman
20 days
Not a new idea, but reviewers being de-anonymized after the final decisions would really fix a significant chunk of the problem
4
1
29