Vaskar Nath (@vaskar_n)
65 Followers · 34 Following · 0 Media · 17 Statuses
🤔 How do we train LLMs on real-world tasks where it’s hard to define a single verifiable answer? Our work at @scale_AI introduces Rubrics as Rewards (RaR) — a framework for on-policy post-training that uses structured, checklist-style rubrics as interpretable reward signals. 🧵
5 replies · 43 reposts · 245 likes
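A minimal sketch of the checklist idea (my illustration: the rubric items, weights, and `judge` helper below are hypothetical, not the paper's actual rubrics or API): each rubric criterion is checked independently, and the weighted pass rate becomes the scalar reward for on-policy RL.

```python
# Sketch of a checklist-style rubric reward (illustrative only; the rubric
# items, weights, and judge below are NOT the paper's actual setup).

def judge(criterion: str, prompt: str, response: str) -> bool:
    """Stand-in for an LLM judge deciding if `response` meets `criterion`.
    A toy keyword check here; a real judge would query a model."""
    return criterion.split()[0].lower() in response.lower()

def rubric_reward(prompt: str, response: str,
                  rubric: list[tuple[str, float]]) -> float:
    """Weighted fraction of rubric criteria satisfied, in [0, 1]."""
    total = sum(w for _, w in rubric)
    passed = sum(w for crit, w in rubric if judge(crit, prompt, response))
    return passed / total

example_rubric = [
    ("cites a relevant source", 1.0),
    ("states the final answer explicitly", 2.0),
    ("avoids unsupported claims", 2.0),
]
```

In RaR proper the grading comes from prompted judge models over structured rubric items; the point here is only the aggregation shape.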
Waking up to see this new paper from @scale_AI charting on the @yesnoerror trending feed. Authors: @anisha_gunjal, @aytwang, Elaine Lau, @vaskar_n, @BingLiu1011, and @SeanHendryx. "Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains" Simplified: Teaching…
9 replies · 22 reposts · 76 likes
@karpathy a neat quality specific to language models is that you can just tell them what to do differently when they fail. And if you use importance sampling, gradients are aligned with the unguided context and it gets into the weights directly. No sleep needed https://t.co/qJ2Qv43rYp
For online RL, we introduce Guide, a class of algorithms that incorporates guidance into the model’s context when all rollouts fail and adjusts the importance-sampling ratio to optimize the policy for contexts in which guidance is no longer present.
0 replies · 1 repost · 5 likes
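As a sketch of how that importance-sampling correction might look (an assumed form with hypothetical names, not the paper's code): the rollout is sampled from the guided policy pi(y | x, g), and the weight moves the update toward the unguided context pi(y | x) that the model faces once guidance is gone.

```python
import torch

# Sketch of a Guide-style importance-sampling correction (assumed form,
# not the paper's implementation). The rollout was sampled from the
# *guided* policy pi(y | x, g); the weight reweights the update toward
# the *unguided* context pi(y | x).

def guided_pg_loss(logp_unguided: torch.Tensor,
                   logp_guided: torch.Tensor,
                   reward: float,
                   baseline: float = 0.0,
                   max_ratio: float = 10.0) -> torch.Tensor:
    with torch.no_grad():  # the weight is treated as a constant
        w = torch.exp(logp_unguided - logp_guided).clamp(max=max_ratio)
    advantage = reward - baseline
    # REINFORCE-style: gradient flows through the *unguided* log-prob
    return -(w * advantage * logp_unguided)
```

Here `logp_unguided` and `logp_guided` would be the summed log-probabilities of the sampled tokens under each context.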
What will the learning environments of the future that train artificial superintelligence look like? In recent work at @scale_AI, we show that training systems combining verifiable rewards with multi-agent interaction accelerate learning.
12 replies · 30 reposts · 129 likes
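For concreteness, a verifiable reward is one the environment can check programmatically, with no learned reward model; a toy illustration (mine, not the paper's environment):

```python
# Toy verifiable reward (not the paper's environment): the answer can be
# checked programmatically, e.g. exact match against ground truth.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """1.0 if the normalized answer matches the verifier, else 0.0."""
    normalize = lambda s: s.strip().lower().rstrip(".")
    return float(normalize(model_answer) == normalize(ground_truth))

assert verifiable_reward(" 42. ", "42") == 1.0
```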
Over the last year, I have worked on data curation and scaling to effectively improve performance through SFT and RLHF. Check out the blog post I wrote, detailing my findings. https://t.co/5nFl6Flzbd (Thank you @natolambert for the shoutout in the latest Interconnects post!)
mohit-raghavendra.notion.site · Mohit Raghavendra
1 reply · 2 reposts · 3 likes
GPT-4.5 Preview eval results are out on SEAL 👀 ⚡ #2 in Tool Use - Chat 🏢 #3 in Tool Use - Enterprise 🥉 #3 in EnigmaEval (behind Claude 3.7 Sonnet) 📚 #4 in MultiChallenge 🎓 #5 in Humanity’s Last Exam 🔍 #6 in VISTA (multimodal) See rankings here: https://t.co/pVIgk6rIcL
25 replies · 16 reposts · 228 likes
If you’ve ever finetuned a pretrained language model on a reasoning task at the edge of its capabilities, you were probably skeptical of the superficial alignment hypothesis. Turns out you were right. 1/🤔
8 replies · 45 reposts · 264 likes
Contrary to prior work, new research from Scale finds that LLMs continue to learn new knowledge during post-training, following a power law similar to well-known pre-training scaling laws 🧵 https://t.co/aR03YQuJ3u
3 replies · 7 reposts · 23 likes
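For intuition, here is what fitting such a power law looks like in practice; the functional form and data points below are illustrative, not the paper's:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustration only: fit a saturating power law, loss(n) = a * n**(-b) + c,
# to synthetic (post-training tokens, eval loss) pairs. The paper's exact
# functional form and data are not reproduced here.

def power_law(n, a, b, c):
    return a * n ** (-b) + c

tokens = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss = np.array([2.10, 1.85, 1.62, 1.48, 1.37])  # made-up points

(a, b, c), _ = curve_fit(power_law, tokens, loss, p0=(10.0, 0.2, 1.0))
print(f"fit: loss(n) = {a:.2f} * n^(-{b:.3f}) + {c:.2f}")
```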
Enabling LLMs to reason more deeply at inference time via search is one of the most exciting directions in AI right now. We introduce PlanSearch, a novel method for code generation that searches over high-level "plans" in natural language as a means of encouraging diversity.
16 replies · 99 reposts · 635 likes
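The shape of the method, as a rough sketch (the `generate` helper and prompts are hypothetical stand-ins, not the PlanSearch implementation): sample diverse natural-language plans first, then condition code generation on each plan.

```python
# Sketch of plan-then-code sampling in the spirit of PlanSearch
# (hypothetical `generate` helper, not the actual implementation).

def generate(prompt: str, n: int = 1, temperature: float = 1.0) -> list[str]:
    """Stand-in for an LLM sampling call; wire up a model of your choice."""
    raise NotImplementedError

def plan_search(problem: str, n_plans: int = 8) -> list[str]:
    # 1) Search the space of high-level natural-language plans, which
    #    diversifies better than sampling code completions directly.
    plans = generate(f"List distinct high-level plans to solve:\n{problem}",
                     n=n_plans, temperature=1.0)
    # 2) Condition code generation on each plan.
    return [generate(f"Problem:\n{problem}\nPlan:\n{plan}\nWrite the code.",
                     n=1, temperature=0.2)[0]
            for plan in plans]
```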
Reasoning at length will be a key part of LLMs solving more challenging problems, but how can we make sure that their chain of thought stays on track? At @scale_AI, we’ve developed a method to learn token-wise expected rewards from pairwise preference labels 🧵
3 replies · 6 reposts · 18 likes
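One plausible way to get token-wise scores out of pairwise labels, as a sketch (a Bradley-Terry loss over pooled per-token scores; an assumed formulation, not necessarily the paper's exact objective):

```python
import torch
import torch.nn.functional as F

# Sketch: learn per-token scores from pairwise preference labels via a
# Bradley-Terry objective on pooled token scores (assumed formulation,
# not necessarily the paper's exact training objective).

def pairwise_loss(tok_scores_chosen: torch.Tensor,
                  tok_scores_rejected: torch.Tensor) -> torch.Tensor:
    """tok_scores_*: (seq_len,) per-token scores from a reward head."""
    r_chosen = tok_scores_chosen.mean()    # sequence score = mean token score
    r_rejected = tok_scores_rejected.mean()
    # -log sigma(r_chosen - r_rejected): standard Bradley-Terry preference loss
    return -F.logsigmoid(r_chosen - r_rejected)
```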
Our paper on this work, “Learning Goal-Conditioned Representations for Language Reward Models,” by @vaskar_n, @dylanslack20, @_jeffda, @TommyMa9, @hughbzhang, Spencer Whitehead, and @SeanHendryx will be presented at NeurIPS 2024 main track — we hope to see you there!
arxiv.org · "Techniques that learn improved representations via offline data or self-supervised objectives have shown impressive results in traditional reinforcement learning (RL). Nevertheless, it is unclear..."
1 reply · 1 repost · 5 likes
Our researchers at Scale have developed a novel method to evaluate LLM output during generation instead of waiting until it’s complete — like a GPS recalculating when you go off route, before you’re at the wrong place. Learn more on the Scale blog: https://t.co/ktuWKzrSrG
2 replies · 5 reposts · 32 likes
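To make the GPS analogy concrete, a toy monitor (my illustration, not Scale's method): watch a running average of token-wise scores during decoding and flag when it drops below a threshold.

```python
# Toy derailment monitor (not Scale's implementation): track a running
# average of per-token scores during decoding and flag when it drops.

def going_off_route(token_scores: list[float],
                    window: int = 16, threshold: float = 0.0) -> bool:
    """True if the recent average token-wise score falls below threshold."""
    if len(token_scores) < window:
        return False
    recent = token_scores[-window:]
    return sum(recent) / window < threshold
```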
New research from Scale — detecting problems in LLM outputs before it's too late. This work will be presented at NeurIPS 2024 main track — congrats @vaskar_n, @dylanslack20, @_jeffda, @TommyMa9, @hughbzhang, Spencer Whitehead, and @SeanHendryx!
0 replies · 4 reposts · 32 likes
With @PPotaptchik and @GeorgeDeligian9 we show the first realistic bound on the iteration complexity of diffusion models! Our work explains why sampling from ImageNet needs only ~100 (intrinsic dim) steps instead of ~150k (extrinsic dim). https://t.co/ixKB6Lj2sL
1 reply · 13 reposts · 43 likes
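A quick sanity check on the extrinsic figure, assuming the standard 224x224 RGB ImageNet crop (my arithmetic, not from the paper):

```latex
% Extrinsic dimension of a standard 224 x 224 RGB ImageNet input
% (assumed preprocessing; consistent with the tweet's ~150k figure):
D_{\mathrm{extrinsic}} = 224 \times 224 \times 3 = 150{,}528 \approx 1.5 \times 10^{5}
```

So the claimed bound replaces a step count on the order of the ambient dimension with one on the order of the intrinsic dimension (~100 for ImageNet, per the tweet).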
Excited to see the results of ToolComp drop today! It’s been incredible to be part of the work that helps advance tool-use capabilities in AI models. Amazing to see the progress across the board—congrats to everyone involved! 🚀🛠️🤖
We’re releasing the results on ToolComp today, a Scale AI SEAL leaderboard that tests the ability of agents to plan, reason, and compose multiple dependent tool calls. OpenAI models lead, with Claude showing strong performance in the Chat setting. 1/🛠️🤖
0 replies · 0 reposts · 2 likes
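To make "dependent tool calls" concrete, a toy composition (mine, not an actual ToolComp task): the second call's argument is produced by the first call, so the agent must plan the dependency.

```python
# Toy example of composing dependent tool calls (not an actual ToolComp
# task): the second call's argument depends on the first call's output.

def search_city(query: str) -> str:
    """Stand-in tool: resolve a query to a city name."""
    return {"capital of France": "Paris"}.get(query, "unknown")

def get_weather(city: str) -> str:
    """Stand-in tool: look up weather for a city."""
    return {"Paris": "18°C, clear"}.get(city, "no data")

# The agent must realize the composition get_weather(search_city(query)).
city = search_city("capital of France")
print(get_weather(city))  # -> 18°C, clear
```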