Adithya Bhaskar
@AdithyaNLP
Followers: 439 · Following: 191 · Media: 30 · Statuses: 68
Third year CS PhD candidate at Princeton University (@princeton_nlp @PrincetonPLI), previously CS undergrad at IIT Bombay
Princeton, NJ
Joined June 2023
Language models that think, chat better. We used longCoT (w/ reward model) for RLHF instead of math, and it just works. Llama-3.1-8B-Instruct + 14K ex beats GPT-4o (!) on chat & creative writing, & even Claude-3.7-Sonnet (thinking) on AlpacaEval2 and WildBench! Read on. 🧵 1/8
Text-to-image (T2I) models can generate rich supervision for visual learning, but generating subtle distinctions remains challenging. Fine-tuning helps, but too much tuning → overfitting and loss of diversity. How do we preserve fidelity without sacrificing diversity? (1/8)
Claude Skills shows performance benefits from leveraging LLM skill catalogs at inference time. Our previous work (linked under thread 5/5) showed the same 6 months ago! 🌟Our new work, STAT, shows that leveraging skills during training can greatly help too‼️, e.g., Qwen can
Check out our new work on making reasoning models think broadly! 🤔 We find a minimalist, surprisingly effective recipe to THINK for CHAT: RLVR + a strong reward model, trained on real-world prompts. This project was fun and surprised me in a few ways 👇 📌 We can run RL
Thanks for tweeting our paper!! 😁
The paper shows that making models think before answering makes them chat better. It introduces Reinforcement Learning with Model-rewarded Thinking (RLMT), which has the model write a private plan, then the final reply. A separate reward model, trained from human choices,
Honored to be included in the list, thanks a lot!
7. Language Models that Think, Chat Better: A simple recipe, RL with Model-rewarded Thinking, makes small open models “plan first, answer second” on regular chat prompts and trains them with online RL against a preference reward. https://t.co/P6HqnTEOUo
Top AI Papers of The Week (September 22-28):
- ATOKEN
- LLM-JEPA
- Code World Model
- Teaching LLMs to Plan
- Agents Research Environments
- Language Models that Think, Chat Better
- Embodied AI: From LLMs to World Models
Read on for more:
Thanks a lot for the tweet! We had a lot of fun working on this project! 😄
Language Models that Think, Chat Better "This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces RL with Model-rewarded Thinking (RLMT) for general-purpose chat capabilities." "RLMT consistently outperforms standard RLHF pipelines. This
We release our 1) paper at https://t.co/PgMeacp3qb 2) code at https://t.co/cMLkDzzrVi 3) models, SFT data, and RL prompts at https://t.co/MEtceOmssY Thanks to my co-author @xiye_nlp and advisor @danqi_chen! 8/8
Oh, and here is the customary plot that shows that the model learns to think longer as the training progresses. We think it's cool. 7/8
What kind of plans is the LM making? We checked, and it is refreshingly non-slop. It tries to cross-link different parts of the answer, carefully navigates edge cases, doesn’t just throw everything into a billion nested lists, and even refines and iterates on its draft/plan! 6/8
Okay, so what matters? We found: (1) the prompt mixture matters, (2) the source of SFT responses matters less, and (3) the strength of the reward model matters a lot. 5/8
The gains in chat/creative writing are huge. Our warm-start instruct models beat GPT-4o on chat/creative writing, and even Claude-3.7-Sonnet (thinking) on AE2/WB! The zero models beat instruct versions on chat/CW (and Qwen-zero even beats Instruct on other benchmarks). 4/8
How can you make LMs think? You can SFT (warm-start), or you can prompt them like "A conversation between… the assistant first thinks…" ("zero"). TLDR: warm-start works with DPO/PPO/GRPO, but zero needs GRPO. In all cases, thinking outperforms non-thinking by >= 1-3 pts avg. 3/8
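A minimal sketch of what the "zero" setup above might look like in practice, assuming a thinking preamble and <think>/<answer> tags; the exact wording and tags here are illustrative placeholders, not the paper's actual prompt.

ZERO_PREAMBLE = (
    "A conversation between a user and an assistant. The assistant first thinks "
    "through the request privately inside <think>...</think>, then writes the "
    "final reply inside <answer>...</answer>."
)

def build_zero_prompt(user_message: str) -> str:
    # Wrap a raw chat prompt with the thinking preamble (no SFT warm-start).
    return f"{ZERO_PREAMBLE}\n\nUser: {user_message}\nAssistant:"

def split_thought_and_answer(completion: str) -> tuple[str, str]:
    # Separate the private thought from the user-facing answer.
    thought = completion.split("<think>")[-1].split("</think>")[0].strip()
    answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
    return thought, answer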
Everyone’s training “thinking” math/science LMs, but we think for other stuff too: we outline essays, scribble shopping lists, and rehearse speeches. Doesn’t make sense that LMs can’t, so we made them. Simple recipe: prompt → thought → response, remove thought, score w/ RM. 2/8
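A minimal sketch of that recipe, assuming generic generate and reward_model callables (their names and signatures are placeholders, not the released code): roll out prompt → thought → response, strip the thought, and score only the visible response with the reward model.

from typing import Callable

def rlmt_reward(
    prompt: str,
    generate: Callable[[str], str],             # policy rollout: prompt -> "<think>...</think> reply"
    reward_model: Callable[[str, str], float],  # RM score for (prompt, visible reply)
) -> float:
    completion = generate(prompt)
    # Remove the private thought so only the user-facing reply is scored.
    if "</think>" in completion:
        reply = completion.split("</think>", 1)[1].strip()
    else:
        reply = completion.strip()
    return reward_model(prompt, reply)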
Are AI scientists already better than human researchers? We recruited 43 PhD students to spend 3 months executing research ideas proposed by an LLM agent vs human experts. Main finding: LLM ideas result in worse projects than human ideas.
Paper also has (1) ablation & sensitivity studies (2) PruLong for pretraining (3) more idealized & real (hardware) metrics! Paper: https://t.co/D3bWZWshyn Code: https://t.co/EBM2dhzIcZ Special thanks to my coauthors @_awettig @YiheS5 @gaotianyu1350 @danqi_chen! 7/7
github.com: Code for the preprint "Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?" - princeton-pli/PruLong
Our modifications substantially reduce the critical KV footprint needed to retain 90% of performance for the two methods, by up to 30 absolute percentage points, when evaluated on long -> short (HELMET) as well as long -> long (LongProc) benchmarks. 6/7
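As a rough sketch of how that footprint could be measured (the evaluate callable, standing in for a HELMET/LongProc run at a given KV retention fraction, is a placeholder, not the actual evaluation harness): sweep retention fractions and take the smallest one that still reaches 90% of the full-cache score.

from typing import Callable, Sequence

def critical_kv_footprint(
    evaluate: Callable[[float], float],  # benchmark score with this fraction of KVs kept
    fractions: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
    threshold: float = 0.9,
) -> float:
    # Smallest fraction of the KV cache that retains `threshold` of full-cache performance.
    full_score = evaluate(1.0)
    for frac in fractions:  # fractions assumed sorted ascending
        if evaluate(frac) >= threshold * full_score:
            return frac
    return 1.0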