Jatin Prakash
@bicycleman15
Followers
81
Following
458
Media
5
Statuses
29
phd'ing @NYU_Courant | previously @MSFTResearch @iitdelhi
New Delhi, India
Joined June 2021
If you'd deploy model X, does it matter then that model X has more parameters than model Y? Is this a fair comparison b/w model X and Y? If not, what is the "fair" comparison? (2/2 🧵)
0
0
0
A question for you guys 🤔 Suppose:
- model X has more parameters than model Y
- model X outperforms model Y in quality
- model X has lower inference costs than model Y (both compute and memory usage)
Which model would you deploy for your startup: model X or Y? (1/2 🧵)
1
0
0
[LG] Attention and Compression is all you need for Controllably Efficient Language Models J Prakash, A Puli, R Ranganath [New York University] (2025) https://t.co/Ht7cSxXPcd
1
11
42
On a personal note:
- I was quite surprised by how much information can be compressed and decoded accurately, but ONLY if sufficient compute is used
- Using more parameters can result in higher quality AND efficiency, and compression is a way to unlock this interesting interplay
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test-time? Introducing Compress & Attend Transformers (CATs) that provide you exactly this! 🧵(1/n) 👇
0
0
4
11/ Finally, I would like to thank some works from which I drew inspiration for CATs, especially:
- the idea of Matryoshka by @adityakusupati (hot take of mine: if there is a hyper-parameter that gives some trade-off, you can probably "Matryoshka" it 🪆)
- and FlexiViTs by @giffmana
0
0
2
10/ It was great working with @aahladpuli and Rajesh Ranganath on this! Check out the paper ( https://t.co/ImUEltgNHJ) and the single-file, scalable, pure-PyTorch implementations of CATs ( https://t.co/NQsQkA8IgA). Excited to see what people build with this!
1
0
2
8/ With a single training run, CATs enable:
- smaller KV-cache memory for long-context reasoning in increasingly memory-bound workloads
- longer shelf-life of models, by adapting them to run on cheaper hardware
- lower compute under increased traffic
and much more ...
1
0
1
7/ cont’d
✅ parallel and scalable training (unlike sequential recurrent approaches)
✅ end-to-end learning of what to keep in memory (unlike fixed, heuristic attention masks)
✅ BOTH compute & memory savings during generation (unlike NSA)
1
0
1
7/ CATs alleviate a number of issues with current efficient approaches, and provide:
✅ native support for adaptivity at test-time
✅ flexible yet efficient memory as the sequence grows (unlike linear attention, MegaByte or Block Transformers)
(cont’d 👇)
1
0
1
6/ Further, we develop a version of CAT that can be used as a drop-in replacement layer in any architecture 📥 This unlocks lots of interesting possibilities -- starting with natively adaptive hybrid architectures that mix CATs with dense attention, or even linear attention 🧩
1
0
1
5/ This adaptivity is crucial since, a priori, it's not known how many resources a downstream task needs. Writing short emails might not need lots of compute, but code autocomplete needs to recall function names from a long repository, and that may warrant more compute and memory.
1
0
1
4/ Importantly, CATs are trained on multiple chunk sizes at once, which unlocks 🔓 quality-efficiency trade-offs directly at test-time, without any re-training! This enables CATs to adapt to diverse downstream tasks, all requiring varying compute-memory budgets ⚖️
1
0
1
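To make this test-time knob concrete, here is a minimal sketch, assuming a single CAT checkpoint trained over a hypothetical chunk-size set {4, 8, 16}; the `pick_chunk_size` helper and all budget numbers are illustrative, not from the paper or the released code.

```python
# Hypothetical helper illustrating the test-time knob described in the tweet
# above: one trained CAT checkpoint, several chunk sizes, pick per deployment.
TRAINED_CHUNK_SIZES = [4, 8, 16]  # assumed set the model was trained with

def pick_chunk_size(seq_len: int, kv_budget_entries: int) -> int:
    """Return the least-compressed (highest-quality) chunk size whose
    compressed sequence still fits into the given KV-cache budget."""
    for c in sorted(TRAINED_CHUNK_SIZES):
        if seq_len // c <= kv_budget_entries:
            return c
    return max(TRAINED_CHUNK_SIZES)  # fall back to the most aggressive compression

# Plenty of memory: use the smallest chunk size (best quality).
print(pick_chunk_size(seq_len=32_768, kv_budget_entries=8_192))  # -> 4
# Tight memory: same weights, just a coarser compression level.
print(pick_chunk_size(seq_len=32_768, kv_budget_entries=2_048))  # -> 16
```

The point is only that switching budgets is a configuration change at deployment time, not a retraining run.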
3/ Since compression reduces the effective sequence length, compute FLOPs & KV-cache memory shrink by a factor of the chunk size! CATs are up to 3x faster 🚀 and 9x more memory-efficient 📏 compared to a dense transformer while matching it on language modeling tasks.
1
0
1
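To see where the "factor of chunk size" savings in the tweet above come from, here is a back-of-envelope calculation; all model dimensions are invented for illustration and are not the configurations evaluated in the paper.

```python
# Back-of-envelope KV-cache sizing: caching one entry per chunk of c tokens
# instead of one per token shrinks the cache (and the attention FLOPs, which
# scale with the number of cached entries) by roughly a factor of c.
# All dimensions below are made up for illustration.
seq_len, n_layers, n_heads, head_dim, chunk = 32_768, 32, 32, 128, 8
bytes_per_scalar = 2  # fp16

def kv_cache_bytes(entries: int) -> int:
    # keys + values (2x), across every layer and head
    return 2 * n_layers * n_heads * head_dim * bytes_per_scalar * entries

dense = kv_cache_bytes(seq_len)          # one cached entry per token
cat = kv_cache_bytes(seq_len // chunk)   # one cached entry per chunk
print(f"dense transformer  : {dense / 2**30:.1f} GiB")  # 16.0 GiB
print(f"chunked (c={chunk}) : {cat / 2**30:.1f} GiB")   # 2.0 GiB
```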
2/ CAT is a conceptually simple architecture built from _just_ two well-known ingredients: dense attention and compression. The key idea is to decode chunks of tokens by attending to compressed chunks of the sequence so far. Everything is learnt end-to-end!
1
0
1
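For readers who prefer code, below is a deliberately tiny sketch of the decode-chunks-by-attending-to-compressed-chunks idea from the tweet above. It is not the authors' implementation (see the repo linked later in the thread): the mean-pool-plus-linear compressor, the chunk-level-only causality, and every name here are simplifications chosen to keep the example short.

```python
# Toy sketch of the compress-and-attend idea (NOT the released CAT code).
# The compressor is just mean-pool + linear, and causal masking *within* a
# chunk is omitted for brevity.
import torch
import torch.nn as nn

class ToyCAT(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, chunk_size: int = 8):
        super().__init__()
        self.chunk_size = chunk_size
        self.compress = nn.Linear(d_model, d_model)  # stand-in learned compressor
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by chunk_size
        b, t, d = x.shape
        c = self.chunk_size
        chunks = x.view(b, t // c, c, d)
        # One compressed vector per chunk: memory grows as t / c, not t.
        summaries = self.compress(chunks.mean(dim=2))        # (b, t // c, d)
        out = torch.zeros_like(x)
        for i in range(t // c):
            query = chunks[:, i]                             # current chunk (b, c, d)
            # Decode the chunk by attending to compressed chunks seen so far;
            # the very first chunk only has itself to attend to.
            memory = query if i == 0 else summaries[:, :i]
            dec, _ = self.attn(query, memory, memory)
            out[:, i * c : (i + 1) * c] = dec
        return out

x = torch.randn(2, 32, 64)
print(ToyCAT()(x).shape)  # torch.Size([2, 32, 64])
```

The real architecture learns what each chunk summary should contain end-to-end and trains over multiple chunk sizes; this sketch only shows why attending to summaries instead of raw tokens shrinks both the attended memory and the attention compute.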
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test-time? Introducing Compress & Attend Transformers (CATs) that provide you exactly this! 🧵(1/n) 👇
1
11
24
✨Masked Diffusion Language Models✨ are great for reasoning, but not just for the reasons you think! Fast parallel decoding? 🤔 Any-order decoding? 🤨 Plot twist: MDLMs offer A LOT MORE for inference and post-training! 🎢🧵
4
36
165
Super cool work by @AnirudhBuvanesh and @bicycleman15. Basically, when doing RL on a hard task for which rewards are sparse and SFT is either not an option or not desired, you SHOULD NOT use dense rewards. Instead, make easier versions of your examples! They tested this on
0
1
3
What to do when you have zero rewards during RL? We benchmarked RL baselines on a simple star-graph task where they underperform in zero-reward scenarios. Turns out, a dead-simple data-centric intervention of just adding easy samples of the task helps unlock RL training! 👇
Zero rewards after tons of RL training? 😞 Before using dense rewards or incentivizing exploration, try changing the data. Adding easier instances of the task can unlock RL training. 🔓📈 To know more, check out our blog post here: https://t.co/BPErVcLmP8. Keep reading 🧵(1/n)
1
7
12
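A hedged sketch of the data-centric fix described in the two tweets above, assuming a star-graph-style task whose difficulty can be dialed down; the `make_instance` generator and the mixing ratio are hypothetical, not taken from the blog post.

```python
# Illustrative only: build an RL prompt set that mixes easier instances of the
# task with the hard ones, instead of reshaping the reward. The task generator
# and the 25% mix of easy samples are hypothetical, not from the blog post.
import random

def make_instance(n_branches: int, path_len: int) -> dict:
    """Stand-in generator for a star-graph-style problem; smaller values = easier."""
    return {"n_branches": n_branches, "path_len": path_len}

hard = [make_instance(n_branches=10, path_len=8) for _ in range(900)]
easy = [make_instance(n_branches=3, path_len=2) for _ in range(300)]  # added easy samples

# Sparse-reward RL on `hard` alone may never see a positive reward; mixing in
# easy instances gives the policy something it can actually solve early on.
train_set = hard + easy
random.shuffle(train_set)
print(len(train_set), sum(inst["path_len"] == 2 for inst in train_set))  # 1200 300
```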
Starting something close to my heart. As a woman in male-dominated spaces, I’ve often longed for more women friendships. Moving to a new city and starting a male-heavy degree has made them even harder to find. (1/n)
1
1
1