Jatin Prakash Profile
Jatin Prakash

@bicycleman15

Followers
81
Following
458
Media
5
Statuses
29

phd'ing @NYU_Courant | previously @MSFTResearch @iitdelhi

New Delhi, India
Joined June 2021
@bicycleman15
Jatin Prakash
4 days
i'd deploy model X. does it matter then that model X has more parameters than model Y? is this a fair comparison b/w model X and Y? if not, what is the "fair" comparison? (2/2 🧵)
0
0
0
@bicycleman15
Jatin Prakash
4 days
A question for you guys 🤔 Suppose:
- model X has more parameters than model Y
- model X outperforms model Y in quality
- model X has lower inference costs than model Y (both compute and memory usage)
Which model would you deploy for your startup: model X or Y? (1/2 🧵)
1
0
0
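One rough way to make the question above concrete: score the two models only on deployment-relevant quantities. This is an illustrative sketch with made-up numbers and per-unit prices; only the framing comes from the thread.

# Illustrative only: hypothetical numbers, not from the thread.
# Quality and per-token serving cost enter the decision directly;
# parameter count only matters through them.
models = {
    # name: (quality score, FLOPs per token, KV-cache bytes per token, params)
    "X": (0.82, 1.0e9, 0.5e6, 7e9),   # more params, better quality, cheaper to serve
    "Y": (0.78, 3.0e9, 1.5e6, 3e9),   # fewer params, worse quality, costlier to serve
}

def serving_cost_per_token(flops, kv_bytes, price_per_flop=1e-17, price_per_byte=1e-12):
    # crude per-token cost from compute plus memory traffic (made-up prices)
    return flops * price_per_flop + kv_bytes * price_per_byte

for name, (quality, flops, kv_bytes, params) in models.items():
    cost = serving_cost_per_token(flops, kv_bytes)
    print(f"model {name}: quality={quality:.2f}, cost/token≈{cost:.1e}, params={params:.0e}")
# Note that parameter count never enters the cost; that is the crux of the question.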
@fly51fly
fly51fly
18 days
[LG] Attention and Compression is all you need for Controllably Efficient Language Models J Prakash, A Puli, R Ranganath [New York University] (2025) https://t.co/Ht7cSxXPcd
1
11
42
@bicycleman15
Jatin Prakash
18 days
On a personal note:
- I was quite surprised by how much information can be compressed and decoded accurately, but ONLY if sufficient compute is used
- Using more parameters can result in higher quality AND efficiency, and compression is a way to unlock this interesting interplay
@bicycleman15
Jatin Prakash
18 days
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test-time? Introducing Compress & Attend Transformers (CATs) that provide you exactly this! 🧵(1/n) 👇
0
0
4
@bicycleman15
Jatin Prakash
18 days
11/ Finally, I would like to thank some works where I got some inspiration for CATs, especially:
- Idea of Matryoshka by @adityakusupati (hot take of mine: if there is a hyper-parameter that gives some trade-off, you can probably "Matryoshka" it 🪆)
- and FlexiViTs by @giffmana
0
0
2
@bicycleman15
Jatin Prakash
18 days
10/ It was great working with @aahladpuli and Rajesh Ranganath on this! Check out the paper ( https://t.co/ImUEltgNHJ) and the single-file, scalable, pure PyTorch implementations for CATs ( https://t.co/NQsQkA8IgA). Excited to see what people build with this!
1
0
2
@bicycleman15
Jatin Prakash
18 days
8/ With a single training run, CATs enable:
- smaller KV-cache memory for long-context reasoning in increasingly memory-bound workloads
- longer shelf-life of models by adapting them to be deployed on cheaper hardware
- lower compute under increased traffic
and much more ...
1
0
1
@bicycleman15
Jatin Prakash
18 days
7/ cont’d
✅ parallel and scalable training (unlike sequential recurrent approaches)
✅ learn end-to-end what to keep in memory (unlike fixed, heuristic attention masks)
✅ BOTH compute & memory savings during generation (unlike NSA)
1
0
1
@bicycleman15
Jatin Prakash
18 days
7/ CATs alleviate a number of issues with current efficient approaches, and provide:
✅ native support for adaptivity at test-time
✅ flexible yet efficient memory as the sequence grows (unlike linear attention, MegaByte or Block Transformers)
(cont’d 👇)
1
0
1
@bicycleman15
Jatin Prakash
18 days
6/ Further, we develop a version of CAT that can be used as a drop-in replacement layer in any architecture 📥 This unlocks lots of interesting possibilities -- starting with natively adaptive hybrid architectures that mix CATs with dense attention, or even linear attention 🧩
1
0
1
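To make the hybrid idea in the tweet above concrete, here is a toy, runnable sketch that alternates a stand-in compress-&-attend layer with ordinary dense attention. The mean-pool layer below is a caricature I made up for illustration; the actual drop-in CAT layer is the one in the linked repo, and the mixing ratio is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCATLayer(nn.Module):
    # Toy stand-in: mean-pool each chunk into one "compressed" token and let
    # the full sequence attend to that compressed memory. Purely illustrative;
    # the real drop-in layer is in the paper's repo.
    def __init__(self, d_model, n_heads=4, chunk_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                              # x: (batch, seq, d_model)
        b, t, d = x.shape
        pad = (-t) % self.chunk_size
        xp = F.pad(x, (0, 0, 0, pad))                  # pad seq dim to a chunk multiple
        memory = xp.view(b, -1, self.chunk_size, d).mean(dim=2)
        out, _ = self.attn(x, memory, memory)          # attend over compressed chunks
        return x + out

class ToyHybridStack(nn.Module):
    # Alternate toy CAT layers with dense-attention layers (mixing ratio assumed).
    def __init__(self, d_model=64, n_layers=4, chunk_size=4):
        super().__init__()
        self.layers = nn.ModuleList([
            ToyCATLayer(d_model, chunk_size=chunk_size) if i % 2 == 0
            else nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for i in range(n_layers)
        ])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

print(ToyHybridStack()(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])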
@bicycleman15
Jatin Prakash
18 days
5/ This adaptivity is crucial since, a priori, it's not known how many resources a downstream task needs. Writing short emails might not need lots of compute. But code autocomplete needs to recall function names from a long repository, and that may warrant more compute and memory.
1
0
1
@bicycleman15
Jatin Prakash
18 days
4/ Importantly, CATs are trained on multiple chunk sizes at once, which unlocks 🔓 the quality-efficiency trade-offs directly at test-time, without any re-training! This enables CATs to adapt to diverse downstream tasks, all requiring varying compute-memory budgets ⚖️
1
0
1
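A hedged sketch of what "trained on multiple chunk sizes at once" could look like in code. The chunk-size set, the per-step sampling, and the chunk_size keyword are my assumptions for illustration; the actual training recipe is the one in the paper.

import random

CHUNK_SIZES = [1, 2, 4, 8]   # assumed set of supported chunk sizes (the "knob")

def train_step(model, batch, optimizer, loss_fn):
    # Vary the knob during training so a single model covers the whole
    # quality-efficiency curve (sampling scheme assumed, not the paper's).
    chunk_size = random.choice(CHUNK_SIZES)
    logits = model(batch["inputs"], chunk_size=chunk_size)
    loss = loss_fn(logits, batch["targets"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time, no retraining: just fix the knob to fit the budget, e.g.
#   model.generate(prompt, chunk_size=8)   # cheaper and faster
#   model.generate(prompt, chunk_size=2)   # more compute, higher quality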
@bicycleman15
Jatin Prakash
18 days
3/ Since compression reduces the sequence length, compute FLOPs & KV-cache memory diminish by a factor of the chunk size! CATs are up to 3x faster 🚀 and 9x more memory-efficient 📏 compared to a dense transformer while matching it on language modeling tasks.
1
0
1
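Back-of-the-envelope arithmetic behind that claim, with toy numbers of my own (not the paper's): compressing a length-T prefix into T/c chunk tokens shrinks the KV cache, and hence per-token attention work, by roughly the chunk size c.

# Toy numbers, chosen purely for illustration.
T = 8192            # sequence length (tokens)
c = 8               # chunk size (the test-time knob)
d = 4096            # model width
n_layers = 32
bytes_per_elem = 2  # fp16 / bf16

def kv_cache_bytes(n_cached_tokens):
    # K and V tensors of shape (n_cached_tokens, d) per layer
    return 2 * n_cached_tokens * d * n_layers * bytes_per_elem

dense_kv = kv_cache_bytes(T)       # a dense transformer caches every token
cat_kv = kv_cache_bytes(T // c)    # caching compressed chunks instead

print(f"dense KV cache:      {dense_kv / 1e9:.1f} GB")
print(f"compressed KV cache: {cat_kv / 1e9:.1f} GB  ({dense_kv / cat_kv:.0f}x smaller)")
# Per-token attention FLOPs scale with the number of cached entries,
# so they drop by roughly the same factor of c.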
@bicycleman15
Jatin Prakash
18 days
2/ CAT is a conceptually simple architecture built with _just_ two well-known ingredients: dense attention and compression. The key idea is to decode chunks of tokens by attending to compressed chunks of the sequence so far. Everything is learnt end-to-end!
1
0
1
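A toy PyTorch sketch of that key idea (decode a chunk by attending to compressed chunks of the prefix). The mean-pool-plus-linear compressor and all sizes here are my stand-ins, not the paper's actual layer; see the linked repo for the real implementation.

import torch
import torch.nn as nn

class ToyCompressAndAttend(nn.Module):
    # Toy illustration of "decode chunks by attending to compressed chunks".
    # The compressor is a mean-pool plus linear map here; in the paper,
    # everything, including the compression, is learnt end-to-end.
    def __init__(self, d_model=64, n_heads=4, chunk_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.compress = nn.Linear(d_model, d_model)   # stand-in compressor
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, prefix, cur_chunk):
        # prefix:    (batch, n_prev, d_model), n_prev divisible by chunk_size
        # cur_chunk: (batch, chunk_size, d_model), the chunk being decoded
        b, t, d = prefix.shape
        chunks = prefix.view(b, t // self.chunk_size, self.chunk_size, d)
        memory = self.compress(chunks.mean(dim=2))     # one token per past chunk
        out, _ = self.attn(cur_chunk, memory, memory)  # attend only to compressed memory
        return cur_chunk + out

layer = ToyCompressAndAttend()
prefix, cur = torch.randn(1, 16, 64), torch.randn(1, 4, 64)
print(layer(prefix, cur).shape)   # torch.Size([1, 4, 64])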
@bicycleman15
Jatin Prakash
18 days
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test-time? Introducing Compress & Attend Transformers (CATs) that provide you exactly this! 🧵(1/n) 👇
1
11
24
@zachary_horvitz
Zachary Horvitz
1 month
✨Masked Diffusion Language Models✨ are great for reasoning, but not just for the reasons you think! Fast parallel decoding? 🤔 Any-order decoding? 🤨 Plot twist: MDLMs offer A LOT MORE for inference and post-training! 🎢🧵
4
36
165
@ADarmouni
Axel Darmouni
2 months
Super cool work by @AnirudhBuvanesh and @bicycleman15. Basically, when doing RL on a hard task where rewards are sparse and SFT is either not an option or not desired, you SHOULD NOT use dense rewards. Instead, make easier versions of your examples! They tested this on
0
1
3
@bicycleman15
Jatin Prakash
3 months
Enjoyed working with @AnirudhBuvanesh
0
0
2
@bicycleman15
Jatin Prakash
3 months
What to do when you have zero rewards during RL? We benchmarked RL baselines on a simple star-graph task where they underperform in zero reward scenarios. Turns out, a dead simple data-centric intervention of just adding easy samples of the task helps unlock RL training! 👇
@AnirudhBuvanesh
Anirudh Buvanesh
3 months
Zero rewards after tons of RL training? 😞 Before using dense rewards or incentivizing exploration, try changing the data. Adding easier instances of the task can unlock RL training. 🔓📈To know more checkout our blog post here: https://t.co/BPErVcLmP8. Keep reading 🧵(1/n)
1
7
12
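A minimal sketch of the data-centric intervention described in the two tweets above, assuming a star-graph-style task whose difficulty is set by path length. The proportions, parameterization, and helper names are my illustrations, not the blog post's exact recipe.

import random

def make_star_graph_instance(path_length):
    # Hypothetical task spec: shorter paths are easier, so the policy is far
    # more likely to stumble onto a nonzero reward early in RL training.
    return {"task": "star_graph", "path_length": path_length}

def build_rl_training_set(n_samples, easy_fraction=0.3, easy_len=3, hard_len=10):
    # Mix easier instances of the same task into the otherwise hard training set.
    data = []
    for _ in range(n_samples):
        length = easy_len if random.random() < easy_fraction else hard_len
        data.append(make_star_graph_instance(length))
    return data

train_set = build_rl_training_set(1000)
n_easy = sum(inst["path_length"] == 3 for inst in train_set)
print(f"{n_easy} easy / {len(train_set) - n_easy} hard instances in the RL data mix")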
@aasthajn2504
Aastha Jain
5 months
Starting something close to my heart. As a woman in male-dominated spaces, I’ve often longed for more women friendships. Moving to a new city and starting a male-heavy degree has made them even harder to find. (1/n)
1
1
1