Jatin Prakash
@bicycleman15
Followers
81
Following
458
Media
5
Statuses
29
phd'ing @NYU_Courant | previously @MSFTResearch @iitdelhi
New Delhi, India
Joined June 2021
If you'd deploy model X, does it matter then that model X has more parameters than model Y? Is this a fair comparison b/w model X and Y? If not, what is the "fair" comparison? (2/2 🧵)
0
0
0
A question for you guys 🤔 Suppose:
- model X has more parameters than model Y
- model X outperforms model Y in quality
- model X has lower inference costs than model Y (both compute and memory usage)
Which model would you deploy for your startup: model X or Y? (1/2 🧵)
1
0
0
[LG] Attention and Compression is all you need for Controllably Efficient Language Models J Prakash, A Puli, R Ranganath [New York University] (2025) https://t.co/Ht7cSxXPcd
1
11
42
On a personal note:
- I was quite surprised by how much information can be compressed and decoded accurately, but ONLY if sufficient compute is used
- Using more parameters can result in higher quality AND efficiency, and compression is a way to unlock this interesting interplay
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test-time? Introducing Compress & Attend Transformers (CATs) that provide you exactly this! 🧵(1/n) 👇
0
0
4
11/ Finally, I would like to thank some works from which I drew inspiration for CATs, especially:
- the idea of Matryoshka by @adityakusupati (hot take of mine: if there is a hyper-parameter that gives some trade-off, you can probably "Matryoshka" it 🪆)
- and FlexiViTs by @giffmana
0
0
2
10/ It was great working with @aahladpuli and Rajesh Ranganath on this! Check out the paper ( https://t.co/ImUEltgNHJ) and the single-file, scalable, pure-PyTorch implementations of CATs ( https://t.co/NQsQkA8IgA). Excited to see what people build with this!
1
0
2
8/ With a single training run, CATs enable:
- smaller KV-cache memory for long-context reasoning in increasingly memory-bound workloads
- longer shelf-life of models, by adapting them to run on cheaper hardware
- lower compute under increased traffic
and much more ...
1
0
1
7/ cont’d
✅ parallel and scalable training (unlike sequential recurrent approaches)
✅ end-to-end learning of what to keep in memory (unlike fixed, heuristic attention masks)
✅ BOTH compute & memory savings during generation (unlike NSA)
1
0
1
7/ CATs alleviate a number of issues with current efficient approaches, and provide:
✅ native support for adaptivity at test-time
✅ flexible yet efficient memory as the sequence grows (unlike linear attention, MegaByte or Block Transformers)
(cont’d 👇)
1
0
1
6/ Further, we develop a version of CAT that can be used as a drop-in replacement layer in any architecture 📥 This unlocks lots of interesting possibilities -- starting with natively adaptive hybrid architectures that mix CATs with dense attention, or even linear attention 🧩
1
0
1
5/ This adaptivity is crucial since, a priori, it's not known how many resources a downstream task needs. Writing short emails might not need lots of compute, but code autocomplete needs to recall function names from a long repository, and that may warrant more compute and memory.
1
0
1
4/ Importantly, CATs are trained on multiple chunk sizes at once, which unlocks 🔓 quality-efficiency trade-offs directly at test-time, without any re-training! This enables CATs to adapt to diverse downstream tasks, all requiring varying compute-memory budgets ⚖️
1
0
1
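To make this test-time knob concrete, here is a minimal sketch, assuming a single CAT checkpoint trained over a hypothetical chunk-size set {4, 8, 16}; the `pick_chunk_size` helper and all budget numbers are illustrative, not from the paper or the released code.

```python
# Hypothetical helper illustrating the test-time knob described in the tweet
# above: one trained CAT checkpoint, several chunk sizes, pick per deployment.
TRAINED_CHUNK_SIZES = [4, 8, 16]  # assumed set the model was trained with

def pick_chunk_size(seq_len: int, kv_budget_entries: int) -> int:
    """Return the least-compressed (highest-quality) chunk size whose
    compressed sequence still fits into the given KV-cache budget."""
    for c in sorted(TRAINED_CHUNK_SIZES):
        if seq_len // c <= kv_budget_entries:
            return c
    return max(TRAINED_CHUNK_SIZES)  # fall back to the most aggressive compression

# Plenty of memory: use the smallest chunk size (best quality).
print(pick_chunk_size(seq_len=32_768, kv_budget_entries=8_192))  # -> 4
# Tight memory: same weights, just a coarser compression level.
print(pick_chunk_size(seq_len=32_768, kv_budget_entries=2_048))  # -> 16
```

The point is only that switching budgets is a configuration change at deployment time, not a retraining run.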
3/ Since compression reduces the effective sequence length, compute FLOPs & KV-cache memory shrink by a factor of the chunk size! CATs are up to 3x faster 🚀 and 9x more memory-efficient 📏 compared to a dense transformer while matching it on language modeling tasks.
1
0
1
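To see where the "factor of chunk size" savings in the tweet above come from, here is a back-of-envelope calculation; all model dimensions are invented for illustration and are not the configurations evaluated in the paper.

```python
# Back-of-envelope KV-cache sizing: caching one entry per chunk of c tokens
# instead of one per token shrinks the cache (and the attention FLOPs, which
# scale with the number of cached entries) by roughly a factor of c.
# All dimensions below are made up for illustration.
seq_len, n_layers, n_heads, head_dim, chunk = 32_768, 32, 32, 128, 8
bytes_per_scalar = 2  # fp16

def kv_cache_bytes(entries: int) -> int:
    # keys + values (2x), across every layer and head
    return 2 * n_layers * n_heads * head_dim * bytes_per_scalar * entries

dense = kv_cache_bytes(seq_len)          # one cached entry per token
cat = kv_cache_bytes(seq_len // chunk)   # one cached entry per chunk
print(f"dense transformer  : {dense / 2**30:.1f} GiB")  # 16.0 GiB
print(f"chunked (c={chunk}) : {cat / 2**30:.1f} GiB")   # 2.0 GiB
```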
2/ CAT is a conceptually simple architecture built from _just_ two well-known ingredients: dense attention and compression. The key idea is to decode chunks of tokens by attending to compressed chunks of the sequence so far. Everything is learnt end-to-end!
1
0
1
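For readers who prefer code, below is a deliberately tiny sketch of the decode-chunks-by-attending-to-compressed-chunks idea from the tweet above. It is not the authors' implementation (see the repo linked later in the thread): the mean-pool-plus-linear compressor, the chunk-level-only causality, and every name here are simplifications chosen to keep the example short.

```python
# Toy sketch of the compress-and-attend idea (NOT the released CAT code).
# The compressor is just mean-pool + linear, and causal masking *within* a
# chunk is omitted for brevity.
import torch
import torch.nn as nn

class ToyCAT(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, chunk_size: int = 8):
        super().__init__()
        self.chunk_size = chunk_size
        self.compress = nn.Linear(d_model, d_model)  # stand-in learned compressor
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by chunk_size
        b, t, d = x.shape
        c = self.chunk_size
        chunks = x.view(b, t // c, c, d)
        # One compressed vector per chunk: memory grows as t / c, not t.
        summaries = self.compress(chunks.mean(dim=2))        # (b, t // c, d)
        out = torch.zeros_like(x)
        for i in range(t // c):
            query = chunks[:, i]                             # current chunk (b, c, d)
            # Decode the chunk by attending to compressed chunks seen so far;
            # the very first chunk only has itself to attend to.
            memory = query if i == 0 else summaries[:, :i]
            dec, _ = self.attn(query, memory, memory)
            out[:, i * c : (i + 1) * c] = dec
        return out

x = torch.randn(2, 32, 64)
print(ToyCAT()(x).shape)  # torch.Size([2, 32, 64])
```

The real architecture learns what each chunk summary should contain end-to-end and trains over multiple chunk sizes; this sketch only shows why attending to summaries instead of raw tokens shrinks both the attended memory and the attention compute.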
New paper alert 🚨 What if I told you there is an architecture that provides a _knob_ to control quality-efficiency trade-offs directly at test-time? Introducing Compress & Attend Transformers (CATs) that provide you exactly this! 🧵(1/n) 👇
1
11
24
✨Masked Diffusion Language Models✨ are great for reasoning, but not just for the reasons you think! Fast parallel decoding? 🤔 Any-order decoding? 🤨 Plot twist: MDLMs offer A LOT MORE for inference and post-training! 🎢🧵
4
36
165
Super cool work by @AnirudhBuvanesh and @bicycleman15. Basically, when doing RL on a hard task for which rewards are sparse and SFT is either not an option or not desired, you SHOULD NOT use dense rewards. Instead, make easier versions of your examples! They tested this on
0
1
3
What to do when you have zero rewards during RL? We benchmarked RL baselines on a simple star-graph task where they underperform in zero-reward scenarios. Turns out, a dead-simple data-centric intervention of just adding easy samples of the task helps unlock RL training! 👇
Zero rewards after tons of RL training? 😞 Before using dense rewards or incentivizing exploration, try changing the data. Adding easier instances of the task can unlock RL training. 🔓📈 To know more, check out our blog post here: https://t.co/BPErVcLmP8. Keep reading 🧵(1/n)
1
7
12
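A hedged sketch of the data-centric fix described in the two tweets above, assuming a star-graph-style task whose difficulty can be dialed down; the `make_instance` generator and the mixing ratio are hypothetical, not taken from the blog post.

```python
# Illustrative only: build an RL prompt set that mixes easier instances of the
# task with the hard ones, instead of reshaping the reward. The task generator
# and the 25% mix of easy samples are hypothetical, not from the blog post.
import random

def make_instance(n_branches: int, path_len: int) -> dict:
    """Stand-in generator for a star-graph-style problem; smaller values = easier."""
    return {"n_branches": n_branches, "path_len": path_len}

hard = [make_instance(n_branches=10, path_len=8) for _ in range(900)]
easy = [make_instance(n_branches=3, path_len=2) for _ in range(300)]  # added easy samples

# Sparse-reward RL on `hard` alone may never see a positive reward; mixing in
# easy instances gives the policy something it can actually solve early on.
train_set = hard + easy
random.shuffle(train_set)
print(len(train_set), sum(inst["path_len"] == 2 for inst in train_set))  # 1200 300
```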
Starting something close to my heart. As a woman in male-dominated spaces, I’ve often longed for more women friendships. Moving to a new city and starting a male-heavy degree has made them even harder to find. (1/n)
1
1
1