Manu Gaur

@gaur_manu

Followers
323
Following
19K
Media
106
Statuses
2K

used to do physics, now multiplying matrices @IIIT_Hyderabad | Incoming @CMU_Robotics

New Delhi, India
Joined May 2012
@gaur_manu
Manu Gaur
9 months
Can RL fine-tuning endow MLLMs with fine-grained visual understanding? Using our training recipe, we outperform SOTA open-source MLLMs on fine-grained visual discrimination with ClipCap, a mere 200M-param simplification of modern MLLMs!!! 🚨 Introducing No Detail Left Behind:
Tweet media one
4
30
128
@gaur_manu
Manu Gaur
20 hours
Everyone asks who is adam but not how is adam 😢.
@2prime_PKU
Yiping Lu
1 day
Anyone knows adam?
Tweet media one
1
0
4
@gaur_manu
Manu Gaur
3 days
Great research work. The thread is a gold mine for anyone interested in understanding diffusion language modelling and how it fares against AR models!
@mihirp98
Mihir Prabhudesai
3 days
🚨 The era of infinite internet data is ending, so we ask: 👉 What's the right generative modelling objective when data, not compute, is the bottleneck? TL;DR: ▶️ Compute-constrained? Train autoregressive models. ▶️ Data-constrained? Train diffusion models. Get ready for 🤿 1/n
Tweet media one
0
1
7
@gaur_manu
Manu Gaur
10 days
Yup, the linear layer can reconstruct the image from the residual stream as long as the image is scaled. It works even if you initialize SigLIP with random weights:
Tweet media one
@cloneofsimo
Simo Ryu
26 days
Nothing special here: this is always the case for randomly initialized CLIP due to its pre-norm nature. For SigLIP, the size of your residual stream is proportional to the magnitude of the initial emb, but it's not the case for the MLP / attention activations due to normalizations. (Say z <- z +
Tweet media one
0
0
4
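A toy sketch of the mechanism being discussed, assuming nothing about SigLIP itself (all module names and sizes below are made up): in a pre-norm transformer each block computes z <- z + f(LayerNorm(z)), so the initial patch embedding stays in the residual stream, and scaling the input up makes that embedding dominate the normalized block outputs. A least-squares linear readout can then recover the patches even with completely random weights.

```python
# Toy pre-norm residual stack with random weights; not SigLIP code.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, patch_dim = 256, 48  # hypothetical sizes

class PreNormBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        # pre-norm residual: the original embedding survives in the stream
        return z + self.mlp(self.norm(z))

embed = nn.Linear(patch_dim, dim)                                 # random, untrained embedding
blocks = nn.Sequential(*[PreNormBlock(dim) for _ in range(12)])   # random, untrained blocks

with torch.no_grad():
    patches = 10.0 * torch.randn(4096, patch_dim)   # stand-in for (scaled) image patches
    z = blocks(embed(patches))                      # final residual stream

# Linear readout fit by least squares: residual stream -> original patches.
readout = torch.linalg.lstsq(z, patches).solution
recon = z @ readout
print("reconstruction MSE:", torch.mean((recon - patches) ** 2).item())
```

Scaling the input (the 10.0 factor above) mirrors the point in the quoted tweet: the embedding grows with the input scale while the LayerNorm-ed MLP/attention contributions do not, so the readout sees a cleaner copy of the input.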
@gaur_manu
Manu Gaur
22 days
Moving beyond MCQ to tasks that evaluate free-form generation is crucial for developing systems that better understand instructions and leverage EXISTING knowledge more effectively. From my work: Gemini knows the prominent point of difference (aces VQA), but fails to independently
Tweet media one
@gaur_manu
Manu Gaur
22 days
MCQ is great for checking the existence of specific knowledge, i.e., if a model fails to answer, it definitely lacks it. However, providing the answer along with the task prompt biases the model's output towards the very concept being evaluated. This raises questions about whether the
0
0
4
@gaur_manu
Manu Gaur
22 days
"On MMMU Pro , a visual question-answering benchmark with 10 choices, we obtain 51% shortcut-accuracy without showing the image or the question". Cambrian did show language shortcuts made by MLLMs on popular VQA datasets, but shortcuts using just the multiple choices is insane!
Tweet media one
@ShashwatGoel7
Shashwat Goel
22 days
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer
Tweet media one
1
3
13
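For readers who want to reproduce a "shortcut-accuracy" number like the one quoted above, here is a minimal sketch of the choices-only baseline: prompt the model with the answer options but neither the image nor the question, and score how often it still lands on the correct option. `ask_model` and the item fields are placeholders, not the actual MMMU-Pro schema or the paper's code.

```python
# Hypothetical choices-only shortcut baseline.
from string import ascii_uppercase

def ask_model(prompt: str) -> str:
    """Placeholder for whatever LLM/MLLM call you use; returns raw text."""
    raise NotImplementedError

def shortcut_accuracy(items):
    correct = 0
    for item in items:
        letters = ascii_uppercase[: len(item["choices"])]
        options = "\n".join(f"{l}. {c}" for l, c in zip(letters, item["choices"]))
        # Note: no image, no question -- only the options are shown.
        prompt = "Pick the most likely correct option. Reply with a single letter.\n" + options
        guess = ask_model(prompt).strip()[:1].upper()
        correct += guess == item["answer"]        # answer stored as a letter, e.g. "C"
    return correct / len(items)

# Example item shape (illustrative):
# {"choices": ["a ladder", "a plane", "a bridge"], "answer": "B"}
```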
@gaur_manu
Manu Gaur
22 days
MCQ is great for checking the existence of specific knowledge, i.e., if a model fails to answer, it definitely lacks it. However, providing the answer along with the task prompt biases the model's output towards the very concept being evaluated. This raises questions about whether the
@ShashwatGoel7
Shashwat Goel
22 days
There's been a hole at the heart of #LLM evals, and we can now fix it. 📜New paper: Answer Matching Outperforms Multiple Choice for Language Model Evaluations. ❗️We found MCQs can be solved without even knowing the question. Looking at just the choices helps guess the answer
Tweet media one
0
3
7
@gaur_manu
Manu Gaur
1 month
feeling dumb that I never thought of it this way. makes total sense: a linear classifier learns the "ideal" vector W_j for each class. with CLIP, we can simply replace the learnt W_j with text embeddings, so the text encoder effectively is a hypernetwork.
Tweet media one
1
0
7
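A small sketch of that view in code, assuming the Hugging Face `transformers` CLIP API (the checkpoint name, image path, and labels are just examples): the normalized text embeddings play the role of the classifier's weight rows W_j, and zero-shot classification is the same x W^T form as a linear classifier, with W generated by the text encoder instead of learned.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # any image

with torch.no_grad():
    txt = processor(text=labels, return_tensors="pt", padding=True)
    img = processor(images=image, return_tensors="pt")
    W = model.get_text_features(**txt)       # (num_classes, d): the "generated" W_j
    x = model.get_image_features(**img)      # (1, d): image feature
    W = W / W.norm(dim=-1, keepdim=True)
    x = x / x.norm(dim=-1, keepdim=True)
    logits = x @ W.T                         # identical in form to a linear classifier

print(labels[logits.argmax(-1).item()])
```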
@gaur_manu
Manu Gaur
1 month
huh, didn't see that coming
Tweet media one
0
0
2
@gaur_manu
Manu Gaur
1 month
the fomo is very real for those outside the frontier labs. Curiosity-driven research remains a healthy escape (for me at least): Feynman style, detached from the outcomes and pursued solely for the love of the game. Whether or not I succeed, I'd certainly enjoy the ride.
@sainingxie
Saining Xie
1 month
@SuvanshSanjeev lol thanks for saying that! I think we're both coming from the same place: wanting to encourage others. I've seen a lot of PhD students feeling anxious, struggling with fomo and wondering if they're making a huge life mistake. but I really believe we can create a culture where
2
0
4
@gaur_manu
Manu Gaur
1 month
This is who runs this account
Tweet media one
@DavidSHolz
David
1 month
This is who runs this account
Tweet media one
0
0
3
@gaur_manu
Manu Gaur
1 month
lance armstrong's favourite policy gradient method!
@wang_jianren
Jianren Wang
1 month
(1/n) Since its publication in 2017, PPO has essentially become synonymous with RL. Today, we are excited to provide you with a better alternative - EPO.
0
0
1
@gaur_manu
Manu Gaur
1 month
“there was no importance to what I was doing, but ultimately there was” :)
Tweet media one
@npparikh
Neal Parikh
1 month
He was at Cornell I think and burned out and he started working on the physics of the plates wobbling on those cafeteria things that stack the plates. Completely pointless. Just fun easy physics for him and it unstuck him.
1
2
4
@gaur_manu
Manu Gaur
1 month
you can take a man out of physics, but you can't take the physics out of the man 😉. Great talk by Kaiming!
Tweet media one
0
0
1
@gaur_manu
Manu Gaur
1 month
so excited for this one!!
Tweet media one
1
0
6
@gaur_manu
Manu Gaur
1 month
"there are cathedrals everywhere for those with the eyes to see".
@giffmana
Lucas Beyer (bl16)
1 month
Only on HuggingFace:
Tweet media one
0
0
2
@gaur_manu
Manu Gaur
1 month
what pytorch hell am I in? adding print statements to forward() doubles my training speed 😭😭 (31 iters/s jumps to 73)
Tweet media one
Tweet media two
Tweet media three
6
1
58
@gaur_manu
Manu Gaur
2 months
really neat (also cheap) solution to an important problem!
@_AmilDravid
Amil Dravid
2 months
We can then mimic the effect of learned register tokens by just shifting the activations arising from the register neurons to a dummy token during the forward pass.
Tweet media one
0
0
1
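A very rough toy of the idea described in the quoted thread, based only on my reading of that sentence and not on the authors' code: append a dummy "register" token, then during the forward pass move the activation mass of a few pre-identified register neurons from the patch tokens onto that dummy token. The layer choice, neuron indices, aggregation, and model class below are all made-up assumptions.

```python
import torch

REGISTER_NEURONS = [7, 42, 101]   # hypothetical neuron indices in one layer

def shift_to_dummy_hook(module, inputs, output):
    # output: (batch, tokens, hidden); the last token is the appended dummy token.
    moved = output[:, :-1, REGISTER_NEURONS].sum(dim=1)   # aggregate from patch tokens (a guess)
    output = output.clone()
    output[:, :-1, REGISTER_NEURONS] = 0.0                # clear on patch tokens
    output[:, -1, REGISTER_NEURONS] += moved              # park on the dummy token
    return output

# Usage sketch: append one extra token to the ViT's input sequence, then attach the
# hook to whichever layer the register neurons were identified in, e.g.:
# handle = vit.blocks[8].mlp.register_forward_hook(shift_to_dummy_hook)
```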
@gaur_manu
Manu Gaur
2 months
really cool work. post-training image generators with VLMs seems like an obvious way of enforcing multimodal control. however, it's expensive; instead you can simply use the gradient of the reward model to update the generator at test time. also highly efficient as you don't need
@graceluo_
Grace Luo
2 months
✨New preprint: Dual-Process Image Generation! We distill *feedback from a VLM* into *feed-forward image generation*, at inference time. The result is flexible control: parameterize tasks as multimodal inputs, visually inspect the images with the VLM, and update the generator.🧵
0
0
2
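A minimal sketch of what test-time gradient guidance can look like, as a paraphrase of the idea rather than the Dual-Process implementation: take a differentiable reward or feedback model and use its gradient to nudge the generator's latent at inference time. `generator`, `reward_model`, and the latent shape are placeholders.

```python
import torch

def guide_latent(generator, reward_model, latent, steps=10, lr=0.05):
    """Update a latent at test time to increase a differentiable reward."""
    latent = latent.clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        image = generator(latent)             # must stay differentiable end-to-end
        reward = reward_model(image).mean()   # scalar score, e.g. a VLM-derived critic
        opt.zero_grad()
        (-reward).backward()                  # gradient ascent on the reward
        opt.step()
    return latent.detach()

# Usage sketch (hypothetical latent size):
# z = torch.randn(1, 512)
# z_star = guide_latent(generator, reward_model, z)
# final_image = generator(z_star)
```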
@gaur_manu
Manu Gaur
2 months
a good prior goes a long way for controllable generation. Cool work by @zeeshank95!
@zeeshank95
Zeeshan khan
2 months
Can text-to-image diffusion models handle surreal compositions beyond their training distribution? 🚨 Introducing ComposeAnything: composite object priors for diffusion models. 📸 More faithful, controllable generations, no retraining required. 🔗 1/9
Tweet media one
0
0
0
@gaur_manu
Manu Gaur
2 months
important read >> huge gains or measurement errors??
@ShashwatGoel7
Shashwat Goel
2 months
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below 🧵👇
Tweet media one
0
0
5