Sanjay Subramanian (@sanjayssub)
910 Followers · 2K Following · 13 Media · 272 Statuses
Building/analyzing NLP and vision models. PhD student @berkeley_ai. Formerly: @allen_ai, @penn
Berkeley, CA · Joined September 2019
New paper at #acl2023nlp! "Modular Visual Question Answering via Code Generation" With @medhini_n @kushaltk1248 @KevinYa33964384 @NagraniArsha @CordeliaSchmid @andyzengtweets @trevordarrell Dan Klein (@berkeley_ai/@GoogleAI)! 📜 https://t.co/O4jJDc4prj 💻 https://t.co/kwGyYMJ8le
5 replies · 44 reposts · 151 likes
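The core idea of the paper, as the title suggests, is that an LLM writes a short program composing pretrained vision modules to answer each question. Below is a minimal runnable sketch of that pattern, with stubs standing in for the code LLM and the vision primitives; the names `call_code_llm`, `find`, and `simple_vqa` are illustrative, not the paper's API.

```python
# Sketch of VQA via code generation: an LLM emits a program over vision
# primitives, and we exec() it. All three functions below are stubs.

def call_code_llm(prompt: str) -> str:
    # Stub: a real system would query a code LLM with the question plus
    # in-context examples of programs built from the available primitives.
    return (
        "def answer_query(image):\n"
        "    # e.g. count the dogs the detector finds\n"
        "    return len(find(image, 'dog'))\n"
    )

def find(image, label: str) -> list:
    return []  # stub for an open-vocabulary detector module

def simple_vqa(image, question: str) -> str:
    return ""  # stub for an end-to-end VQA module

def answer(image, question: str):
    program = call_code_llm(f"# Question: {question}")
    scope = {"find": find, "simple_vqa": simple_vqa}
    exec(program, scope)                 # defines answer_query() in `scope`
    return scope["answer_query"](image)

print(answer(None, "How many dogs are there?"))  # -> 0 with these stubs
```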
A couple years (!) in the making: we’re releasing a new corpus of embodied, collaborative problem solving dialogues. We paid 36 people to play Portal 2’s co-op mode and collected their speech + game recordings. Paper: https://t.co/EHB4lbR7Ax Website: https://t.co/FK7tTFuQLt
3 replies · 23 reposts · 71 likes
Objectness should be user-defined — not human-label-defined! Unsupervised SAM 2 (UnSAMv2) makes it real✨ 1 point + a continuous granularity slider = the mask you want! UnSAMv2 beats SAM2: +16% NoC-90, +26% 1-IoU, +37% AR on 11+ datasets (w/ just 6k unlabeled images)!💪 1/n
1 reply · 10 reposts · 17 likes
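Reading the interaction model described above, a single click plus a continuous scalar selects the mask scale. A hypothetical interface sketch; the names and signature are assumptions, not the UnSAMv2 code:

```python
# Hypothetical point + granularity prompting interface (illustrative only).
from dataclasses import dataclass

@dataclass
class MaskQuery:
    point_xy: tuple[int, int]  # a single positive click
    granularity: float         # continuous scale in [0, 1]: part -> whole

def predict_mask(image, query: MaskQuery):
    # Stub: a real model would condition its mask decoder on both the click
    # and the granularity scalar, returning an HxW boolean mask.
    assert 0.0 <= query.granularity <= 1.0
    return None

# One click, a spectrum of masks as the slider moves.
for g in (0.1, 0.5, 0.9):
    predict_mask(image=None, query=MaskQuery(point_xy=(320, 240), granularity=g))
```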
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
12 replies · 106 reposts · 406 likes
LLMs have shown a remarkable ability to “self-refine” and learn from their mistakes via in-context learning. But in robotics, most methods are single-shot. How can we bring inference-time adaptation to robot learning? A 🧵:
10 replies · 18 reposts · 130 likes
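One common recipe for the kind of inference-time adaptation the thread asks about, sketched below with stubs (illustrative only, not necessarily the thread's method): keep reflections on failed rollouts in context so the next attempt can adapt without any weight update.

```python
# In-context self-refinement loop for a robot policy. All three helpers are
# stubs standing in for the environment and the LLM/VLM policy.

def rollout(plan: str) -> tuple[bool, str]:
    # Stub environment: always fails with the same feedback.
    return False, "gripper closed before reaching the handle"

def propose_plan(task: str, reflections: list[str]) -> str:
    # Stub LLM policy: conditions on past reflections via the prompt.
    return f"plan for {task!r} given {len(reflections)} reflections"

def reflect(feedback: str) -> str:
    # Stub LLM reflection over the failed trajectory.
    return f"Lesson: {feedback}; approach the handle before closing."

def solve(task: str, max_attempts: int = 3) -> bool:
    reflections: list[str] = []
    for _ in range(max_attempts):
        success, feedback = rollout(propose_plan(task, reflections))
        if success:
            return True
        reflections.append(reflect(feedback))  # the in-context learning signal
    return False

solve("open the drawer")
```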
🌍 LLMs can use long chain-of-thought (CoT) to reason in English, but what about other languages? New paper w/ @BerkeleyNLP: We study how scaling, pretraining, post-training & inference affect long CoT across 9 languages. Spoiler: English long CoT ≠ multilingual long CoT 🧵
4 replies · 8 reposts · 22 likes
Humans handle dynamic situations easily; what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉Check out https://t.co/wKrnsMkiFY
5 replies · 21 reposts · 69 likes
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
3 replies · 32 reposts · 117 likes
Generalization is the biggest problem for robotics right now. This includes generalization to unseen objects, environments, tasks… Our recent work shows that generalization to novel objects might not be *that* hard. Specifically, we show that robots, trained on **randomly
2 replies · 8 reposts · 28 likes
I’m at #COLM2025 🇨🇦presenting “Hidden in Plain Sight: VLMs overlook their vision representations” as a Poster and Oral! Also honored to win Outstanding Paper here and Best Paper @ CVPR EVAL-FoMo 2! Come chat at poster 12 (Wed AM) about building perceptual representations! (1/3)
3 replies · 11 reposts · 62 likes
📢 SceneComp @ ICCV 2025 🏝️ 🌎 Generative Scene Completion for Immersive Worlds 🛠️ Reconstruct what you know AND 🪄 Generate what you don’t! 🙌 Meet our speakers @angelaqdai, @holynski_, @jampani_varun, @ZGojcic @taiyasaki, Peter Kontschieder https://t.co/LvONYIK3dz
#ICCV2025
2 replies · 17 reposts · 55 likes
Should robots have eyeballs? Human eyes move constantly and use variable resolution to actively gather visual details. In EyeRobot ( https://t.co/iSL7ZLZcHu) we train a robot eyeball entirely with RL: eye movements emerge from experience, driven by task rewards.
8 replies · 56 reposts · 272 likes
🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models 🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs: GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75 Code: https://t.co/yFEvJ0Algw 1/n
6 replies · 31 reposts · 80 likes
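My reading of the recipe above: the model regenerates an image from its own visual-encoder embeddings, and the reconstruction error is the post-training signal. A toy sketch with a stub encoder and generator, not the RecA code:

```python
# Schematic reconstruction-alignment objective (stubs, shapes are toy).
import numpy as np

def visual_encoder(image: np.ndarray) -> np.ndarray:
    return image.mean(axis=(0, 1))           # stub "understanding" embedding

def generate_from(embedding: np.ndarray) -> np.ndarray:
    return np.ones((8, 8, 3)) * embedding    # stub "generation" branch

def reconstruction_alignment_loss(image: np.ndarray) -> float:
    emb = visual_encoder(image)       # the model embeds the image itself...
    recon = generate_from(emb)        # ...then regenerates from that embedding
    return float(np.mean((recon - image) ** 2))  # dense self-supervision

print(reconstruction_alignment_loss(np.full((8, 8, 3), 0.5)))  # -> 0.0 here
```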
How does prompt optimization compare to RL algos like GRPO? GRPO needs 1000s of rollouts, but humans can learn from a few trials—by reflecting on what worked & what didn't. Meet GEPA: a reflective prompt optimizer that can outperform GRPO by up to 20% with 35x fewer rollouts!🧵
47 replies · 168 reposts · 1K likes
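A toy version of the reflective loop described above (illustrative stubs; GEPA itself is more involved): score the prompt on a handful of rollouts, have an LLM reflect on the failure traces in natural language, and keep an edit only if it improves the score.

```python
# Reflective prompt optimization, reduced to a greedy hill-climb. Both
# helpers are stubs for task evaluation and an LLM-written rewrite.

def score(prompt: str) -> tuple[float, list[str]]:
    # Stub: run the prompt on a few task instances; return accuracy + traces.
    return 0.5, ["trace of a failed rollout"]

def reflect_and_edit(prompt: str, failures: list[str]) -> str:
    # Stub: a real optimizer asks an LLM to read the failures and rewrite.
    return prompt + " Think step by step."

def optimize(prompt: str, budget: int = 10) -> str:
    best, (best_score, failures) = prompt, score(prompt)
    for _ in range(budget):  # tens of rollouts rather than thousands
        candidate = reflect_and_edit(best, failures)
        cand_score, cand_failures = score(candidate)
        if cand_score > best_score:
            best, best_score, failures = candidate, cand_score, cand_failures
    return best

print(optimize("Answer the question concisely."))
```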
Had so much fun working on this😊 PyTorch and JAX implementations are both out!
(quoted tweet, reproduced in full below)
0 replies · 8 reposts · 67 likes
For everyone interested in precise 📷camera control 📷 in transformers [e.g., video / world model etc] Stop settling for Plücker raymaps -- use camera-aware relative PE in your attention layers, like RoPE (for LLMs) but for cameras! Paper & code: https://t.co/HPW7moJuvW
10 replies · 97 reposts · 535 likes
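One way to read "relative PE for cameras", sketched below as a toy: give attention only the relative transform P_i^{-1} P_j between token cameras, never absolute poses or raymaps. Here it enters as an additive logit bias for simplicity; the paper's formulation presumably acts on the queries and keys directly, as RoPE does, so treat this as an illustration of "relative, not absolute", not the method itself.

```python
# Toy camera-relative attention bias (not the paper's formulation).
import numpy as np

def relative_pose_bias(poses: np.ndarray) -> np.ndarray:
    """poses: (N, 4, 4) world-from-camera matrices, one per token/frame.
    Returns an (N, N) bias from relative translation magnitude (toy choice)."""
    inv = np.linalg.inv(poses)                        # camera-from-world
    rel = np.einsum('iab,jbc->ijac', inv, poses)      # P_i^{-1} P_j, (N,N,4,4)
    return -np.linalg.norm(rel[..., :3, 3], axis=-1)  # nearer cameras score higher

def attention(q, k, v, poses):
    logits = q @ k.T / np.sqrt(q.shape[-1]) + relative_pose_bias(poses)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

N, d = 4, 8
rng = np.random.default_rng(0)
poses = np.repeat(np.eye(4)[None], N, 0)
poses[:, 0, 3] = np.arange(N)                         # cameras along the x-axis
out = attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                rng.normal(size=(N, d)), poses)
```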
Understanding a video involves both short-range and long-range understanding. Short-range understanding is more about "motion" and requires system-1 perception. Long-range understanding is more system-2, and requires memory, reasoning, etc. Both have huge room for improvement.
Video understanding isn't just recognizing; it demands reasoning across thousands of frames. Meet Long-RL🚀 Highlights: 🧠 Dataset: LongVideo-Reason — 52K QAs with reasoning. ⚡ System: MR-SP - 2.1× faster RL for long videos. 📈 Scalability: RL on hour-long videos (3,600 frames)
1 reply · 11 reposts · 77 likes
User simulators bridge RL with real-world interaction // https://t.co/bsrYxVHuVo How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve
10 replies · 50 reposts · 341 likes
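A gym-style sketch of the idea above (stubs throughout, not the paper's setup): the environment's step function is an LLM playing the user, so multi-turn tasks yield RL rollouts without collecting new human data.

```python
# User-simulator environment: the "world" the agent acts in is a simulated
# user. The simulator here is a trivial stub instead of an LLM call.

class SimulatedUserEnv:
    def __init__(self, goal: str):
        self.goal, self.history = goal, []

    def step(self, agent_message: str):
        self.history.append(("agent", agent_message))
        user_reply = self._simulate_user()    # the LLM-as-user stub
        self.history.append(("user", user_reply))
        done = "thanks" in user_reply          # toy success check / reward
        return user_reply, float(done), done

    def _simulate_user(self) -> str:
        # Stub: a real simulator prompts an LLM with the goal + full history.
        return "thanks, that works" if len(self.history) > 3 else "not quite"

env = SimulatedUserEnv("book a table for two")
for msg in ["Which day works?", "Booked Friday 7pm.", "Confirmation sent."]:
    reply, reward, done = env.step(msg)
    if done:
        break
```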
What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space—not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to
32 replies · 141 reposts · 538 likes
This repo is based heavily on big_vision https://t.co/GeXBU7YZDe ❤️, and the main additions so far are support for more sharding types, ring/flash attention, and a different architecture (LLaVA OneVision/Video)
1 reply · 0 reposts · 2 likes
Finally, some collaborators and I have been working on a repo for running inference and fine-tuning on video LMs in JAX, and I hope it can be useful to many others: https://t.co/g1n2TmoKzV Hope to improve it over time, please let me know if you have issues or want other features!
github.com · Run Inference/Finetuning on large Video LMs in JAX - sanjayss34/big_video_lm
2 replies · 0 reposts · 1 like