Sanjay Subramanian Profile
Sanjay Subramanian

@sanjayssub

Followers: 910
Following: 2K
Media: 13
Statuses: 272

Building/analyzing NLP and vision models. PhD student @berkeley_ai. Formerly: @allen_ai, @penn

Berkeley, CA
Joined September 2019
@sanjayssub
Sanjay Subramanian
3 years
5
44
151
@NickATomlin
Nicholas Tomlin
4 days
A couple years (!) in the making: we’re releasing a new corpus of embodied, collaborative problem solving dialogues. We paid 36 people to play Portal 2’s co-op mode and collected their speech + game recordings Paper: https://t.co/EHB4lbR7Ax Website: https://t.co/FK7tTFuQLt
3
23
71
@XDWang101
XuDong Wang
19 days
Objectness should be user-defined — not human-label-defined! Unsupervised SAM 2 (UnSAMv2) makes it real✨ 1 point + a continuous granularity slider = the mask you want! UnSAMv2 beats SAM2: +16% NoC-90, +26% 1-IoU, +37% AR on 11+ datasets (w/ just 6k unlabeled images)!💪 1/n
1
10
17
@Ritwik_G
Ritwik Gupta 🇺🇦
28 days
I am recruiting Ph.D. students at @umdcs starting Fall 2026! I am looking for students in three broad areas: (1) Physics-integrated computer vision (2) VLMs with constraints (3) Dual-use AI policy We're ranked #3 in AI on @CSrankings! Specific details in 🧵
12
106
406
@ameeshsh
Ameesh Shah
1 month
LLMs have shown a remarkable ability to “self-refine” and learn from their mistakes via in-context learning. But in robotics, most methods are single-shot. How can we bring inference-time adaptation to robot learning? A 🧵:
10
18
130
@BaruaJosh
Josh Barua
1 month
🌍 LLMs can use long chain-of-thought (CoT) to reason in English, but what about other languages? New paper w/ @BerkeleyNLP: We study how scaling, pretraining, post-training & inference affect long CoT across 9 languages. Spoiler: English long CoT ≠ multilingual long CoT 🧵
4
8
22
@tsunghan_wu
Tsung-Han (Patrick) Wu @ NeurIPS 2025
2 months
Humans handle dynamic situations easily, what about models? Turns out, they break in three distinct ways: ⛔ Force Stop → Reasoning leakage (won’t stop) ⚡️ Speedup → Panic (rushed answers) ❓ Info Updates → Self-doubt (reject updates) 👉Check out https://t.co/wKrnsMkiFY
5
21
69
@aomaru_21490
Jiaxin Ge @ Neurips
2 months
✨Introducing ECHO, the newest in-the-wild image generation benchmark! You’ve seen new image models and new use cases discussed on social media, but old benchmarks don’t test them! We distilled this qualitative discussion into a structured benchmark. 🔗 https://t.co/wJmmEY8TFQ
3
32
117
@baifeng_shi
Baifeng
2 months
Generalization is the biggest problem for robotics right now. This includes generalization to unseen objects, environments, tasks… Our recent work shows that generalization to novel objects might not be *that* hard. Specifically, we show that robots, trained on **randomly
2
8
28
@xkungfu
Stephanie Fu
2 months
I’m at #COLM2025 🇨🇦presenting “Hidden in Plain Sight: VLMs overlook their vision representations” as a Poster and Oral! Also honored to win Outstanding Paper here and Best Paper @ CVPR EVAL-FoMo 2! Come chat at poster 12 (Wed AM) about building perceptual representations! (1/3)
3
11
62
@ethanjohnweber
Ethan Weber
2 months
📢 SceneComp @ ICCV 2025 🏝️ 🌎 Generative Scene Completion for Immersive Worlds 🛠️ Reconstruct what you know AND 🪄 Generate what you don’t! 🙌 Meet our speakers @angelaqdai, @holynski_, @jampani_varun, @ZGojcic @taiyasaki, Peter Kontschieder https://t.co/LvONYIK3dz #ICCV2025
2
17
55
@justkerrding
Justin Kerr
3 months
Should robots have eyeballs? Human eyes move constantly and use variable resolution to actively gather visual details. In EyeRobot ( https://t.co/iSL7ZLZcHu) we train a robot eyeball entirely with RL: eye movements emerge from experience driven by task-driven rewards.
8
56
272
@XDWang101
XuDong Wang
3 months
🎉 Excited to share RecA: Reconstruction Alignment Improves Unified Multimodal Models 🔥 Post-train w/ RecA: 8k images & 4 hours (8 GPUs) → SOTA UMMs: GenEval 0.73→0.90 | DPGBench 80.93→88.15 | ImgEdit 3.38→3.75 Code: https://t.co/yFEvJ0Algw 1/n
6
31
80
@LakshyAAAgrawal
Lakshya A Agrawal @ NeurIPS
4 months
How does prompt optimization compare to RL algos like GRPO? GRPO needs 1000s of rollouts, but humans can learn from a few trials—by reflecting on what worked & what didn't. Meet GEPA: a reflective prompt optimizer that can outperform GRPO by up to 20% with 35x fewer rollouts!🧵
47
168
1K
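[Editor's sketch] For readers unfamiliar with reflective prompt optimization, here is a minimal caricature of the loop the tweet alludes to: score a prompt on a few rollouts, ask a model to reflect on the failures in natural language, and keep the rewritten prompt only if it scores better. This is not the GEPA algorithm itself; `llm(prompt, task)` and `score(output, task)` are hypothetical stand-ins for an LLM call and a task metric.

```python
def reflective_prompt_search(llm, score, tasks, prompt, rounds=5):
    """Hedged sketch of reflective prompt optimization (not GEPA itself).

    llm(prompt, task) -> output and score(output, task) -> float in [0, 1]
    are hypothetical stand-ins supplied by the caller.
    """
    def evaluate(p):
        outs = [llm(p, t) for t in tasks]
        return outs, sum(score(o, t) for o, t in zip(outs, tasks))

    best_prompt = prompt
    best_outs, best_score = evaluate(prompt)
    for _ in range(rounds):
        # Collect the rollouts that failed under the current prompt.
        failures = [(t, o) for t, o in zip(tasks, best_outs) if score(o, t) < 1.0]
        if not failures:
            break
        # Ask the model to reflect on the failures and propose a rewritten prompt.
        candidate = llm(
            "These attempts failed. Reflect on why, then rewrite the prompt to fix it:\n"
            f"PROMPT:\n{best_prompt}\nFAILURES:\n{failures}",
            "")
        cand_outs, cand_score = evaluate(candidate)
        # Keep the rewrite only if it actually improves the aggregate score.
        if cand_score > best_score:
            best_prompt, best_outs, best_score = candidate, cand_outs, cand_score
    return best_prompt
```

The point of the contrast with GRPO is that each round here consumes only a handful of rollouts plus one reflection call, rather than thousands of policy-gradient rollouts.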
@brenthyi
Brent Yi
5 months
Had so much fun working on this😊 PyTorch and JAX implementations are both out!
0
8
67
@ruilong_li
Ruilong Li
5 months
For everyone interested in precise 📷camera control 📷 in transformers [e.g., video / world model etc] Stop settling for Plücker raymaps -- use camera-aware relative PE in your attention layers, like RoPE (for LLMs) but for cameras! Paper & code: https://t.co/HPW7moJuvW
10
97
535
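[Editor's sketch] As a rough illustration of "relative PE for cameras" (not the formulation from the linked paper, which the tweet does not spell out): attention logits can be given a bias that depends only on the relative transform between the query frame's camera and the key frame's camera, in the same spirit that RoPE depends only on relative token positions. The linear parametrization `w` below is a made-up placeholder.

```python
import jax
import jax.numpy as jnp

def camera_relative_bias(poses, w):
    # poses: (F, 4, 4) per-frame camera-to-world transforms
    # w: (16,) placeholder weights mapping a flattened relative transform to a scalar bias
    inv = jnp.linalg.inv(poses)                      # (F, 4, 4)
    rel = jnp.einsum('qij,kjl->qkil', poses, inv)    # rel[q, k] = T_q @ T_k^{-1}
    return rel.reshape(poses.shape[0], poses.shape[0], 16) @ w   # (F, F) bias

def camera_aware_attention(q, k, v, poses, w):
    # q, k, v: (F, D) -- one token per frame, for brevity
    logits = q @ k.T / jnp.sqrt(q.shape[-1]) + camera_relative_bias(poses, w)
    return jax.nn.softmax(logits, axis=-1) @ v
```

The contrast with Plücker raymaps is that nothing absolute about either camera is concatenated to the inputs; only the pairwise relative pose enters the attention computation.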
@baifeng_shi
Baifeng
5 months
Understanding a video involves both short-range and long-range understanding. Short-range understanding is more about "motion" and requires system-1 perception. Long-range understanding is more system-2, and requires memory, reasoning, etc. Both have huge room for improvement.
@yukangchen_
Yukang Chen
5 months
Video understanding isn't just recognizing —it demands reasoning across thousands of frames. Meet Long-RL🚀 Highlights: 🧠 Dataset: LongVideo-Reason — 52K QAs with reasoning. ⚡ System: MR-SP - 2.1× faster RL for long videos. 📈 Scalability: Hour-long videos (3,600 frames) RL
1
11
77
@realJessyLin
Jessy Lin
5 months
User simulators bridge RL with real-world interaction // https://t.co/bsrYxVHuVo How do we get the RL paradigm to work on tasks beyond math & code? Instead of designing datasets, RL requires designing environments. Given that most non-trivial real-world tasks involve
10
50
341
@YutongBAI1002
Yutong Bai @ NeurIPS
6 months
What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space—not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to
32
141
538
@sanjayssub
Sanjay Subramanian
6 months
This repo is based heavily on big_vision https://t.co/GeXBU7YZDe ❤️, and the main additions so far are support for more sharding types, ring/flash attention, and a different architecture (LLaVA OneVision/Video)
1
0
2
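[Editor's sketch] For anyone curious what "support for more sharding types" looks like in practice, here is a generic JAX example of placing video-LM activations on a device mesh. This is illustrative only and is not the big_video_lm API; the array shapes are made up.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Put all local devices on a single "data" axis and shard the batch dimension
# of a (batch, frames, tokens_per_frame, hidden) video activation across it.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
x = jnp.zeros((8, 16, 256, 1024))   # made-up shapes
x = jax.device_put(x, NamedSharding(mesh, P("data", None, None, None)))
print(x.sharding)                    # batch axis is split across the mesh
```

Sequence-parallel schemes such as ring attention follow the same pattern but shard the long token axis instead of (or in addition to) the batch axis.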
@sanjayssub
Sanjay Subramanian
6 months
Finally, some collaborators and I have been working on a repo for running inference and fine-tuning on video LMs in JAX, and I hope it can be useful to many others: https://t.co/g1n2TmoKzV Hope to improve it over time, please let me know if you have issues or want other features!
github.com — Run Inference/Finetuning on large Video LMs in JAX - sanjayss34/big_video_lm
2
0
1