
Sanjay Subramanian
@sanjayssub
Followers: 895 · Following: 2K · Media: 13 · Statuses: 259
Building/analyzing NLP and vision models. PhD student @berkeley_ai. Formerly: @allen_ai, @penn
Berkeley, CA
Joined September 2019
New paper at #acl2023nlp! "Modular Visual Question Answering via Code Generation". With @medhini_n @kushaltk1248 @KevinYa33964384 @NagraniArsha @CordeliaSchmid @andyzengtweets @trevordarrell, and Dan Klein (@berkeley_ai/@GoogleAI)! 📜 💻
5 · 44 · 152
RT @LakshyAAAgrawal: How does prompt optimization compare to RL algos like GRPO? GRPO needs 1000s of rollouts, but humans can learn from a…
0 · 166 · 0
RT @brenthyi: Had so much fun working on this 😊. PyTorch and JAX implementations are both out!
0 · 8 · 0
RT @ruilong_li: For everyone interested in precise 📷 camera control 📷 in transformers [e.g., video / world model etc]. Stop settling for Plü…
0 · 81 · 0
RT @baifeng_shi: Understanding a video involves both short-range and long-range understanding. Short-range understanding is more about "mo…
0 · 12 · 0
RT @realJessyLin: User simulators bridge RL with real-world interaction. How do we get the RL paradigm to work…
0 · 46 · 0
RT @YutongBAI1002: What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1)…
0 · 131 · 0
This repo is based heavily on big_vision ❤️, and the main additions so far are support for more sharding types, ring/flash attention, and a different architecture (LLaVA OneVision/Video).
1 · 0 · 2
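The sharding support mentioned above can be illustrated with plain JAX. This is a minimal, generic `jax.sharding` sketch, not the repo's actual API; the mesh axis name `data` is my own placeholder:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over all available devices; on a TPU/GPU pod this
# spans many chips, while on CPU it degenerates to a single device.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

# Shard a batch of activations along the mesh's "data" axis
# (rows split across devices, columns replicated on every device).
x = jnp.ones((8, 128))
x_sharded = jax.device_put(x, NamedSharding(mesh, P("data", None)))

# jit-compiled computations consume sharded arrays directly, and the
# compiler propagates the sharding to the output.
y = jax.jit(lambda a: a * 2.0)(x_sharded)
```

The same `PartitionSpec` mechanism generalizes to sharding model weights or attention heads, which is where variants like ring attention come in.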
Finally, some collaborators and I have been working on a repo for running inference and fine-tuning of video LMs in JAX, and I hope it can be useful to many others: I hope to improve it over time, so please let me know if you have issues or want other features!
github.com
Run Inference/Finetuning on large Video LMs in JAX - sanjayss34/big_video_lm
2 · 0 · 1
Also be sure to check out this awesome work on automated slide generation led by @aomaru_21490 and @ZhiruoW on Friday at Poster Session 1 - ExHall D #262.
Introducing "AutoPresent: Designing Structured Visuals From Scratch". We employ code generation to create structured, high-quality presentation slides from scratch! 📄 🤗 🔗 @berkeley_ai @LTIatCMU
1 · 1 · 3
Excited to be at CVPR! Check out our work on using VLMs for pose estimation on Friday at Poster Session 2 - ExHall D #169. #CVPR2025.
Excited to share some recent work! "Pose Priors from Language Models". We show how to use multimodal LMs to improve 3D human pose estimates in situations with physical contact. Joint work w/ Evonne Ng, @LeaMue27, Dan Klein (@BerkeleyNLP), @shiryginosar, @trevordarrell
2 · 0 · 12
RT @Ritwik_G: Ever wondered if the way we feed image patches to vision models is the best way? The standard row-by-row scan isn't always op…
0 · 33 · 0
RT @ZhongRuiqi: Last day of PhD! I pioneered using LLMs to explain datasets & models. It's used by interp at @OpenAI and societal impact @An…
0 · 38 · 0
RT @NickATomlin: The long-term goal of AI is to build models that can handle arbitrary tasks, not just ones they’ve been trained on. We hop…
0 · 30 · 0
RT @jiayi_pirate: We explore a new dimension in scaling reasoning models in Adaptive Parallel Reasoning. APR lets LMs learn to orchestrate…
0 · 73 · 0
RT @KushtimusPrime: NeRFs and Gaussian Splats excel at static 3D modeling, but robots work in dynamic, unpredictable environments. POGS (Per…
0 · 17 · 0
RT @baifeng_shi: Next-gen vision pre-trained models shouldn’t be short-sighted. Humans can easily perceive 10K x 10K resolution. But today…
0 · 153 · 0
RT @ZinengTang: We are thrilled to announce TULIP! 🌷 A state-of-the-art vision-language encoder coupled with generat…
0 · 69 · 0
RT @enfleisig: How does model calibration stand up against humans? We ran live competitions, comparing model and human calibration, to crea…
0 · 3 · 0
RT @aomaru_21490: Introducing "AutoPresent: Designing Structured Visuals From Scratch". We employ code generation to create structured, hig…
0 · 69 · 0
RT @LeaMue27: - Humans and Structure from Motion - We jointly reconstruct 3D humans, scene point cloud, and cameras from images captured w…
0 · 65 · 0