Ayush Jain
@ayushjain1144
Followers
418
Following
3K
Media
20
Statuses
214
Robotics PhD Student, CMU | MS in Robotics, CMU | B.E. CS, BITS Pilani | 🇮🇳
Pittsburgh, PA
Joined May 2018
1/ Despite having access to rich 3D inputs, embodied agents still rely on 2D VLMs—due to the lack of large-scale 3D data and pre-trained 3D encoders. We introduce UniVLG, a unified 2D-3D VLM that leverages 2D scale to improve 3D scene understanding. https://t.co/DGGtYYPaQi
1
28
136
It looks like @CVPR has implemented a new mandatory "Compute Reporting Form" that must be submitted alongside any paper submission. Though I am sympathetic to the motivations for this change, I am opposed to it for a variety of reasons:
3
32
225
Happy to be on this list! 🙂
There’s no conference without the efforts of our reviewers. Special shoutout to our #ICCV2025 outstanding reviewers 🫡 https://t.co/WYAcXLRXla
0
0
11
Meet MapAnything – a transformer that directly regresses factored metric 3D scene geometry (from images, calibration, poses, or depth) in an end-to-end way. No pipelines, no extra stages. Just 3D geometry & cameras, straight from any type of input, delivering new state-of-the-art
29
129
722
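A purely illustrative sketch (not the MapAnything API or code) of what a "factored" metric scene output, as described in the tweet above, could look like: per-view rays and up-to-scale depth, per-view camera poses, and a single metric scale that compose into metric 3D points. All class and field names here are hypothetical assumptions for illustration.

```python
# Hypothetical illustration of a factored metric 3D scene representation.
# Not MapAnything's actual interface; names and layout are assumptions.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class ViewInputs:
    image: np.ndarray                         # H x W x 3 RGB
    intrinsics: Optional[np.ndarray] = None   # 3 x 3 calibration, if known
    pose: Optional[np.ndarray] = None         # 4 x 4 world-from-camera, if known
    depth: Optional[np.ndarray] = None        # H x W metric depth, if known

@dataclass
class FactoredScene:
    # Factored outputs: geometry and cameras predicted separately,
    # tied together by one global metric scale.
    ray_dirs: List[np.ndarray]            # per-view H x W x 3 unit ray directions
    depth_up_to_scale: List[np.ndarray]   # per-view H x W relative depth
    poses: List[np.ndarray]               # per-view 4 x 4 world-from-camera
    metric_scale: float                   # scalar lifting everything to metres

    def metric_points(self, view: int) -> np.ndarray:
        """Compose the factors into metric 3D world points for one view."""
        d = self.depth_up_to_scale[view] * self.metric_scale
        pts_cam = self.ray_dirs[view] * d[..., None]       # camera frame
        R, t = self.poses[view][:3, :3], self.poses[view][:3, 3]
        return pts_cam @ R.T + t                           # world frame
```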
Check out this amazing new work from @yehonation!
#ICCV2025 Introducing 💡LightSwitch💡 - A multi-view material-relighting diffusion pipeline that directly and efficiently relights any number of input images to a target lighting & does 3D asset relighting with gaussian splatting! 🧵
1
0
3
In RENT, we showed LLMs can improve without access to answers - by maximizing confidence. In this work, we go further: LLMs can improve without even having the questions. Using self-play, one LLM learns to ask challenging questions, while the other LLM uses confidence to solve them.
Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.
0
5
21
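A minimal, hypothetical sketch of the asymmetric self-play loop the two tweets above describe: a proposer model generates questions on a topic, a solver model answers them, and the solver's reward comes from its own confidence rather than external answers. The stub functions, the agreement-based confidence proxy, and the proposer reward shaping are all illustrative assumptions, not the paper's implementation.

```python
# Illustrative self-play sketch; the LLM calls are stand-in stubs.
import random

TOPIC_PROMPT = "Write a challenging arithmetic word problem."  # the only external input

def propose_question(topic: str) -> str:
    # Stub for the proposer LLM; in practice this would be a sampled generation.
    a, b = random.randint(2, 99), random.randint(2, 99)
    return f"{topic} What is {a} * {b}?"

def solve_with_confidence(question: str, n_samples: int = 8) -> float:
    # Stub for the solver LLM: agreement across sampled answers serves here
    # as a proxy for the confidence reward (no ground-truth answer is used).
    answers = [random.choice(["A", "B"]) for _ in range(n_samples)]
    return max(answers.count(a) for a in set(answers)) / n_samples

def self_play_round() -> dict:
    question = propose_question(TOPIC_PROMPT)
    solver_reward = solve_with_confidence(question)
    # One plausible shaping (an assumption): reward the proposer for questions
    # that are challenging but still solvable, i.e. intermediate solver confidence.
    proposer_reward = 1.0 - abs(solver_reward - 0.5) * 2
    return {"question": question,
            "solver_reward": solver_reward,
            "proposer_reward": proposer_reward}

if __name__ == "__main__":
    for _ in range(3):
        print(self_play_round())
```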
Couldn’t be at #ACL2025NLP, but check out our ACL paper from @MSFTResearch! We study how implicit cues in video demos (eye gaze & speech) impact personalized assistance in VLMs. TL;DR: - RGB + gaze > RGB alone - Gaze vs. speech impact is task-specific 📄 https://t.co/r9WMVidmaC
7
9
67
I'm observing a mini Moravec's paradox within robotics: gymnastics that are difficult for humans are much easier for robots than "unsexy" tasks like cooking, cleaning, and assembling. It leads to a cognitive dissonance for people outside the field, "so, robots can parkour &
145
615
3K
Great work from Mihir with lots of nice insights in the thread!
🚨 The era of infinite internet data is ending. So we ask: 👉 What’s the right generative modelling objective when data—not compute—is the bottleneck? TL;DR: ▶️ Compute-constrained? Train Autoregressive models ▶️ Data-constrained? Train Diffusion models Get ready for 🤿 1/n
0
1
7
We are HALFWAY there! Thanks to all those who've kindly contributed 🙏🙏 With Indaba <4 weeks away, let's send all the 25 African researchers to their dream conference! Donate what you can: https://t.co/ryCItIoxNs
The opportunity gap in AI is more striking than ever. We talk way too much about those receiving $100M or whatever for their jobs, but not enough about those asking for <$1k to present their work. For the 3rd year in a row, @ml_collective is raising funds to support @DeepIndaba attendees.
1
54
76
@svlevine Good article. I have three comments: 1. With any hard optimization problem, if you can get into the right ballpark, you save a lot of time searching around. I think that's where human demonstration really helps. 2. When a human watches Roger Federer, they get the gist of what
2
4
47
Happening now!
At #ICML2025, 16 Jul, 11 AM: we present Meta Locate 3D, a model for accurate object localization in 3D environments. Meta Locate 3D can help robots accurately understand their surroundings and interact more naturally with humans. Demo, model, paper: https://t.co/8ZhV21TDxq
0
0
6
Happening right now!!
Can we train a 3D-language multimodality Transformer using 2D VLMs and rendering loss? @iamsashasax will present our new #icml25 paper on Wednesday 2pm at Hall B2-B3 W200. Please come and check! Project Page: https://t.co/MVX6EvS4t4
0
0
5
Can we train a 3D-language multimodality Transformer using 2D VLMs and rendering loss? @iamsashasax will present our new #icml25 paper on Wednesday 2pm at Hall B2-B3 W200. Please come and check! Project Page: https://t.co/MVX6EvS4t4
0
21
133
Becoming an RL diehard in the past year and thinking about RL for most of my waking hours inadvertently taught me an important lesson about how to live my own life. One of the big concepts in RL is that you always want to be “on-policy”: instead of mimicking other people’s
127
347
3K
At #ICML2025, 16 Jul, 11 AM: we present Meta Locate 3D, a model for accurate object localization in 3D environments. Meta Locate 3D can help robots accurately understand their surroundings and interact more naturally with humans. Demo, model, paper: https://t.co/8ZhV21TDxq
5
15
54
I'll be at #ICML2025 to present UniVLG! Excited to meet old friends and make new ones, especially people working in the Indian research ecosystem. Feel free to reach out if you would like to chat!
1/ Despite having access to rich 3D inputs, embodied agents still rely on 2D VLMs—due to the lack of large-scale 3D data and pre-trained 3D encoders. We introduce UniVLG, a unified 2D-3D VLM that leverages 2D scale to improve 3D scene understanding. https://t.co/DGGtYYPaQi
0
1
15
The opportunity gap in AI is more striking than ever. We talk way too much about those receiving $100M or whatever for their jobs, but not enough about those asking for <$1k to present their work. For the 3rd year in a row, @ml_collective is raising funds to support @DeepIndaba attendees.
16
120
236
We have an open position at Apple MLR to work on scalable and efficient generative models that perform across diverse data domains—including images, 3D, video, graphs, etc. We care deeply about simplifying modeling pipelines and developing powerful and scalable training recipes.
2
14
65
AllTracker: Efficient Dense Point Tracking at High Resolution. If you're using any point tracker in any project, this is likely a drop-in upgrade—improving speed, accuracy, and density, all at once.
2
38
240