Junlin (Hans) Han
@han_junlin
Followers
644
Following
2K
Media
21
Statuses
204
AI research. PhD student at Meta @AIatMeta and Oxford @OxfordTVG
London, England
Joined July 2020
Excited to share our new work: “Learning to See Before Seeing”! 🧠➡️👀 We investigate an interesting phenomenon: how do LLMs, trained only on text, learn about the visual world? Project page: https://t.co/9mQt3qnckL
7
24
149
Congrats to the team! Such a simple yet elegant way to unlock the latent multimodal capabilities of text-trained LLMs. The idea of using sensory prompts to actively steer representation alignment is very cool—both practically useful and conceptually deep.
LLMs, trained only on text, might already know more about other modalities than we realized; we just need to find ways to elicit it. project page: https://t.co/8cIf1DW0OQ w/ @phillip_isola and @thisismyhat
1
0
10
I have been trying to advance 3D generation from video generation since 2024. This project is the ultimate version along this path. It is my great honor to participate. 3D perception is a form of visual common sense. We do not need any explicit 3D reps for generation. (As the bitter lesson says.)
Introducing Kaleido💮 from @AIatMeta — a universal generative neural rendering engine for photorealistic, unified object and scene view synthesis. Kaleido is built on a simple but powerful design philosophy: 3D perception is a form of visual common sense. Following this idea,
3
1
47
Want to go deeper? Our full paper details 6 findings and 3 hypotheses. Beyond what's in this thread, we study the relation to the Platonic Representation Hypothesis, the contributions of language vs. vision, how language can 'hack' vision, and much more! See https://t.co/9mQt3qnckL
0
0
7
Implications: Beyond a better understanding of visual priors, we show that rather than treating vision as an "add-on," stronger models can be built by instilling such priors during language pre-training from scratch, making later multimodal adaptation easier and more efficient.
1
0
5
So how can we pre-train a "vision-aware" LLM? We search between language-favorable and vision-favorable recipes to find a "balanced" one: heavy on reasoning data (>50%) + a small portion of vision-related text (~15%). This recipe boosts visual abilities by clear margins at the 7B-model, 1T-token scale.
1
0
5
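A minimal, illustrative sketch of such a "balanced" pre-training mixture: the >50% reasoning and ~15% vision-related-text shares come from the thread above, while the exact numbers, the "general_web" remainder, and the `sample_source` helper are assumptions for illustration, not the paper's released recipe.

```python
import random

# Assumed split: the >50% reasoning and ~15% vision-related-text figures follow
# the thread; the precise values and the "general_web" remainder are illustrative.
data_mixture = {
    "reasoning":           0.55,  # code, math, scientific papers
    "vision_related_text": 0.15,  # text about shapes, colors, spatial relations, etc.
    "general_web":         0.30,  # broad crawl text, the source of perception priors
}
assert abs(sum(data_mixture.values()) - 1.0) < 1e-9

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next pre-training document from the mixture."""
    r, acc = rng.random(), 0.0
    for source, weight in data_mixture.items():
        acc += weight
        if r < acc:
            return source
    return source  # guard against floating-point round-off

print(sample_source(random.Random(0)))  # prints one of the three source names
```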
Perception Priors emerge from broad, diverse text (like web crawl). This builds a vocabulary for "what things are". Strong perception priors lead to better visual perception skills, such as stronger OCR and object-existence VQA performance (with our proposed MLE-Bench).
1
0
6
Reasoning Priors come from structured reasoning data like code, math, and scientific papers. This teaches the LLM abstract logic, which it can then easily transfer to solve visual tasks that require reasoning.
1
0
7
We show LLMs build 'visual priors'—latent visual capabilities from language pre-training that give them a massive head start for understanding the visual world. These visual priors split into two distinct components, Reasoning and Perception, with very different origins.
1
0
6
We’ll be presenting this Flex3D work at ICML soon! Unfortunately, I won’t be able to attend due to a pending visa, but Filippos @filippos_kok will present it in the 11am session on 17 July (W-216). Drop by if you’re interested in 3D generation!
Releasing Flex3D, a two-stage pipeline for generating high-quality 3D assets in a feed-forward manner, as a further step toward high-quality 3D generation and reconstruction. Project page: https://t.co/F5tuEKjSBs
0
2
41
Cool video generation project that leverages an explicit memory mechanism! Very exciting to see this as a big fan of memory in AI!
Excited to share VMem: a novel memory mechanism for consistent video scene generation 🎞️✨ VMem evolves its understanding of scene geometry to retrieve the most relevant past frames, enabling long-term consistency 🌐 https://t.co/AHBj6j1ecE 🤗 https://t.co/FbUbJHWW4F 1/ 🧵
0
0
5
A new framework for native parallel generation in LLMs, backed by a full-stack release (model, engine, data, and more).
🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: https://t.co/J9osByhWUf 🧵 1/n
0
0
5
improving the method further, adding more content to the paper, and only then releasing it. He doesn’t rush to jump into the next project, but keeps improving the method and updating the codebase. In today’s AI, this kind of patience is truly rare and admirable!!
0
0
6
Beyond the paper itself, one thing I think is especially worth mentioning is that Jianyuan has built up long-term expertise in 3D geometry and has an exceptional ability to stay calm and focused. He always keeps polishing papers after the initial submission.
1
0
5
Congrats to @jianyuan_wang and the team for winning the CVPR Best Paper!!! I was still in the office helping Jianyuan with some final edits just 5 minutes before the submission deadline (around 7am?). Now I know—even a Best Paper’s writing can be rushed out in a week 🤣.
1
6
77
🤩 It's happening today! Join us at the 2nd Workshop on Foundation Models in the Wild — Hall 4, #6, Singapore EXPO! 🔥 10 amazing invited talks 🔥 12 exciting oral presentations 🔥 Cutting-edge ideas and lively discussions 🚀 Don't miss it — come say hi and explore the future
0
13
27
We will be presenting "APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding", a novel encoding method that enables: 🚀Pre-caching Contexts for Fast Inference 🐍Re-using Positions for Long Context Our poster session is located in Hall 3 and Hall 2B,
📢 Announcing our new work "APE: Faster and Longer Context-Augmented Generation via Adaptive Parallel Encoding" @iclr_conf 🚀 Enabling the efficient combination of multiple contexts with negligible prefilling cost 💅 Re-using the context window of LLMs to accommodate more and
0
23
51
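A toy sketch of the parallel-encoding idea (plain PyTorch, not the APE codebase): each context is encoded independently with the same position range so its key/value cache can be precomputed ahead of time, and the query then attends over all the caches at once. The single-head attention, sinusoidal positions, and `precompute_cache` helper here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d = 64  # toy head dimension
torch.manual_seed(0)
Wq = torch.randn(d, d) * d**-0.5
Wk = torch.randn(d, d) * d**-0.5
Wv = torch.randn(d, d) * d**-0.5

def add_positions(x, offset=0):
    # Toy sinusoidal positions; real models would use RoPE or learned embeddings.
    pos = torch.arange(offset, offset + x.shape[0]).float().unsqueeze(1)  # (L, 1)
    freqs = torch.exp(-torch.arange(0, d, 2).float() / d)                 # (d/2,)
    pe = torch.zeros_like(x)
    pe[:, 0::2] = torch.sin(pos * freqs)
    pe[:, 1::2] = torch.cos(pos * freqs)
    return x + pe

def precompute_cache(context_tokens):
    # "Pre-caching contexts": each context is encoded on its own, with positions
    # starting from 0, so its KV cache does not depend on which other contexts
    # it is later combined with and can be computed offline.
    h = add_positions(context_tokens, offset=0)
    return h @ Wk, h @ Wv

contexts = [torch.randn(128, d) for _ in range(3)]   # e.g. three retrieved passages
caches = [precompute_cache(c) for c in contexts]

# "Re-using positions": the query is placed right after ONE context length (128),
# not after 3 * 128 tokens, so many contexts fit inside a fixed position window.
q = add_positions(torch.randn(16, d), offset=128) @ Wq

K = torch.cat([k for k, _ in caches], dim=0)         # combine caches at inference time
V = torch.cat([v for _, v in caches], dim=0)
out = F.scaled_dot_product_attention(q[None], K[None], V[None])
print(out.shape)                                     # torch.Size([1, 16, 64])
```

The actual method involves further steps to align parallel encoding with sequential encoding; this sketch only shows why precomputed caches and re-used positions compose.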
Check out this amazing work if you're interested in 3D reconstruction and geometry. One thought: co-scaling data + architecture (usually self-attention) works for almost all AI tasks and ill-posed problems. The same holds for 3D. Since 3D data is limited, combine and co-train on multiple tasks!
Introducing VGGT (CVPR'25), a feedforward Transformer that directly infers all key 3D attributes from one, a few, or hundreds of images, in seconds! No expensive optimization needed, yet delivers SOTA results for: ✅ Camera Pose Estimation ✅ Multi-view Depth Estimation ✅ Dense
0
1
22
😀We're delighted to announce that the review stage of our 2nd FM-Wild Workshop at ICLR has successfully concluded. We extend our sincere gratitude to all authors and reviewers for their valuable contributions. 👉The accepted papers are now available at:
openreview.net
Welcome to the OpenReview homepage for ICLR 2025 Workshop FM-Wild
0
4
9