Jaskirat Singh @ ICCV2025🌴
@1jaskiratsingh
Followers
326
Following
596
Media
26
Statuses
243
Ph.D. Candidate at Australian National University | Intern @AIatMeta GenAI | @AdobeResearch | Multimodal Fusion Models and Agents | R2E-Gym | REPA-E
Seattle, Washington
Joined June 2018
Can we optimize both the VAE tokenizer and diffusion model together in an end-to-end manner? Short Answer: Yes. 🚨 Introducing REPA-E: the first end-to-end tuning approach for jointly optimizing both the VAE and the latent diffusion model using REPA loss 🚨 Key Idea: 🧠
7
31
170
[Videos are entanglements of space and time.] Around one year ago, we released VSI-Bench, in which we studied visual spatial intelligence: a fundamental but missing pillar of current MLLMs. Today, we are excited to introduce Cambrian-S, our further step that goes beyond visual
Introducing Cambrian-S it’s a position, a dataset, a benchmark, and a model but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
2
15
67
Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
7
44
163
Introducing Cambrian-S it’s a position, a dataset, a benchmark, and a model but above all, it represents our first steps toward exploring spatial supersensing in video. 🧶
13
68
468
It’s an honor to have received the @QEPrize along with my fellow laureates! But it’s also a responsibility. AI’s impact to humanity is in the hands of all of us.
Today, The King presented The Queen Elizabeth Prize for Engineering at St James's Palace, celebrating the innovations which are transforming our world. 🧠 This year’s prize honours seven pioneers whose work has shaped modern artificial intelligence. 🔗 Find out more:
92
94
2K
you can’t build superintelligence without first building supersensing
30
32
288
New eval! Code duels for LMs ⚔️ Current evals test LMs on *tasks*: "fix this bug," "write a test" But we code to achieve *goals*: maximize revenue, cut costs, win users Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
25
88
354
Check out our work ThinkMorph, which thinks in multi-modalities, not just with them.
🚨Sensational title alert: we may have cracked the code to true multimodal reasoning. Meet ThinkMorph — thinking in modalities, not just with them. And what we found was... unexpected. 👀 Emergent intelligence, strong gains, and …🫣 🧵 https://t.co/2GPHnsPq7R (1/16)
1
9
28
Tests certify functional behavior; they don’t judge intent. GSO, our code optimization benchmark, now combines tests with a rubric-driven HackDetector to identify models that game the benchmark. We found that up to 30% of a model’s attempts are non-idiomatic reward hacks, which
1
3
15
We added LLM judge based hack detector to our code optimization evals and found models perform non-idiomatic code changes in upto 30% of the problems 🤯
Tests certify functional behavior; they don’t judge intent. GSO, our code optimization benchmark, now combines tests with a rubric-driven HackDetector to identify models that game the benchmark. We found that up to 30% of a model’s attempts are non-idiomatic reward hacks, which
0
2
7
end-to-end training just makes latent diffusion transformers better! with repa-e, we showed the power of end-to-end training on imagenet. today we are extending it to text-to-image (T2I) generation. #ICCV2025 🌴 🚨 Introducing "REPA-E for T2I: family of end-to-end tuned VAEs for
1
17
42
With simple changes, I was able to cut down @krea_ai's new real-time video gen's timing from 25.54s to 18.14s 🔥🚀 1. FA3 through `kernels` 2. Regional compilation 3. Selective (FP8) quantization Notes are in 🧵 below
5
13
108
Tired to go back to the original papers again and again? Our monograph: a systematic and fundamental recipe you can rely on! 📘 We’re excited to release 《The Principles of Diffusion Models》— with @DrYangSong, @gimdong58085414, @mittu1204, and @StefanoErmon. It traces the core
43
431
2K
Back in 2024, LMMs-Eval built a complete evaluation ecosystem for the MLLM/LMM community, with countless researchers contributing their models and benchmarks to raise the whole edifice. I was fortunate to be one of them: our series of video-LMM works (MovieChat, AuroraCap, VDC)
Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly — with full
2
3
29
I have one PhD intern opening to do research as a part of a model training effort at the FAIR CodeGen team (latest: Code World Model). If interested, email me directly and apply at
metacareers.com
Meta's mission is to build the future of human connection and the technology that makes it possible.
7
27
235
this release was a fun joint collaboration between @canva and the repa-e team. @xingjian_leng , @YunzhongH, @ZhenchangXing, @sainingxie , @LiangZheng_06 , @advadnoun , @torchcompiled 🙏 project page: https://t.co/uUiINuccVn code:
github.com
[ICCV 2025] Official implementation of the paper: REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers - End2End-Diffusion/REPA-E
0
3
8
best part - all E2E-VAEs can be used within few lines of code!! from diffusers import AutoencoderKL vae = AutoencoderKL.from_pretrained("REPA-E/e2e-flux-vae").to("cuda") please see project page for more details. https://t.co/uUiINuccVn (7/n)
1
2
3
but why does end-to-end tuning on imagenet generalize to complex t2i training? answer: better latent space structure. >> we found that this works because e2e-tuning automatically injects better spatial structure and semantic details into the VAE representations. Thus E2E-VAEs
1
2
2
what about impact of end-to-end training on reconstruction performance of vae's? >> end-to-end vae's despite being only tuned on ImageNet 256×256 improve generation performance while maintaining reconstruction fidelity across challenging scenes with multiple faces, subjects and
1
2
3
how does it compare with repa? repa definitely helps. end-to-end tuning helps even more! >> surprisingly, we observed that once end-to-end tuned, E2E-VAEs lead to better performance over repa w/o requiring additional representation alignment losses during T2I training. (4/n)
1
2
3
but do we require massive compute and datasets for end-to-end training for T2I? turns out no! >> we can end-to-end tune the vae on something as simple as imagenet, and use this end-to-end tuned VAE (E2E-VAE) for T2I training. and it just works! E2E-VAEs tuned on just imagenet
1
2
3