
Ellis Brown
@_ellisbrown
Followers 670 · Following 5K · Media 27 · Statuses 369
Intern @Meta FAIR, PhD Student @NYU_Courant w/ Profs @sainingxie @rob_fergus. Prev @Ai2Prior, @CarnegieMellon
NYC
Joined January 2016
Cambrian-1 🪼 Through a vision-centric lens, we study every aspect of building Multimodal LLMs except the LLMs themselves. As a byproduct, we achieve superior performance at the 8B, 13B, 34B scales. 📄 https://t.co/e0e4LpWuOz 🌎 https://t.co/OP3gwG6FYE 🤗 https://t.co/P1vtX2UWkT
huggingface.co
Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
2
31
133
How can an agent reverse engineer the underlying laws of an unknown, hostile & stochastic environment in “one life”, without millions of steps + human-provided goals / rewards? In our work, we: 1️⃣ infer an executable symbolic world model (a probabilistic program capturing
2
35
81
rest in px, VAE. time of death: 10/13/2025
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
0
1
9
Introducing Representation Autoencoders (RAE)! We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)
6
55
479
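A minimal sketch of the RAE idea as described in the announcement: a frozen pretrained representation encoder (e.g. DINOv2 / SigLIP2) supplies the latent space, and only a ViT-style decoder is trained to map those features back to pixels. Module names and shapes below are illustrative stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, PATCH = 384, 14

class FrozenRepEncoder(nn.Module):
    """Stand-in for a frozen pretrained representation encoder (e.g. DINOv2 / SigLIP2)."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        layer = nn.TransformerEncoderLayer(DIM, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)          # frozen: the latent space is fixed

    def forward(self, x):
        tok = self.patchify(x)               # (B, DIM, gh, gw)
        gh, gw = tok.shape[-2:]
        return self.blocks(tok.flatten(2).transpose(1, 2)), (gh, gw)

class ViTDecoder(nn.Module):
    """The only trained part: maps encoder tokens back to pixels."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_pix = nn.Linear(DIM, 3 * PATCH * PATCH)

    def forward(self, tok, grid):
        gh, gw = grid
        out = self.to_pix(self.blocks(tok))                      # (B, N, 3*P*P)
        out = out.view(-1, gh, gw, 3, PATCH, PATCH)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(-1, 3, gh * PATCH, gw * PATCH)

enc, dec = FrozenRepEncoder(), ViTDecoder()
img = torch.rand(2, 3, 224, 224)
tok, grid = enc(img)
loss = F.mse_loss(dec(tok, grid), img)       # only the decoder receives gradients
loss.backward()
```

The latent tokens produced by the frozen encoder are what a diffusion transformer would then be trained on, in place of a VAE latent.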
Excited to introduce DiffuseNNX, a comprehensive JAX/Flax NNX-based library for diffusion and flow matching! It supports multiple diffusion / flow-matching frameworks, Autoencoders, DiT variants, and sampling algorithms. Repo: https://t.co/zOcA6nyrcM Delve into details below!
github.com
A comprehensive JAX/NNX library for diffusion and flow matching generative algorithms, featuring DiT (Diffusion Transformer) and its variants as the primary backbone with support for ImageNet train...
4
52
219
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
55
325
2K
Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow
7
77
397
.@ARRay693 is presenting SAT at #COLM2025 tuesday @ 11am! go find him at poster #70 to chat about spatial intelligence 🪐🪐🪐
SAT provides free dynamic embodied experiences for models (that currently only see disembodied web data). Excited to share this work to advance research in creating more spatially aware AI models! 🤖
0
1
3
SAT provides free dynamic embodied experiences for models (that currently only see disembodied web data). Excited to share this work to advance research in creating more spatially aware AI models! 🤖
🚀Excited to introduce our latest work- SAT: Spatial Aptitude Training, a groundbreaking approach to enhance spatial reasoning in Multimodal Language Models (MLMs). SAT isn't just about understanding static object positions but dives deep into dynamic spatial reasoning. 🧵
1
4
8
Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench ( https://t.co/WqKlwLrWQJ) and Blink ( https://t.co/HLyogAYaTL), which repurpose core vision tasks into VQA format. These benchmarks do help
arxiv.org
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by...
So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it. Most benchmarks have overwhelmingly focused on reasoning and digital domains, without
4
15
65
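For concreteness, here is a toy example of what "repurposing a core vision task into VQA format" can look like, in the spirit of CV-Bench / Blink. The field names and question template are made up for illustration, not either benchmark's actual schema.

```python
# Illustrative only: recast a relative-depth annotation as a 2-way VQA item.
def depth_order_to_vqa(image_path, closer_obj, farther_obj):
    """Turn a relative-depth label into a multiple-choice VQA question."""
    return {
        "image": image_path,
        "question": f"Which object is closer to the camera: the {closer_obj} or the {farther_obj}?",
        "choices": [closer_obj, farther_obj],
        "answer": closer_obj,   # ground truth comes from the original depth labels
    }

item = depth_order_to_vqa("kitchen_042.jpg", "mug", "window")
print(item["question"])
```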
Compression is the heart of intelligence. From Occam to Kolmogorov—shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, token budget T & target quality 𝜖, KARL finds the smallest t≤T to reconstruct it within 𝜖 🧵
14
63
354
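Reading the tweet literally, the selection criterion can be sketched as the search below: pick the smallest token count t ≤ T whose reconstruction error is within 𝜖. The `encode_to_tokens` / `decode_from_tokens` functions are hypothetical stand-ins for an adaptive tokenizer; how KARL actually realizes this is described in the paper.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

def smallest_sufficient_budget(x, encode_to_tokens, decode_from_tokens, T, eps):
    # Linear scan over budgets; a binary search works if error is monotone in t.
    for t in range(1, T + 1):
        tokens = encode_to_tokens(x, num_tokens=t)
        if reconstruction_error(x, decode_from_tokens(tokens)) <= eps:
            return t                 # Kolmogorov-style: shortest sufficient code
    return T                         # fall back to the full budget
```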
impressive results! seems like an exciting route for inference-time scaling. also, kudos for the intuitive explanations / visualizations — very accessible resources in the paper+blog for understanding how EBMs work 🙇‍♂️
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
2
0
15
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
46
259
2K
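A minimal sketch of the energy-based "thinking" loop being referenced, as EBMs are generally described (my paraphrase, not the EBT paper's exact procedure): the model scores (context, candidate) pairs with a scalar energy, and inference refines the candidate by gradient descent on that energy, so extra refinement steps trade compute for quality — hence the inference-time-scaling angle.

```python
import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    """Toy energy function over (context, candidate) pairs."""
    def __init__(self, d=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * d, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, context, candidate):
        return self.net(torch.cat([context, candidate], dim=-1)).squeeze(-1)  # scalar energy

def think(model, context, steps=10, lr=0.1):
    y = torch.zeros_like(context, requires_grad=True)        # initial guess
    for _ in range(steps):                                    # more steps = more "thinking"
        energy = model(context, y).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = (y - lr * grad).detach().requires_grad_(True)     # descend the energy landscape
    return y.detach()

model = ToyEnergyModel()
ctx = torch.randn(4, 32)
answer = think(model, ctx, steps=20)
```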
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
14
52
Honored to be recognized as a #CVPR2025 Outstanding Reviewer!
Behind every great conference is a team of dedicated reviewers. Congratulations to this year’s #CVPR2025 Outstanding Reviewers! https://t.co/z8w4YJKTep
0
0
34
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
17
22
349
I’m very excited to introduce Vy, the AI that sees and acts on your computer! It’s a first glimpse of what we’ve been working on at @Vercept_ai! Early computers trapped the world's best experts in low-level tasks–loading code, managing memory, fighting errors. Progress
12
19
78
Excited to be presenting at #ICLR2025 at 10am today on how generative classifiers are much more robust to distribution shift. Come by to chat and say hello!
2
7
93
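For context, "generative classifier" here means classifying via class-conditional densities and Bayes' rule rather than learning p(y|x) directly. A toy sketch of that idea (Gaussian class-conditionals, not the models used in the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative_classifier(X, y):
    """Fit one Gaussian p(x|c) per class plus class priors p(c)."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    densities = {
        c: multivariate_normal(
            X[y == c].mean(0),
            np.cov(X[y == c].T) + 1e-3 * np.eye(X.shape[1]),  # regularize covariance
        )
        for c in classes
    }
    return classes, priors, densities

def predict(x, classes, priors, densities):
    # Bayes' rule: argmax_c log p(x|c) + log p(c)
    scores = [densities[c].logpdf(x) + np.log(priors[c]) for c in classes]
    return classes[int(np.argmax(scores))]
```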
We find training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
9
67
417
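One way to read that claim, as a hedged sketch with made-up module names: keep the MLLM entirely frozen and train only a small projector that maps its hidden states into the conditioning space of an image generator, so the understanding features drive pixel output without touching the MLLM's weights.

```python
import torch
import torch.nn as nn

class FrozenMLLM(nn.Module):
    """Stand-in for a frozen multimodal LLM that emits hidden states."""
    def __init__(self, d=512):
        super().__init__()
        self.embed = nn.Embedding(1000, d)
        self.block = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)                       # MLLM stays frozen

    def forward(self, token_ids):
        return self.block(self.embed(token_ids))          # (B, L, d) hidden states

class GenProjector(nn.Module):
    """The only trainable piece: hidden states -> image-generator conditioning."""
    def __init__(self, d=512, d_cond=256, n_cond=4):
        super().__init__()
        self.proj = nn.Linear(d, d_cond * n_cond)
        self.n_cond, self.d_cond = n_cond, d_cond

    def forward(self, h):
        pooled = h.mean(dim=1)                            # (B, d)
        return self.proj(pooled).view(-1, self.n_cond, self.d_cond)

mllm, projector = FrozenMLLM(), GenProjector()
cond = projector(mllm(torch.randint(0, 1000, (2, 16))))  # feed `cond` to a pixel decoder
```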