Ellis Brown Profile
Ellis Brown

@_ellisbrown

Followers: 670 · Following: 5K · Media: 27 · Statuses: 369

Intern @Meta FAIR, PhD Student @NYU_Courant w/ Profs @sainingxie @rob_fergus. Prev @Ai2Prior, @CarnegieMellon

NYC
Joined January 2016
@_ellisbrown
Ellis Brown
1 year
Cambrian-1 🪼 Through a vision-centric lens, we study every aspect of building Multimodal LLMs except the LLMs themselves. As a byproduct, we achieve superior performance at the 8B, 13B, 34B scales. 📄 https://t.co/e0e4LpWuOz 🌎 https://t.co/OP3gwG6FYE 🤗 https://t.co/P1vtX2UWkT
huggingface.co
@sainingxie
Saining Xie
1 year
Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
2
31
133
@codezakh
Zaid Khan
2 days
How can an agent reverse engineer the underlying laws of an unknown, hostile & stochastic environment in “one life”, without millions of steps + human-provided goals / rewards? In our work, we: 1️⃣ infer an executable symbolic world model (a probabilistic program capturing
2
35
81
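For orientation, a loose sketch of what an "executable symbolic world model" expressed as a probabilistic program could look like, in the spirit of the tweet above. Everything here (the grid world, the 0.8 success probability, the trap rule) is a made-up illustration, not the paper's code or environment.

```python
import random

# Hypothetical illustration: a world model written as an executable probabilistic
# program -- symbolic transition rules plus explicit stochasticity -- that an
# agent could infer from limited experience and then simulate offline.

def world_model(state: dict, action: str, rng: random.Random) -> dict:
    """One inferred transition rule: moving right succeeds 80% of the time,
    and stepping onto a trap cell ends the episode."""
    nxt = dict(state)
    if action == "move_right" and rng.random() < 0.8:   # stochastic outcome
        nxt["x"] = state["x"] + 1
    if (nxt["x"], nxt["y"]) in nxt["traps"]:            # symbolic rule
        nxt["alive"] = False
    return nxt

# Because the program is executable, the agent can roll it forward to plan
# without taking more risky steps in the real environment.
rng = random.Random(0)
s = {"x": 0, "y": 0, "traps": {(3, 0)}, "alive": True}
for _ in range(5):
    s = world_model(s, "move_right", rng)
print(s)
```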
@_ellisbrown
Ellis Brown
3 days
rest in px VAE. time of death: 10/13/2025
@sainingxie
Saining Xie
4 days
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
0
1
9
@boyangzheng_
Boyang Zheng
4 days
Introducing Representation Autoencoders (RAE)! We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)
6
55
479
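A minimal sketch of the RAE recipe as described in the tweets above, assuming the standard setup they imply: a frozen pretrained representation encoder (e.g. DINOv2 or SigLIP2) supplies the latent space, and only a ViT-style decoder is trained to map those features back to pixels. Module names are placeholders, not the released code.

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    """Representation Autoencoder sketch: frozen pretrained encoder + trained decoder.
    `pretrained_encoder` stands in for a DINOv2/SigLIP2-style model and
    `vit_decoder` for any ViT decoder mapping patch features to pixels."""
    def __init__(self, pretrained_encoder: nn.Module, vit_decoder: nn.Module):
        super().__init__()
        self.encoder = pretrained_encoder.eval()
        for p in self.encoder.parameters():   # the latent space is fixed, not learned
            p.requires_grad_(False)
        self.decoder = vit_decoder            # only the decoder is optimized

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            latents = self.encoder(images)    # semantic patch features act as "latents"
        return self.decoder(latents)          # reconstruct pixels

# Training sketch: a plain reconstruction loss on the decoder, e.g.
#   loss = F.mse_loss(rae(images), images)
# A diffusion transformer is then trained in this latent space instead of a VAE's.
```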
@ma_nanye
Willis (Nanye) Ma
4 days
Excited to introduce DiffuseNNX, a comprehensive JAX/Flax NNX-based library for diffusion and flow matching! It supports multiple diffusion / flow-matching frameworks, Autoencoders, DiT variants, and sampling algorithms. Repo: https://t.co/zOcA6nyrcM Delve into details below!
github.com
A comprehensive JAX/NNX library for diffusion and flow matching generative algorithms, featuring DiT (Diffusion Transformer) and its variants as the primary backbone with support for ImageNet train...
4
52
219
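For readers new to the topic, a generic flow-matching objective of the kind such libraries implement. This is plain PyTorch for illustration only, not DiffuseNNX's API; the straight-path (rectified flow) parameterization is assumed, and `model(xt, t)` is a hypothetical signature.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1: torch.Tensor) -> torch.Tensor:
    """Generic flow-matching loss for image batches (B, C, H, W): the model
    predicts the velocity transporting noise x0 to data x1 along the straight
    path x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)                              # noise sample
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)  # per-sample time
    xt = (1 - t) * x0 + t * x1                             # point on the path
    target_velocity = x1 - x0                              # d x_t / d t
    pred = model(xt, t.flatten())                          # hypothetical call signature
    return F.mse_loss(pred, target_velocity)
```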
@sainingxie
Saining Xie
4 days
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
55
325
2K
@__JohnNguyen__
John Nguyen
10 days
Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow
7
77
397
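As background for the tweet above, a loose sketch of the Transfusion-style recipe it references (one transformer trained with an autoregressive loss on text and a diffusion loss on images); it does not attempt to capture OneFlow's Edit Flow mechanism, and all names are illustrative rather than the papers' code.

```python
import torch
import torch.nn.functional as F

def combined_loss(text_logits: torch.Tensor,   # (B, L, vocab) from the shared transformer
                  text_targets: torch.Tensor,  # (B, L) next-token labels
                  noise_pred: torch.Tensor,    # predicted noise for image latents
                  noise: torch.Tensor,         # true noise added to image latents
                  lam: float = 1.0) -> torch.Tensor:
    """Single-transformer training sketch: next-token cross-entropy on text plus
    a diffusion (noise-prediction) MSE on image latents, weighted by `lam`."""
    ar_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    diff_loss = F.mse_loss(noise_pred, noise)
    return ar_loss + lam * diff_loss
```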
@_ellisbrown
Ellis Brown
10 days
.@ARRay693 is presenting SAT at #COLM2025 tuesday @ 11am! go find him at poster #70 to chat about spatial intelligence 🪐🪐🪐
@ARRay693
Arijit Ray
10 months
SAT provides free dynamic embodied experiences for models (that currently only see disembodied web data). Excited to share this work to advance research in creating more spatially aware AI models! 🤖
0
1
3
@ARRay693
Arijit Ray
10 months
SAT provides free dynamic embodied experiences for models (that currently only see disembodied web data). Excited to share this work to advance research in creating more spatially aware AI models! 🤖
@DJiafei
Jiafei Duan
10 months
🚀Excited to introduce our latest work- SAT: Spatial Aptitude Training, a groundbreaking approach to enhance spatial reasoning in Multimodal Language Models (MLMs). SAT isn't just about understanding static object positions but dives deep into dynamic spatial reasoning. 🧵
1
4
8
@TongPetersb
Peter Tong
2 months
Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench ( https://t.co/WqKlwLrWQJ) and Blink ( https://t.co/HLyogAYaTL), which repurpose core vision tasks into VQA format. These benchmarks do help
arxiv.org
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by...
@ziqiao_ma
Martin Ziqiao Ma
2 months
So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it. Most benchmarks have overwhelmingly focused on reasoning and digital domains, without
4
15
65
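To make "repurposing core vision tasks into VQA format" concrete, here is a made-up example item and scorer. The field names and the sample are hypothetical, not the actual CV-Bench or Blink schema.

```python
# Illustrative only: a core vision task (relative depth) recast as a
# multiple-choice VQA item, the format CV-Bench/Blink-style benchmarks use.
sample = {
    "image": "kitchen_scene.jpg",
    "task": "relative_depth",                 # originally a pure vision task
    "question": "Which object is closer to the camera, the mug or the kettle?",
    "choices": ["(A) the mug", "(B) the kettle"],
    "answer": "(A)",
}

def score(model_answer: str, item: dict) -> bool:
    """Exact-match scoring on the choice letter."""
    return model_answer.strip().startswith(item["answer"])
```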
@ShivamDuggal4
Shivam Duggal
3 months
Compression is the heart of intelligence. From Occam to Kolmogorov—shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, token budget T & target quality 𝜖, KARL finds the smallest t≤T to reconstruct it within 𝜖🧵
14
63
354
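A naive sketch of the objective the KARL tweet describes: the smallest token count t ≤ T whose reconstruction meets the quality target 𝜖. The brute-force search below is only for illustration (the learned method presumably avoids it), and `encode`/`decode` are hypothetical stand-ins for the tokenizer, not the paper's interface.

```python
import torch

def karl_style_token_count(image, encode, decode, T: int, eps: float) -> int:
    """Find the smallest t <= T such that decoding the first t tokens
    reconstructs the image within mean-squared error eps."""
    for t in range(1, T + 1):
        tokens = encode(image, t)                      # keep only the first t tokens
        recon = decode(tokens)
        err = torch.mean((recon - image) ** 2).item()  # reconstruction-quality proxy
        if err <= eps:
            return t                                   # shortest "program" that suffices
    return T                                           # budget exhausted: use all T tokens
```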
@_ellisbrown
Ellis Brown
3 months
cc @alfcnz
1
0
1
@_ellisbrown
Ellis Brown
3 months
impressive results! seems like an exciting route for inference-time scaling. also kudos for the intuitive explanations / visualizations — very accessible resources in the paper+blog for understanding how EBMs work 🙇‍♂️
@AlexiGlad
Alexi Gladstone
3 months
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
2
0
15
@AlexiGlad
Alexi Gladstone
3 months
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
46
259
2K
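For context on how energy-based models "think" at inference time, a generic sketch of prediction-as-optimization: start from a random candidate output and descend a learned energy. This is the standard EBM view, assumed here for illustration; it is not claimed to be the EBT paper's exact procedure.

```python
import torch

def ebm_style_inference(energy_fn, x: torch.Tensor, y_dim: int,
                        steps: int = 50, lr: float = 0.1) -> torch.Tensor:
    """Minimize the learned energy E(x, y) over the prediction y.
    More optimization steps correspond to spending more 'thinking' per input."""
    y = torch.randn(y_dim, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = energy_fn(x, y)      # scalar: low energy = compatible (x, y)
        energy.backward()
        opt.step()
    return y.detach()
```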
@mihirp98
Mihir Prabhudesai
4 months
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
@ShashwatGoel7
Shashwat Goel
5 months
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
14
52
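One common way to formalize the "maximizing confidence" signal mentioned above is to reward low entropy in the model's own output distribution. The sketch below is an assumption for illustration, not necessarily the exact objective used in the thread.

```python
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor) -> torch.Tensor:
    """Confidence as a reward: negative mean per-token entropy of the model's
    output distribution. logits: (seq_len, vocab) for the generated answer."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # per-token entropy
    return -entropy.mean()                                  # higher = more confident
```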
@mattdeitke
Matt Deitke
4 months
Molmo won the Best Paper Honorable Mention award @CVPR! This work was a long journey over 1.5 years, from failing to get strong performance with massive-scale, low-quality data, to focusing on modest-scale, extremely high-quality data! Proud to see what it became. #CVPR2025
18
18
227
@_ellisbrown
Ellis Brown
5 months
Honored to be recognized as a #CVPR2025 Outstanding Reviewer!
@CVPR
#CVPR2026
5 months
Behind every great conference is a team of dedicated reviewers. Congratulations to this year’s #CVPR2025 Outstanding Reviewers! https://t.co/z8w4YJKTep
0
0
34
@rob_fergus
Rob Fergus
5 months
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
17
22
349
@mattdeitke
Matt Deitke
6 months
I’m very excited to introduce Vy, the AI that sees and acts on your computer! It’s a first glimpse of what we’ve been working on at @Vercept_ai! Early computers trapped the world's best experts in low-level tasks–loading code, managing memory, fighting errors. Progress
12
19
78
@alexlioralexli
Alex Li
6 months
Excited to be presenting at #ICLR2025 at 10am today on how generative classifiers are much more robust to distribution shift. Come by to chat and say hello!
2
7
93
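A quick sketch of the generative-classifier idea referenced above: instead of a discriminative head, pick the class whose class-conditional generative model best explains the input (Bayes' rule). `log_px_given_y` is a hypothetical callable returning log p(x | y) as a scalar tensor.

```python
import torch

def generative_classify(log_px_given_y, x: torch.Tensor,
                        log_prior: torch.Tensor) -> int:
    """Return argmax_y [ log p(x | y) + log p(y) ] over classes y."""
    scores = torch.stack([log_px_given_y(x, y) + log_prior[y]
                          for y in range(len(log_prior))])
    return int(torch.argmax(scores))
```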
@xichen_pan
Xichen Pan
6 months
We find training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
9
67
417
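A minimal sketch of the frozen-MLLM recipe described above, under the natural reading of the tweet: keep the multimodal LLM frozen and train only a light projector (plus the image generator) that consumes its hidden states as conditioning. Module names are placeholders, not the released code.

```python
import torch
import torch.nn as nn

class FrozenMLLMToImage(nn.Module):
    """Frozen understanding model, trainable bridge to a pixel-output generator."""
    def __init__(self, mllm: nn.Module, generator: nn.Module,
                 d_mllm: int, d_cond: int):
        super().__init__()
        self.mllm = mllm.eval()
        for p in self.mllm.parameters():          # understanding model stays frozen
            p.requires_grad_(False)
        self.proj = nn.Linear(d_mllm, d_cond)     # trainable bridge
        self.generator = generator                # e.g. a diffusion decoder (trainable)

    def forward(self, inputs):
        with torch.no_grad():
            hidden = self.mllm(inputs)            # (B, L, d_mllm) hidden states
        cond = self.proj(hidden)                  # map to the generator's conditioning space
        return self.generator(cond)               # pixel output
```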