
Ellis Brown
@_ellisbrown
Followers 670 · Following 5K · Media 27 · Statuses 369
Intern @Meta FAIR, PhD Student @NYU_Courant w/ Profs @sainingxie @rob_fergus. Prev @Ai2Prior, @CarnegieMellon
NYC
Joined January 2016
Cambrian-1 🪼 Through a vision-centric lens, we study every aspect of building Multimodal LLMs except the LLMs themselves. As a byproduct, we achieve superior performance at the 8B, 13B, 34B scales. 📄 https://t.co/e0e4LpWuOz 🌎 https://t.co/OP3gwG6FYE 🤗 https://t.co/P1vtX2UWkT
huggingface.co
Introducing Cambrian-1, a fully open project from our group at NYU. The world doesn't need another MLLM to rival GPT-4V. Cambrian is unique as a vision-centric exploration & here's why I think it's time to shift focus from scaling LLMs to enhancing visual representations.🧵[1/n]
2
31
133
How can an agent reverse engineer the underlying laws of an unknown, hostile & stochastic environment in “one life”, without millions of steps + human-provided goals / rewards? In our work, we: 1️⃣ infer an executable symbolic world model (a probabilistic program capturing
2
35
81
rest in px, VAE. time of death: 10/13/2025
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
0
1
9
Introducing Representation Autoencoders (RAE)! We revisit the latent space of Diffusion Transformers, replacing VAE with RAE: pretrained representation encoders (DINOv2, SigLIP2) paired with trained ViT decoders. (1/n)
6
55
479
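A minimal sketch of the RAE idea as described in the announcement: a frozen pretrained representation encoder (e.g. DINOv2 / SigLIP2) supplies the latent space, and only a ViT-style decoder is trained to map those features back to pixels. Module names and shapes below are illustrative stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, PATCH = 384, 14

class FrozenRepEncoder(nn.Module):
    """Stand-in for a frozen pretrained representation encoder (e.g. DINOv2 / SigLIP2)."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, DIM, kernel_size=PATCH, stride=PATCH)
        layer = nn.TransformerEncoderLayer(DIM, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)          # frozen: the latent space is fixed

    def forward(self, x):
        tok = self.patchify(x)               # (B, DIM, gh, gw)
        gh, gw = tok.shape[-2:]
        return self.blocks(tok.flatten(2).transpose(1, 2)), (gh, gw)

class ViTDecoder(nn.Module):
    """The only trained part: maps encoder tokens back to pixels."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_pix = nn.Linear(DIM, 3 * PATCH * PATCH)

    def forward(self, tok, grid):
        gh, gw = grid
        out = self.to_pix(self.blocks(tok))                      # (B, N, 3*P*P)
        out = out.view(-1, gh, gw, 3, PATCH, PATCH)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(-1, 3, gh * PATCH, gw * PATCH)

enc, dec = FrozenRepEncoder(), ViTDecoder()
img = torch.rand(2, 3, 224, 224)
tok, grid = enc(img)
loss = F.mse_loss(dec(tok, grid), img)       # only the decoder receives gradients
loss.backward()
```

The latent tokens produced by the frozen encoder are what a diffusion transformer would then be trained on, in place of a VAE latent.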
Excited to introduce DiffuseNNX, a comprehensive JAX/Flax NNX-based library for diffusion and flow matching! It supports multiple diffusion / flow-matching frameworks, Autoencoders, DiT variants, and sampling algorithms. Repo: https://t.co/zOcA6nyrcM Delve into details below!
github.com
A comprehensive JAX/NNX library for diffusion and flow matching generative algorithms, featuring DiT (Diffusion Transformer) and its variants as the primary backbone with support for ImageNet train...
4
52
219
three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right. today, we introduce Representation Autoencoders (RAE). >> Retire VAEs. Use RAEs. 👇(1/n)
55
325
2K
Transfusion combines autoregressive with diffusion to train a single transformer, but what if we combine Flow with Flow? 🤔 🌊OneFlow🌊 the first non-autoregressive model to generate text and images concurrently using a single transformer—unifying Edit Flow (text) with Flow
7
77
397
.@ARRay693 is presenting SAT at #COLM2025 tuesday @ 11am! go find him at poster #70 to chat about spatial intelligence 🪐🪐🪐
SAT provides free dynamic embodied experiences for models (that currently only see disembodied web data). Excited to share this work to advance research in creating more spatially aware AI models! 🤖
0
1
3
SAT provides free dynamic embodied experiences for models (that currently only see disembodied web data). Excited to share this work to advance research in creating more spatially aware AI models! 🤖
🚀Excited to introduce our latest work- SAT: Spatial Aptitude Training, a groundbreaking approach to enhance spatial reasoning in Multimodal Language Models (MLMs). SAT isn't just about understanding static object positions but dives deep into dynamic spatial reasoning. 🧵
1
4
8
Want to add that even with language-assisted visual evaluations, we're seeing encouraging progress in vision-centric benchmarks like CV-Bench ( https://t.co/WqKlwLrWQJ) and Blink ( https://t.co/HLyogAYaTL), which repurpose core vision tasks into VQA format. These benchmarks do help
arxiv.org
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by...
So the key concern is: Using large language models to initialize vision-language(-action) models is a tempting trap — it lets us appear to make progress without truly achieving it. Most benchmarks have overwhelmingly focused on reasoning and digital domains, without
4
15
65
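For concreteness, here is a toy example of what "repurposing a core vision task into VQA format" can look like, in the spirit of CV-Bench / Blink. The field names and question template are made up for illustration, not either benchmark's actual schema.

```python
# Illustrative only: recast a relative-depth annotation as a 2-way VQA item.
def depth_order_to_vqa(image_path, closer_obj, farther_obj):
    """Turn a relative-depth label into a multiple-choice VQA question."""
    return {
        "image": image_path,
        "question": f"Which object is closer to the camera: the {closer_obj} or the {farther_obj}?",
        "choices": [closer_obj, farther_obj],
        "answer": closer_obj,   # ground truth comes from the original depth labels
    }

item = depth_order_to_vqa("kitchen_042.jpg", "mug", "window")
print(item["question"])
```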
Compression is the heart of intelligence. From Occam to Kolmogorov—shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, token budget T & target quality 𝜖, KARL finds the smallest t≤T to reconstruct it within 𝜖 🧵
14
63
354
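Reading the tweet literally, the selection criterion can be sketched as the search below: pick the smallest token count t ≤ T whose reconstruction error is within 𝜖. The `encode_to_tokens` / `decode_from_tokens` functions are hypothetical stand-ins for an adaptive tokenizer; how KARL actually realizes this is described in the paper.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

def smallest_sufficient_budget(x, encode_to_tokens, decode_from_tokens, T, eps):
    # Linear scan over budgets; a binary search works if error is monotone in t.
    for t in range(1, T + 1):
        tokens = encode_to_tokens(x, num_tokens=t)
        if reconstruction_error(x, decode_from_tokens(tokens)) <= eps:
            return t                 # Kolmogorov-style: shortest sufficient code
    return T                         # fall back to the full budget
```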
impressive results! seems like an exciting route for inference-time scaling. also, kudos for the intuitive explanations / visualizations — very accessible resources in the paper+blog for understanding how EBMs work 🙇‍♂️
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
2
0
15
How can we unlock generalized reasoning? ⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards. TLDR: - EBTs are the first model to outscale the
46
259
2K
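A minimal sketch of the energy-based "thinking" loop being referenced, as EBMs are generally described (my paraphrase, not the EBT paper's exact procedure): the model scores (context, candidate) pairs with a scalar energy, and inference refines the candidate by gradient descent on that energy, so extra refinement steps trade compute for quality — hence the inference-time-scaling angle.

```python
import torch
import torch.nn as nn

class ToyEnergyModel(nn.Module):
    """Toy energy function over (context, candidate) pairs."""
    def __init__(self, d=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * d, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, context, candidate):
        return self.net(torch.cat([context, candidate], dim=-1)).squeeze(-1)  # scalar energy

def think(model, context, steps=10, lr=0.1):
    y = torch.zeros_like(context, requires_grad=True)        # initial guess
    for _ in range(steps):                                    # more steps = more "thinking"
        energy = model(context, y).sum()
        (grad,) = torch.autograd.grad(energy, y)
        y = (y - lr * grad).detach().requires_grad_(True)     # descend the energy landscape
    return y.detach()

model = ToyEnergyModel()
ctx = torch.randn(4, 32)
answer = think(model, ctx, steps=20)
```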
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the Pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
14
52
Honored to be recognized as a #CVPR2025 Outstanding Reviewer!
Behind every great conference is a team of dedicated reviewers. Congratulations to this year’s #CVPR2025 Outstanding Reviewers! https://t.co/z8w4YJKTep
0
0
34
1/ Excited to share that I’m taking on the role of leading Fundamental AI Research (FAIR) at Meta. Huge thanks to Joelle for everything. Look forward to working closely again with Yann & team.
17
22
349
I’m very excited to introduce Vy, the AI that sees and acts on your computer! It’s a first glimpse of what we’ve been working on at @Vercept_ai! Early computers trapped the world's best experts in low-level tasks–loading code, managing memory, fighting errors. Progress
12
19
78
Excited to be presenting at #ICLR2025 at 10am today on how generative classifiers are much more robust to distribution shift. Come by to chat and say hello!
2
7
93
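For context, "generative classifier" here means classifying via class-conditional densities and Bayes' rule rather than learning p(y|x) directly. A toy sketch of that idea (Gaussian class-conditionals, not the models used in the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative_classifier(X, y):
    """Fit one Gaussian p(x|c) per class plus class priors p(c)."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    densities = {
        c: multivariate_normal(
            X[y == c].mean(0),
            np.cov(X[y == c].T) + 1e-3 * np.eye(X.shape[1]),  # regularize covariance
        )
        for c in classes
    }
    return classes, priors, densities

def predict(x, classes, priors, densities):
    # Bayes' rule: argmax_c log p(x|c) + log p(c)
    scores = [densities[c].logpdf(x) + np.log(priors[c]) for c in classes]
    return classes[int(np.argmax(scores))]
```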
We find training unified multimodal understanding and generation models is so easy, you do not need to tune MLLMs at all. MLLM's knowledge/reasoning/in-context learning can be transferred from multimodal understanding (text output) to generation (pixel output) even when it is FROZEN!
9
67
417
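One way to read that claim, as a hedged sketch with made-up module names: keep the MLLM entirely frozen and train only a small projector that maps its hidden states into the conditioning space of an image generator, so the understanding features drive pixel output without touching the MLLM's weights.

```python
import torch
import torch.nn as nn

class FrozenMLLM(nn.Module):
    """Stand-in for a frozen multimodal LLM that emits hidden states."""
    def __init__(self, d=512):
        super().__init__()
        self.embed = nn.Embedding(1000, d)
        self.block = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)
        for p in self.parameters():
            p.requires_grad_(False)                       # MLLM stays frozen

    def forward(self, token_ids):
        return self.block(self.embed(token_ids))          # (B, L, d) hidden states

class GenProjector(nn.Module):
    """The only trainable piece: hidden states -> image-generator conditioning."""
    def __init__(self, d=512, d_cond=256, n_cond=4):
        super().__init__()
        self.proj = nn.Linear(d, d_cond * n_cond)
        self.n_cond, self.d_cond = n_cond, d_cond

    def forward(self, h):
        pooled = h.mean(dim=1)                            # (B, d)
        return self.proj(pooled).view(-1, self.n_cond, self.d_cond)

mllm, projector = FrozenMLLM(), GenProjector()
cond = projector(mllm(torch.randint(0, 1000, (2, 16))))  # feed `cond` to a pixel decoder
```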