Jacob Phillips @jacob_dphillips X Profile

Jacob Phillips

@jacob_dphillips

Followers

806

Following

11K

Media

26

Statuses

248

Engineering Fellow @a16z, American Dynamism. prev ML @scale_AI, CTO @Themis_AI, AI + History @MIT

Joined April 2016

Jacob Phillips

@jacob_dphillips

3 months

We’re entering a new era in robotics where generalized systems are starting to work in the real world, but researchers still don’t have good tools for understanding their data. That’s why I built ARES, an open-source platform for ingesting, annotating, and curating robotics data.

14

32

163

Jacob Phillips

@jacob_dphillips

1 day

RT @davideasnaghi: Exciting news on @diodeinc published on Business Insider today. 1/ We raised capital! Over $14.5m, most recently in a $….

0

68

0

Jacob Phillips

@jacob_dphillips

3 days

RT @rmcentush: Conflicts are won not just by what we produce, but how fast we move it. Yet military logistics still run on spreadsheets and….

0

5

0

Jacob Phillips

@jacob_dphillips

4 days

RT @svlevine: I wrote a fun little article about all the ways to dodge the need for real-world robot data. I think it has a cute title. ht….

0

116

0

Jacob Phillips

@jacob_dphillips

22 days

RT @_mattfreed: Get in. We’ve got fields to tend

0

11

0

Jacob Phillips

@jacob_dphillips

1 month

RT @_ConnorSweeney: Her brain went 6 hours without oxygen before they could operate. On New Year's Eve in 2024, a 3mm-wide clump of cells….

0

9

0

Jacob Phillips

@jacob_dphillips

1 month

RoboArena from @pranav_atreya -- real-world, scalable benchmarking for robots! Another step towards infrastructure for robot learning, similar to @lmarena_ai

Jacob Phillips

@jacob_dphillips

3 months

I wrote a second piece on “How to Build ChatGPT for Robotics”, covering the history of robot data labeling, current best practices, and what the future holds for robots – across benchmarks, safety, red-teaming, and real-world deployment.

0

7

Jacob Phillips

@jacob_dphillips

1 month

RT @SeanHendryx: What will the learning environments of the future look like that train artificial super intelligence? In recent work at @s….

0

30

0

Jacob Phillips

@jacob_dphillips

1 month

RT @jsuarez5341: PufferLib 3.0: We trained reinforcement learning agents on 1 Petabyte / 12,000 years of data with 1 server. Now you can, t….

0

93

0

Jacob Phillips

@jacob_dphillips

1 month

RT @oyhsu: Want to tinker with robots but don't have one on hand? . @jacob_dphillips on our team @a16z built MALLET, a simple toolkit for a….

0

1

0

Jacob Phillips

@jacob_dphillips

1 month

MALLET provides a simple toolkit for anyone to become a robotics researcher. Check out the GitHub repo at . Thanks to @zhiyuan_zhou_ for setting up AutoEval and @oyhsu, @espricewright, and the rest of the @a16z American Dynamism team for their support.

github.com

Cloud-based tools and an evaluation harness for VLMs to control real-world robots - jacobphillips99/mallet

1

0

9

Jacob Phillips

@jacob_dphillips

1 month

However, VLMs are getting stronger and stronger in multimodal reasoning. MALLET helps VLMs achieve low error approaching that of an actual VLA robot policy! Using MALLET, we can also experiment with in-context learning by providing different amounts of historical observations to

1

0

3

Jacob Phillips

@jacob_dphillips

1 month

Does it work? Not quite yet -- VLMs really struggle with embodied, 3D visual problems like occlusion, or optical illusions like parallax effect. Here's an example of gemini-2.5-flash trying to "open the drawer". From the reasoning traces, we see that the model can't tell that it

1

0

5

Jacob Phillips

@jacob_dphillips

1 month

We host CPU and GPU servers on @modal_labs, enabling researchers to train and evaluate VLMs or vision-language-action (VLA) models. We can also use MALLET as an evaluation benchmark to test the spatial reasoning capabilities of VLMs in comparison to VLAs.

1

0

4

Jacob Phillips

@jacob_dphillips

1 month

Have you ever wondered if o4-mini could control a robot? Ever wanted to do robotics research, but didn't have any robots or GPUs? MALLET is a toolkit and benchmark for letting vision-language models like GPT-4o drive robots in the real-world. MALLET is built on top of

3

9

59

Jacob Phillips

@jacob_dphillips

1 month

@chris_j_paxton On learning from real-world deployments: "Most deployed robots are doing the same task, over and over again, in the same environment. So the pool of useful robots for learning “robot GPT” is going to be quite a bit lower."

0

3

Jacob Phillips

@jacob_dphillips

1 month

A great point from @chris_j_paxton in "It Can Think" this morning that a lot of people working in robot data collection tend to miss! This may actually be more bullish on robot learning from human videos.

2

16

Jacob Phillips

@jacob_dphillips

2 months

RT @espricewright: who is building American Dynamism and will be @CVPR in Nashville next week? . hit us up @oyhsu @jacob_dphillips @MillenA….

0

2

0

Jacob Phillips

@jacob_dphillips

2 months

Releasing updated data and datasets on @huggingface! Now compatible with @MLCommons Croissant metadata format.

huggingface.co

Jacob Phillips

@jacob_dphillips

3 months

We’re entering a new era in robotics where generalized systems are starting to work in the real world, but researchers still don’t have good tools for understanding their data. That’s why I built ARES, an open-source platform for ingesting, annotating, and curating robotics data.

1

0

16

Jacob Phillips

@jacob_dphillips

2 months

28 miles, 5k vertical feet of elevation gain, 4 tiny bass, 1 unknown skull

3

0

16

Jacob Phillips

@jacob_dphillips

2 months

The recent Sonnet release actually showed a small regression on MMMU, a visual reasoning benchmark, despite large advances in long-context reasoning for agentic coding and AIME. Excited to see better embodied reasoning benchmarks in the future!

Oliver Hsu

@oyhsu

2 months

Feels like there’s more discussion lately around evaluation criteria for physical reasoning abilities of AI. Maybe an extension of evaluating visual reasoning, but likely something wholly different. “The people yearn for benchmarks” — @jacob_dphillips.

0

1

6