Josh Greaves
@joshgreaves_ml
324 Followers · 132 Following · 4 Media · 52 Statuses
Tech Lead @withmartian | RL, LLMs, Routing, ML for Science | Ex-@google Brain @googledeepmind
Joined April 2025
The big labs are betting RL will unlock superhuman coding. But their infrastructure is closed, and OSS tooling doesn't support true online RL—just iterative batch optimization. We're releasing ARES to close that gap 🧵
Announcing ARES - our open-source Agentic Research and Evaluation Suite. ARES is built around 3 pillars (👇 see the thread) to make reinforcement learning for code agents easy. We’ve also found it to be incredibly useful for our own mech interp research.
Interested in more deep dives like this? Like or DM me. Should we do a full MI series with ARES? 👀
What does mechanistic interpretability look like in agent systems? @Narmeen29013644 walks through using ARES and TransformerLens to train linear probes and deploy steering vectors in a simple agent setup.
LLM agents ignore their own environment — skipping outputs, misreading tools, repeating failures. Turns out the model knows when it’s wrong. 87% probe accuracy from activations. How we found it, fixed it, and how to try it with ARES + TransformerLens 🧵
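The probe result above (87% accuracy from activations) can be illustrated with a toy sketch: train a linear (logistic-regression) probe on cached activation vectors to predict when the agent is about to go wrong. This is purely illustrative, with synthetic data standing in for residual-stream activations; the layer choice, shapes, and training rule are assumptions, not the actual ARES/TransformerLens pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend activations: (n_steps, d_model) vectors, with labels marking
# steps where the agent ignored its environment. Synthetic stand-in data.
X = rng.normal(size=(1000, 64))
w_true = rng.normal(size=64)
y = (X @ w_true > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(64)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

acc = ((X @ w > 0) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

In the real setup the features would come from a hooked forward pass (e.g. TransformerLens activation caching) rather than `rng.normal`, and the learned direction can then double as a steering vector.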
The most important trend in open-source agents right now: research logic is separating from infrastructure. SkyRL × Tinker decouples how you train from where you train. Excited to try this out! Congrats to @tyler_griggs_ and the @NovaSkyAI team 🚀
SkyRL now implements the Tinker API: training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: https://t.co/GAtW81jM38 🧵
ARES makes it easy to add new agents for RL training. Should we add OpenClaw next? 👀
Thinking of training an RL model specialized for @openclaw using ARES. So it gets better performance and much lower token costs. Is this something folks would be interested in using? 100 likes and I put up an endpoint
We'll be presenting the ARES roadmap at office hours tomorrow at 2pm PT. If you're interested in agents | RL | interp and want to contribute to open-source send me a DM for more info. https://t.co/tGymWgVm5b
Call for contributors: We're working on the roadmap for ARES and will be hosting office hours later this week to discuss. If you're interested in attending, send me a DM.
This is a preview of many more tasks to come for ARES!
ARES uses the Harbor task format ( @alexgshaw ). It comes with SWE-Bench Verified, TerminalBench2, SWESmith, and everything else in the Harbor ecosystem. We're also releasing 1k new JavaScript tasks with @VmaxAI ( @MavorParker @matthewjsargent ) to help the ecosystem grow.
RL progress is bottlenecked by infra for training and evaluation. @VmaxAI is excited to be partnering @withmartian, generating environments for the Agentic Research and Evaluation (ARES) framework
Join us on Discord: https://t.co/MVYSDQ931I Special shout out to everyone on the @withmartian team involved in this, and to our first community contributor @NithinSarva!
I'll be sharing more over the coming weeks: what we've learned using ARES internally at @withmartian, engineering deep-dives, and community projects. GitHub: https://t.co/tGymWgVm5b If you're building a coding agent, we'd love to see what you do with it.
To be clear about what ARES is and isn't: ✓ Infrastructure for online RL on coding agents ✓ Fast parallel evaluation ✓ Bring-your-own-agent design ✗ Not a pretrained agent ✗ Not yet proof that online RL wins—that's what we want to find out with the community!
Throughput is a bottleneck for RL training and evals. ARES is highly parallel: we evaluate all of SWE-Bench Verified in ~20 minutes using @daytonaio for remote sandboxing. Async gym-like API, built for remote sandboxes, hosted models, and distributed training.
This means you can swap code agent scaffolding without touching your training code. Iterate on your code agent, keep your RL setup stable. It's also the right interface for interpretability research on sequential decision-making, which is how we use it internally.
🧵 Mechanistic interpretability seems to be stuck in the short-horizon era, but models aren’t. As AI systems become agents that plan, act, observe, and adapt over many steps, our interpretability tools are falling behind. Here’s why long-horizon tasks are the next frontier 👇
The key design choice: ARES treats the LLM itself as the RL agent, not the scaffolding around it. Most agent libraries put the coding agent (tools + prompts + LLM) in the agent box. ARES pushes that into the environment. Observations are LLM requests, actions are LLM responses.
If you're building a coding agent and want to optimize it with RL, ARES is the infrastructure layer. Bring your agent. Bring your tasks. Bring your model. ARES handles sandboxing, parallelism, and the training interface.
We don't yet have definitive proof online RL will transform coding agents. But the structural arguments are strong, and we think the bottleneck is infrastructure, not ideas. ARES is intended to let researchers actually test this hypothesis.
Online RL tightens this loop. As the policy changes, the data distribution shifts in real time. This matters for long-horizon, sparse-reward problems (like coding) where exploration matters most; agents must navigate large codebases, recover from mistakes, and uncover solutions.
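The online-vs-batch distinction above can be made concrete with a deliberately trivial toy: in online RL every rollout is sampled from the latest policy, so the data distribution shifts with each update; batch optimization collects the whole dataset under a frozen policy first. The scalar "policy" and the update rule are toy assumptions, not an RL algorithm.

```python
def online_rl(policy, n, sample, update):
    # Each rollout uses the CURRENT policy; the data distribution
    # shifts after every update.
    for _ in range(n):
        traj = sample(policy)
        policy = update(policy, [traj])
    return policy

def batch_rl(policy, n, sample, update):
    # All rollouts come from the same frozen policy; a single update
    # happens only after the whole batch is collected.
    trajs = [sample(policy) for _ in range(n)]
    return update(policy, trajs)

# Toy instantiation: the policy is a number whose optimum is 1.0, and
# the update rule (which ignores the trajectories) moves 30% closer.
sample = lambda p: p
update = lambda p, trajs: p + 0.3 * (1.0 - p)

p_online = online_rl(0.0, 10, sample, update)
p_batch = batch_rl(0.0, 10, sample, update)
print(p_online, p_batch)
```

With the same rollout budget, the online loop compounds its updates (10 steps toward the optimum) while the batch loop spends the whole budget on data from the stale initial policy and takes one step.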