Josh Greaves
@joshgreaves_ml
324 Followers · 132 Following · 4 Media · 52 Statuses
Tech Lead @withmartian | RL, LLMs, Routing, ML for Science | Ex-@google Brain @googledeepmind
Joined April 2025
The big labs are betting RL will unlock superhuman coding. But their infrastructure is closed, and OSS tooling doesn't support true online RL—just iterative batch optimization. We're releasing ARES to close that gap 🧵
Announcing ARES - our open-source Agentic Research and Evaluation Suite. ARES is built around 3 pillars (👇 see the thread) to make reinforcement learning for code agents easy. We’ve also found it to be incredibly useful for our own mech interp research.
Interested in more deep dives like this? Like or DM me. Should we do a full MI series with ARES? 👀
What does mechanistic interpretability look like in agent systems? @Narmeen29013644 walks through using ARES and TransformerLens to train linear probes and deploy steering vectors in a simple agent setup.
LLM agents ignore their own environment — skipping outputs, misreading tools, repeating failures. Turns out the model knows when it’s wrong. 87% probe accuracy from activations. How we found it, fixed it, and how to try it with ARES + TransformerLens 🧵
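The probe result above (87% accuracy from activations) can be illustrated with a toy sketch: train a linear (logistic-regression) probe on cached activation vectors to predict when the agent is about to go wrong. This is purely illustrative, with synthetic data standing in for residual-stream activations; the layer choice, shapes, and training rule are assumptions, not the actual ARES/TransformerLens pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend activations: (n_steps, d_model) vectors, with labels marking
# steps where the agent ignored its environment. Synthetic stand-in data.
X = rng.normal(size=(1000, 64))
w_true = rng.normal(size=64)
y = (X @ w_true > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(64)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

acc = ((X @ w > 0) == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

In the real setup the features would come from a hooked forward pass (e.g. TransformerLens activation caching) rather than `rng.normal`, and the learned direction can then double as a steering vector.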
The most important trend in open-source agents right now: research logic is separating from infrastructure. SkyRL × Tinker decouples how you train from where you train. Excited to try this out! Congrats to @tyler_griggs_ and the @NovaSkyAI team 🚀
SkyRL now implements the Tinker API: training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: https://t.co/GAtW81jM38 🧵
ARES makes it easy to add new agents for RL training. Should we add OpenClaw next? 👀
Thinking of training an RL model specialized for @openclaw using ARES. So it gets better performance and much lower token costs. Is this something folks would be interested in using? 100 likes and I put up an endpoint
We'll be presenting the ARES roadmap at office hours tomorrow at 2pm PT. If you're interested in agents | RL | interp and want to contribute to open-source send me a DM for more info. https://t.co/tGymWgVm5b
Call for contributors: We're working on the roadmap for ARES and will be hosting office hours later this week to discuss. If you're interested in attending, send me a DM.
This is a preview of many more tasks to come for ARES!
ARES uses the Harbor task format ( @alexgshaw ). It comes with SWE-Bench Verified, TerminalBench2, SWESmith, and everything else in the Harbor ecosystem. We're also releasing 1k new JavaScript tasks with @VmaxAI ( @MavorParker @matthewjsargent ) to help the ecosystem grow.
RL progress is bottlenecked by infra for training and evaluation. @VmaxAI is excited to be partnering @withmartian, generating environments for the Agentic Research and Evaluation (ARES) framework
Join us on Discord: https://t.co/MVYSDQ931I Special shout out to everyone on the @withmartian team involved in this, and to our first community contributor @NithinSarva!
I'll be sharing more over the coming weeks: what we've learned using ARES internally at @withmartian, engineering deep-dives, and community projects. GitHub: https://t.co/tGymWgVm5b If you're building a coding agent, we'd love to see what you do with it.
To be clear about what ARES is and isn't: ✓ Infrastructure for online RL on coding agents ✓ Fast parallel evaluation ✓ Bring-your-own-agent design ✗ Not a pretrained agent ✗ Not yet proof that online RL wins—that's what we want to find out with the community!
Throughput is a bottleneck for RL training and evals. ARES is highly parallel: we evaluate all of SWE-Bench Verified in ~20 minutes using @daytonaio for remote sandboxing. Async gym-like API, built for remote sandboxes, hosted models, and distributed training.
This means you can swap code agent scaffolding without touching your training code. Iterate on your code agent, keep your RL setup stable. It's also the right interface for interpretability research on sequential decision-making, which is how we use it internally.
🧵 Mechanistic interpretability seems to be stuck in the short-horizon era, but models aren’t. As AI systems become agents that plan, act, observe, and adapt over many steps, our interpretability tools are falling behind. Here’s why long-horizon tasks are the next frontier 👇
The key design choice: ARES treats the LLM itself as the RL agent, not the scaffolding around it. Most agent libraries put the coding agent (tools + prompts + LLM) in the agent box. ARES pushes that into the environment. Observations are LLM requests, actions are LLM responses.
If you're building a coding agent and want to optimize it with RL, ARES is the infrastructure layer. Bring your agent. Bring your tasks. Bring your model. ARES handles sandboxing, parallelism, and the training interface.
We don't yet have definitive proof online RL will transform coding agents. But the structural arguments are strong, and we think the bottleneck is infrastructure, not ideas. ARES is intended to let researchers actually test this hypothesis.
Online RL tightens this loop. As the policy changes, the data distribution shifts in real time. This matters for long-horizon, sparse-reward problems (like coding) where exploration matters most; agents must navigate large codebases, recover from mistakes, and uncover solutions.
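The online-vs-batch distinction above can be made concrete with a deliberately trivial toy: in online RL every rollout is sampled from the latest policy, so the data distribution shifts with each update; batch optimization collects the whole dataset under a frozen policy first. The scalar "policy" and the update rule are toy assumptions, not an RL algorithm.

```python
def online_rl(policy, n, sample, update):
    # Each rollout uses the CURRENT policy; the data distribution
    # shifts after every update.
    for _ in range(n):
        traj = sample(policy)
        policy = update(policy, [traj])
    return policy

def batch_rl(policy, n, sample, update):
    # All rollouts come from the same frozen policy; a single update
    # happens only after the whole batch is collected.
    trajs = [sample(policy) for _ in range(n)]
    return update(policy, trajs)

# Toy instantiation: the policy is a number whose optimum is 1.0, and
# the update rule (which ignores the trajectories) moves 30% closer.
sample = lambda p: p
update = lambda p, trajs: p + 0.3 * (1.0 - p)

p_online = online_rl(0.0, 10, sample, update)
p_batch = batch_rl(0.0, 10, sample, update)
print(p_online, p_batch)
```

With the same rollout budget, the online loop compounds its updates (10 steps toward the optimum) while the batch loop spends the whole budget on data from the stale initial policy and takes one step.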