Josh Greaves Profile
Josh Greaves

@joshgreaves_ml

Followers
324
Following
132
Media
4
Statuses
52

Tech Lead @withmartian | RL, LLMs, Routing, ML for Science | Ex-@google Brain @googledeepmind

Joined April 2025
@joshgreaves_ml
Josh Greaves
22 days
The big labs are betting RL will unlock superhuman coding. But their infrastructure is closed, and OSS tooling doesn't support true online RL—just iterative batch optimization. We're releasing ARES to close that gap 🧵
@withmartian
Martian
22 days
Announcing ARES - our open-source Agentic Research and Evaluation Suite. ARES is built around 3 pillars (👇 see the thread) to make reinforcement learning for code agents easy. We’ve also found it to be incredibly useful for our own mech interp research.
9
28
225
@joshgreaves_ml
Josh Greaves
1 day
Interested in more deep dives like this? Like or DM me. Should we do a full MI series with ARES? 👀
0
0
5
@joshgreaves_ml
Josh Greaves
1 day
What does mechanistic interpretability look like in agent systems? @Narmeen29013644 walks through using ARES and TransformerLens to train linear probes and deploy steering vectors in a simple agent setup.
@Narmeen29013644
Narmeen Oozeer
1 day
LLM agents ignore their own environment — skipping outputs, misreading tools, repeating failures. Turns out the model knows when it’s wrong. 87% probe accuracy from activations. How we found it, fixed it, and how to try it with ARES + TransformerLens 🧵
1
2
33
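The probe-and-steer recipe above can be sketched in miniature. This is a hedged illustration, not the thread's actual code: the synthetic vectors below stand in for residual-stream activations you would cache with TransformerLens (e.g. via `model.run_with_cache`), and the planted "wrong-step" direction is an assumption made so the example is self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for cached activations: 64-d vectors where a fixed direction
# encodes "this step went wrong" (label 1) vs "this step was fine" (label 0).
d_model, n = 64, 500
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d_model)) + np.outer(labels * 2 - 1, direction)

# Linear probe: logistic regression from activations to the wrong/ok label.
probe = LogisticRegression(max_iter=1000).fit(acts[:400], labels[:400])
acc = probe.score(acts[400:], labels[400:])

# A steering vector can be read off the probe weights and added back to
# activations at the same hook point to push behavior along that direction.
steer = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```

On real activations the probe is trained per layer/hook point, and the steering vector is applied with a forward hook during generation.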
@joshgreaves_ml
Josh Greaves
7 days
The most important trend in open-source agents right now: research logic is separating from infrastructure. SkyRL × Tinker decouples how you train from where you train. Excited to try this out! Congrats to @tyler_griggs_ and the @NovaSkyAI team 🚀
@tyler_griggs_
Tyler Griggs
7 days
SkyRL now implements the Tinker API. Now, training scripts written for Tinker can run on your own GPUs with zero code changes using SkyRL's FSDP2, Megatron, and vLLM backends. Blog: https://t.co/GAtW81jM38 🧵
1
1
26
@joshgreaves_ml
Josh Greaves
11 days
ARES makes it easy to add new agents for RL training. Should we add OpenClaw next? 👀
@shriyashku
Shriyash Upadhyay
11 days
Thinking of training an RL model specialized for @openclaw using ARES. So it gets better performance and much lower token costs. Is this something folks would be interested in using? 100 likes and I put up an endpoint
0
1
23
@joshgreaves_ml
Josh Greaves
15 days
We'll be presenting the ARES roadmap at office hours tomorrow at 2pm PT. If you're interested in agents | RL | interp and want to contribute to open-source send me a DM for more info. https://t.co/tGymWgVm5b
github.com
Agentic Research and Evaluation Suite. Contribute to withmartian/ares development by creating an account on GitHub.
0
6
31
@joshgreaves_ml
Josh Greaves
17 days
Call for contributors: We're working on the roadmap for ARES and will be hosting office hours later this week to discuss. If you're interested in attending, send me a DM.
@MavorParker
Augustine Mavor-Parker
22 days
This is a preview of many more tasks to come for ARES!
@joshgreaves_ml
Josh Greaves
22 days
ARES uses the Harbor task format ( @alexgshaw ). It comes with SWE-Bench Verified, TerminalBench2, SWESmith, and everything else in the Harbor ecosystem. We're also releasing 1k new JavaScript tasks with @VmaxAI ( @MavorParker @matthewjsargent ) to help the ecosystem grow.
1
5
19
@MavorParker
Augustine Mavor-Parker
22 days
RL progress is bottlenecked by infra for training and evaluation. @VmaxAI is excited to be partnering with @withmartian, generating environments for the Agentic Research and Evaluation Suite (ARES).
7
30
74
@joshgreaves_ml
Josh Greaves
22 days
I'll be sharing more over the coming weeks: what we've learned using ARES internally at @withmartian, engineering deep-dives, and community projects. GitHub: https://t.co/tGymWgVm5b If you're building a coding agent, we'd love to see what you do with it.
1
0
8
@joshgreaves_ml
Josh Greaves
22 days
To be clear about what ARES is and isn't: ✓ Infrastructure for online RL on coding agents ✓ Fast parallel evaluation ✓ Bring-your-own-agent design ✗ Not a pretrained agent ✗ Not yet proof that online RL wins—that's what we want to find out with the community!
1
0
10
@joshgreaves_ml
Josh Greaves
22 days
Throughput is a bottleneck in RL training and evals. ARES is highly parallel: we evaluate all of SWE-Bench Verified in ~20 minutes using @daytonaio for remote sandboxing. Its async gym-like API is built for remote sandboxes, hosted models, and distributed training.
2
0
13
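The async gym-like pattern can be sketched with a mock environment. Everything here is illustrative: `MockEnv`, `rollout`, and their method names are assumptions standing in for ARES's actual API, and the `await asyncio.sleep(0)` marks where a real implementation would await a remote sandbox or hosted model.

```python
import asyncio

class MockEnv:
    """Illustrative async env; reset/step would hit a remote sandbox."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.steps = 0

    async def reset(self):
        await asyncio.sleep(0)  # placeholder for a remote call
        return f"llm-request:{self.task_id}"

    async def step(self, action):
        self.steps += 1
        done = self.steps >= 3
        return f"llm-request:{self.task_id}", (1.0 if done else 0.0), done

async def rollout(env, policy):
    obs, total, done = await env.reset(), 0.0, False
    while not done:
        obs, reward, done = await env.step(policy(obs))
        total += reward
    return total

async def main():
    # Because each step is awaitable, many sandboxes run concurrently:
    envs = [MockEnv(i) for i in range(64)]
    return await asyncio.gather(*(rollout(e, lambda o: "patch") for e in envs))

returns = asyncio.run(main())
```

The point of the async interface is that wall-clock time is dominated by the slowest rollout, not the sum of all rollouts.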
@joshgreaves_ml
Josh Greaves
22 days
This means you can swap code agent scaffolding without touching your training code. Iterate on your code agent, keep your RL setup stable. It's also the right interface for interpretability research on sequential decision-making, which is how we use it internally.
@Narmeen29013644
Narmeen Oozeer
1 month
🧵 Mechanistic interpretability seems to be stuck in the short-horizon era, but models aren’t. As AI systems become agents that plan, act, observe, and adapt over many steps, our interpretability tools are falling behind. Here’s why long-horizon tasks are the next frontier 👇
1
0
13
@joshgreaves_ml
Josh Greaves
22 days
The key design choice: ARES treats the LLM itself as the RL agent, not the scaffolding around it. Most agent libraries put the coding agent (tools + prompts + LLM) in the agent box. ARES pushes that into the environment. Observations are LLM requests, actions are LLM responses.
2
0
11
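That boundary choice can be made concrete with a toy sketch. Class and method names here are hypothetical, not ARES's real interface: the scaffold (tools + prompts) lives inside the environment, so the RL loop only ever sees LLM requests as observations and LLM responses as actions, and swapping scaffolds leaves the training code untouched.

```python
class ScaffoldEnv:
    """Illustrative env wrapping an agent scaffold; the RL boundary is the LLM call."""
    def __init__(self, scaffold):
        self.scaffold = scaffold

    def reset(self):
        # Observation = the LLM request the scaffold would send.
        return self.scaffold.first_request()

    def step(self, llm_response):
        # Action = the LLM response; the scaffold runs tools internally.
        return self.scaffold.handle(llm_response)

class MiniScaffold:
    """Toy scaffold: one tool round-trip, episode ends when a patch appears."""
    def first_request(self):
        return [{"role": "user", "content": "Fix the failing test."}]

    def handle(self, llm_response):
        done = "PATCH" in llm_response
        obs = [{"role": "user", "content": "tool output"}]
        return obs, float(done), done

env = ScaffoldEnv(MiniScaffold())
obs = env.reset()
obs, reward, done = env.step("PATCH diff --git a/f.py b/f.py")
```

Replacing `MiniScaffold` with any other scaffold changes nothing on the training side, which is the decoupling the tweet describes.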
@joshgreaves_ml
Josh Greaves
22 days
If you're building a coding agent and want to optimize it with RL, ARES is the infrastructure layer. Bring your agent. Bring your tasks. Bring your model. ARES handles sandboxing, parallelism, and the training interface.
1
0
10
@joshgreaves_ml
Josh Greaves
22 days
We don't yet have definitive proof online RL will transform coding agents. But the structural arguments are strong, and we think the bottleneck is infrastructure, not ideas. ARES is intended to let researchers actually test this hypothesis.
1
0
11
@joshgreaves_ml
Josh Greaves
22 days
Online RL tightens this loop. As the policy changes, the data distribution shifts in real time. This matters for long-horizon, sparse-reward problems (like coding) where exploration matters most; agents must navigate large codebases, recover from mistakes, and uncover solutions.
1
0
10
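The distribution-shift point can be seen in a minimal on-policy loop. This is a toy two-armed bandit, not a coding task: because the current policy generates every batch of data, the stream of experience shifts toward the better arm as the preferences update.

```python
import random

random.seed(0)

probs = [0.2, 0.8]   # true payoff probability per arm
pref = [0.0, 0.0]    # learned value estimate per arm
pulls = [0, 0]

def sample_arm():
    # Epsilon-greedy policy over current estimates (90% exploit).
    best = max(range(2), key=lambda a: pref[a])
    return best if random.random() < 0.9 else 1 - best

for _ in range(500):
    arm = sample_arm()                         # on-policy: current policy acts
    pulls[arm] += 1
    reward = 1.0 if random.random() < probs[arm] else 0.0
    pref[arm] += 0.1 * (reward - pref[arm])    # update from fresh data
```

After training, both the estimates and the empirical data distribution (`pulls`) have shifted toward the better arm; a fixed offline dataset could never exhibit this feedback loop.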