
Charles Arnal
@arnal_charles
Followers: 91 · Following: 71 · Media: 11 · Statuses: 24
Postdoc at @MetaAI, mathematician ENS, Cambridge, Inria, FAIR at Meta
Paris
Joined January 2023
TL;DR: Our work deciphers what makes tool use so effective for LLMs, improving our understanding of its widely observed practical benefits. 🧵/🧵
We then scale up our experiments by finetuning Llama3 and SmolLM instruct models and show that introducing new knowledge through in-weight learning severely degrades the models' existing capabilities, while tool-augmented learning promises scalability without forgetting. 10/🧵
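To make the comparison concrete, here is a minimal sketch of the protocol this describes; `finetune` and `evaluate` are placeholder callables and the benchmark names are illustrative, not the paper's exact setup.

```python
from typing import Callable, Iterable

def compare_forgetting(
    base_model,
    new_facts_in_weight: Iterable,   # (question, answer) pairs
    new_facts_in_tool: Iterable,     # (question, tool-query) pairs
    finetune: Callable,              # (model, data) -> finetuned model
    evaluate: Callable,              # (model, benchmark_name) -> score
):
    """Finetune the same base model in both formats and measure what is
    gained on the new facts vs. what is lost on existing capabilities."""
    results = {}
    for name, data in [("in_weight", new_facts_in_weight),
                       ("in_tool", new_facts_in_tool)]:
        model = finetune(base_model, data)
        results[name] = {
            "new_fact_accuracy": evaluate(model, "new_facts"),
            "prior_capability": evaluate(model, "general_benchmark"),
        }
    return results

# Toy usage with stub functions (a real harness would wrap actual models):
stub_finetune = lambda model, data: model
stub_evaluate = lambda model, benchmark: 0.5
print(compare_forgetting("base-model", [], [], stub_finetune, stub_evaluate))
```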
We validate our theoretical insights in a controlled setting by pretraining small Llama-3 models from scratch with in-weight and in-tool learning on factual databases. Tool-augmented recall requires fewer parameters than in-weight memorization. 9/🧵
Theory: 1) We demonstrate that the number of facts a model can store in its weights is fundamentally limited by its number of parameters; 2) we derive an upper bound showing that tool-augmented models can, in principle, retrieve an unbounded number of facts. 8/🧵
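A back-of-the-envelope version of the contrast between the two results; the constants and exact statements below are illustrative, not the paper's theorems.

```latex
% Illustrative counting argument, not the paper's exact bound.
% In-weight learning: P parameters stored with b bits each encode at most bP bits,
% so exact recall of N independent facts carrying H bits of entropy each requires
\[
  N\,H \;\le\; b\,P
  \qquad\Longrightarrow\qquad
  N \;\le\; \frac{b\,P}{H},
\]
% i.e. the number of storable facts grows at most linearly with the parameter count.
% In-tool learning: the model only needs to learn a fixed query format, so the
% number of retrievable facts is limited by the external database, not by P.
```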
In-weight learning: the model is trained to generate the answer directly from its parameters. 🛠️ In-tool learning: the model learns to issue a structured tool query that retrieves the value from an external database. 7/🧵
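Here is a toy illustration of the two supervision formats; the `<tool>lookup(...)</tool>` syntax and the field names are made up for this example, not the paper's templates.

```python
import re

# One fact and an external database holding it.
fact = {"subject": "Marie Curie", "attribute": "birth_year", "value": "1867"}
database = {("Marie Curie", "birth_year"): "1867"}

prompt = f"What is the {fact['attribute']} of {fact['subject']}?"

# In-weight learning: the supervision target is the answer itself,
# so the value has to be memorized in the model's parameters.
in_weight_target = fact["value"]

# In-tool learning: the supervision target is a structured query;
# the value stays in the external database and is fetched at inference time.
in_tool_target = f"<tool>lookup('{fact['subject']}', '{fact['attribute']}')</tool>"

def execute(model_output: str) -> str:
    """Toy executor: if the model emitted a tool call, answer from the database."""
    m = re.match(r"<tool>lookup\('(.+)', '(.+)'\)</tool>", model_output)
    if m:
        return database[(m.group(1), m.group(2))]
    return model_output  # in-weight case: the generated text is the answer

assert execute(in_tool_target) == execute(in_weight_target) == "1867"
```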
To highlight the differences between in-weight memorization and tool-augmented reasoning, we introduce a family of factual recall tasks inspired by Physics of LLMs (@ZeyuanAllenZhu), where datasets are finite collections of facts to be retrieved upon query. 6/🧵
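A minimal sketch of this kind of synthetic factual-recall dataset; the attributes, sizes, and question template are placeholders, and the paper's task family may differ.

```python
import random
import string

def random_name(rng: random.Random) -> str:
    return "".join(rng.choices(string.ascii_lowercase, k=8)).capitalize()

def make_database(num_entities: int, seed: int = 0) -> dict:
    """Finite collection of (entity, attribute) -> value facts."""
    rng = random.Random(seed)
    db = {}
    for _ in range(num_entities):
        name = random_name(rng)
        db[(name, "birth_year")] = str(rng.randint(1900, 2000))
        db[(name, "birth_city")] = random_name(rng)
        db[(name, "employer")] = random_name(rng)
        db[(name, "major")] = rng.choice(["math", "physics", "biology", "CS"])
    return db

def to_qa_pairs(db: dict) -> list[tuple[str, str]]:
    """Each fact becomes a query/answer pair to be recalled at test time."""
    return [(f"What is the {attr} of {name}?", value)
            for (name, attr), value in db.items()]

db = make_database(num_entities=1000)
print(len(to_qa_pairs(db)), "facts, e.g.", to_qa_pairs(db)[0])
```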
While the former is bounded by the model's capacity and sensitive to forgetting, the latter offers the potential for open-ended knowledge access and generalization. In our work, we provide a rigorous theoretical framework for understanding the benefits of tool use for LLMs. 5/🧵
These capabilities mark a shift away from in-weight learning (memorizing the solution to a problem within the model's weights) towards in-tool learning (learning to use a tool, e.g., a calculator or a database query, to solve a problem). 4/🧵
Recently, LLMs have evolved from static predictors into dynamic agents capable of reasoning, adapting, and acting over time. This has been enabled by advances in architecture and interaction design such as RAG (@PSH_Lewis et al., 2021) and Toolformer (@timo_schick et al., 2023). 3/🧵
🤝 Joint work with @AmbroiseOdonnat, Sam Houliston and Vivien Cabannes at @ETH_en and @AIatMeta. 2/🧵
🤔 Why is tool use so effective for LLMs? In our new work, we provide theoretical and empirical evidence that tool-augmented workflows are not just practical but also provably more scalable. 1/🧵
RT @KempeLab: Black-box Optimization for LLM Post-Training 💪. Strong non-vacuous generalization bounds ✔️. Privacy by design ✔️. Robustness to…
RT @gary_shiu: Excited to share with you an exciting project with Jacky Yip @UWMadPhysics and @arnal_charles and @f_charton @Meta where we…
Shout out to @syhw, @jadecopet, @TacoCohen, @KunhaoZ, @FabianGloeckle and @PierreChambon6 for their help with the code!
(8/8) Our paper also offers a complete theoretical analysis of these phenomena in a simplified setting, along with experiments in a controlled bandit setup that illustrate our findings.
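For intuition, here is a toy off-policy bandit experiment in the same spirit; the arm count, behavior policy, and baseline values are illustrative, and this tiny setup is not guaranteed to reproduce the paper's qualitative findings.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
true_means = rng.uniform(0.0, 1.0, size=K)      # Bernoulli reward probabilities

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Off-policy data: actions drawn once from a fixed uniform behavior policy.
actions = rng.integers(0, K, size=5000)
rewards = rng.binomial(1, true_means[actions]).astype(float)

def train(baseline, lr=0.1, batch=50):
    """REINFORCE with a constant baseline V on the fixed off-policy batch."""
    theta = np.zeros(K)
    for start in range(0, len(actions), batch):
        a = actions[start:start + batch]
        r = rewards[start:start + batch]
        pi = softmax(theta)
        grad = np.zeros(K)
        adv = r - baseline
        for ai, advi in zip(a, adv):
            # grad of log pi(a) wrt theta for a softmax policy: onehot(a) - pi
            grad += advi * (np.eye(K)[ai] - pi)
        theta += lr * grad / batch
    return softmax(theta)

for V in (-1.0, 0.0, 1.0):
    pi = train(V)
    print(f"V={V:+.1f}  greedy arm={pi.argmax()}  best arm={true_means.argmax()}")
```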
(7/8) In other words, one should learn more from others' successes than from their mistakes.
(6/8) Our experiments show that V < 0 (slightly more emphasis on good trajectories) leads to stable & efficient training in the off-policy setting, while letting V be 0 or positive leads to crashes.
(5/8) Our solution: **Asymmetric REINFORCE** (AsymRE). We add a reward baseline V:
- V < 0: more emphasis on rewarding good trajectories.
- V > 0: more emphasis on punishing bad trajectories.
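A minimal sketch of the corresponding loss, assuming a standard trajectory-level REINFORCE objective with a constant baseline; tensor shapes and the default V below are illustrative, not the paper's exact configuration.

```python
import torch

def asymmetric_reinforce_loss(logprobs: torch.Tensor,
                              rewards: torch.Tensor,
                              baseline: float = -0.1) -> torch.Tensor:
    """REINFORCE with a constant reward baseline V (illustrative sketch).

    logprobs: (batch, seq_len) log-probabilities of the sampled tokens
              under the trained policy.
    rewards:  (batch,) scalar reward per trajectory.
    baseline: the V of the thread. V < 0 shifts advantages upward, putting
              more weight on reinforcing good trajectories; V > 0 puts more
              weight on pushing down bad ones.
    """
    advantages = rewards - baseline                  # (batch,)
    seq_logprob = logprobs.sum(dim=-1)               # log-prob of each trajectory
    # Gradient ascent on E[(R - V) * log pi(trajectory)], written as a loss.
    return -(advantages.detach() * seq_logprob).mean()

# Toy usage with stand-in values:
logprobs = torch.randn(4, 16, requires_grad=True)   # placeholder for policy log-probs
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = asymmetric_reinforce_loss(logprobs, rewards, baseline=-0.1)
loss.backward()
```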
(4/8) However, standard REINFORCE (the simplest RL loss) often leads to instability & crashes in off-policy settings!
(3/8) Why off-policy RL? It's often simpler to implement than on-policy, especially with delayed rewards, & offers potential for greater data efficiency by allowing multiple passes over data.