Explore tweets tagged as #toolbench
🥳🛠️Introducing ToolBench!🤖🎉 🌟Large-scale instruction-tuning (SFT) data to equip LLMs with general tool-use capability 🔖 We release 98k data instances with 312k real API calls. We also release a capable model, ToolLLaMA, that matches ChatGPT in tool use. GitHub: https://t.co/z85AxpXxBx
New AWS paper shows a small language model can learn tool calling well enough to beat much larger models. It hits a 77.55% pass rate on ToolBench with 350M parameters. Tool calling means the model picks an application programming interface (API) and outputs the exact call format.
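To make the tweet's definition of tool calling concrete, here is a minimal sketch: the model names an API and emits the exact call as structured JSON, which the host program parses and dispatches. The `get_weather` API and the JSON schema are hypothetical illustrations, not the paper's actual format.

```python
import json

# Hypothetical model output: the chosen API plus its arguments in the exact call format.
model_output = '{"api_name": "get_weather", "arguments": {"city": "Seattle", "unit": "celsius"}}'

def execute_tool_call(raw: str, registry: dict):
    """Parse the model's emitted call and dispatch it to the matching API."""
    call = json.loads(raw)
    fn = registry[call["api_name"]]       # pick the API the model named
    return fn(**call["arguments"])        # invoke it with the emitted arguments

# Toy API registry standing in for real tool endpoints.
registry = {"get_weather": lambda city, unit: f"22 degrees {unit} in {city}"}
print(execute_tool_call(model_output, registry))
```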
- We test on 4 tasks from ToolBench - ToolVerifier outperforms few-shot baselines by 22% - Self-verification alone improves avg perf by 8% - Significantly better than Tool-Augmented LLMs - Outperforms GPT3.5-T & even GPT4 on some tasks despite being based on Llama 70B 🧵(2/4)
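The self-verification step credited above can be pictured as a propose-then-check loop: the model picks a tool, then a second prompt asks it to confirm the choice before committing. This is only an illustrative outline, not ToolVerifier's actual prompts or code; `generate` is a placeholder for any LLM completion call and the tool names are invented.

```python
TOOLS = ["calendar_search", "weather_lookup", "unit_converter"]

def generate(prompt: str) -> str:
    # Placeholder LLM call; swap in a real model client here.
    return "yes" if "Is this correct" in prompt else "weather_lookup"

def choose_tool_with_verification(task: str, max_retries: int = 2) -> str:
    # Propose a tool, then let the model verify its own choice before committing.
    candidate = generate(f"Task: {task}\nTools: {TOOLS}\nPick one tool:")
    for _ in range(max_retries):
        verdict = generate(f"Task: {task}\nChosen tool: {candidate}\nIs this correct? yes/no:")
        if verdict.strip().lower().startswith("yes"):
            return candidate
        candidate = generate(f"Task: {task}\nTools: {TOOLS}\nPrevious pick {candidate} was rejected. Pick again:")
    return candidate

print(choose_tool_with_verification("What's the forecast in Paris tomorrow?"))
```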
🦾Today's LLM paper review ProTIP: Progressive Tool Retrieval Improves Planning (Dec 2023) by Apple introduces a contrastive learning approach for AI agents. ProTIP outperforms ChatGPT on the ToolBench dataset, achieving 24% higher Recall@K=10 in tool retrieval and 41%
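Recall@K=10, the retrieval metric cited above, is just the fraction of queries whose gold tool shows up among the top K retrieved candidates. A small sketch with made-up retrieval results (not ProTIP's outputs):

```python
def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int = 10) -> float:
    """Fraction of queries whose gold tool appears in the top-k retrieved tools."""
    hits = sum(1 for ranked, target in zip(retrieved, gold) if target in ranked[:k])
    return hits / len(gold)

# Toy example: two queries, each with a ranked list of retrieved tools.
retrieved = [["tool_a", "tool_b", "tool_c"], ["tool_d", "tool_a"]]
gold = ["tool_b", "tool_x"]
print(recall_at_k(retrieved, gold, k=10))  # 0.5: only the first query's gold tool is retrieved
```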
Ready, set, install. 🛠️Service Caster has the wheels, hardware, and know-how to get your carts and equipment rolling fast. #ServiceCaster #ToolBench #BuiltToRoll
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. ToolLLaMA exhibits comparable performance to ChatGPT. repo: https://t.co/J5V19NtpUw abs: https://t.co/DzVjlJomiZ
Bigger Models Aren’t Better Agents. This Paper Proves It. Most teams assume agent quality scales with model size. This paper shows the opposite. AWS researchers fine-tuned a 350M parameter model (OPT-350M) specifically for tool calling. On ToolBench, it outperformed models
Minimalistic, extremely fast, and hackable researcher's toolbench for GPT models in 307 lines of code. Reaches <3.8 validation loss on wikitext-103 on a single A100 in <100 seconds. https://t.co/Jxhcg6Nfk7
DeepAgent absolutely destroys other agents across every benchmark. It beats ReAct-GPT-4o, CodeAct, and WebThinker on both: → Tool use tasks (ToolBench, Spotify, TMDB) → Real-world apps (WebShop, GAIA, HLE)
🚀 🔥ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs🔥🚀 - Facilitates tool-use capabilities within open-source LLMs. - A general tool-use framework covering data construction, model training, and evaluation. - ToolBench, an instruction-tuning dataset
Record no 110 large vise benchtop https://t.co/WBjOAvGU3V
#chickencoop #signs #ChickenDaddy #Etsy #woodstove #Toolbench
AI Agent Systems: Architectures, Applications, and Evaluation - 5-part "agent transformer" unifies agent design (policy, memory, tools, verifiers, environment) - 7-metric evaluation + 5 benchmark suites (AgentBench, WebArena, ToolBench, SWE-bench, GAIA) standardize assessment
Fragment 15 online. Telemetry sync in progress. Steering core receiving update. Toolbench uplink confirmed. #PST #Fragment15
#ToolbenchSync #Telemetry
#SteeringCore #CryptoRacing
#ShadowGrid #ModularLore
#Web3 #FactionTraction #pitstoptoken #NFTs #nft
Agent-based manipulation of APIs using #LLMs is a popular approach, but consistent and reliable evaluation metrics to assess this have been lacking. In this poster, we introduce a set of benchmarks called ToolBench and evaluate multiple open-source #LLMs. @SambaNovaAI Researcher
AWS researchers just published a paper on arXiv - A 350M model fine-tuned for 1 epoch on ToolBench (~187k examples) reports a 77.5% pass rate on ToolBench/ToolEval. Excited to test this pattern in @RunAnywhereAI (local-first agents + fallback)
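The recipe referenced above, roughly one epoch of supervised fine-tuning of a ~350M causal LM on (instruction → API call) pairs, can be sketched as below. The training pair, hyperparameters, and text formatting are placeholders, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# OPT-350M as a representative ~350M-parameter base model.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Hypothetical (instruction, API call) pair standing in for the full dataset.
pairs = [("Find flights from NYC to SFO on May 3.",
          'search_flights(origin="NYC", destination="SFO", date="2024-05-03")')]

model.train()
for instruction, api_call in pairs:                  # one pass over the toy set ~ "1 epoch"
    text = f"Instruction: {instruction}\nCall: {api_call}"
    batch = tokenizer(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])  # causal LM loss on the full sequence
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```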
AWS research showed a small 350M parameter model, fine-tuned for tool use, hitting a 77.55% success rate on ToolBench, dramatically outperforming GPT-4-class models, which stalled around 26%. This confirms that for high-precision, task-specific execution, Small Language Models
OpenBMB/ToolBench: An open platform for training, serving, and evaluating large language models for tool learning. https://t.co/zbGessquoT