
W&B Weave
@weave_wb
Followers
1K
Following
884
Media
152
Statuses
498
A lightweight toolkit for tracking and evaluating LLM applications, built by @weights_biases for AI developers!
Land of GPUs
Joined October 2024
Your RL run just spiked at step 89! But do you know why? We’re fixing that. Today we’re launching W&B Weave Traces to give you a step-by-step look into your agent’s decisions. This is the first drop from our fresh new integration with @OpenPipeAI. More RL magic is incoming.
2
32
356
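For a sense of what a trace looks like in code, here's a minimal Weave tracing sketch (not the OpenPipe integration itself): functions decorated with @weave.op are recorded as spans, and the project name and toy policy below are made up for illustration.

```python
import weave

weave.init("rl-agent-traces")  # hypothetical project name

@weave.op
def choose_action(observation: list[float]) -> int:
    # Stand-in policy; Weave records the inputs and the returned action.
    return 0 if sum(observation) < 0 else 1

@weave.op
def rollout_step(step: int, observation: list[float]) -> dict:
    # Nested @weave.op calls appear as child spans, giving the
    # step-by-step view of each decision.
    action = choose_action(observation)
    return {"step": step, "action": action}

rollout_step(89, [0.1, -0.4, 0.2])
```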
Stop juggling tabs to test your prompts! 🥵 The W&B Weave Playground is your new home for iterating on and comparing LLMs. And did you know... you can now generate images right in the Playground? Just search "image" in the model dropdown!
1
2
3
Best use of @weave_wb: Popstar, by @ax_xiong73047, @sidk_94827, @drdannenhauer & Zohreh Dannenhauer. They created a "survival of the fittest" environment for learning strategies. An LLM proposes new reward functions & PPO tweaks, and an algorithm ensures only the best adaptations
2
2
7
We asked builders at WeaveHacks 2 to push the limits of self-improving AI agents, and they delivered. With 175+ builders & 66 teams, this was our hardest hackathon to judge EVER. Now, meet the winners who took home over $20K in cash and prizes. 🧵
5
7
105
What's special about @karpathy's nanochat is that it has the entire LLM lifecycle in one repo. A full-stack recipe for your own ChatGPT clone for ~$100. So cool for us to see wandb included for metric logging across the pre-, mid-, and RL training stages.
Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single,
1
9
69
Watch out for the @OpenPipeAI announcement today on the @thursdai_pod around 1 hour in! 👀
POD UP: Covering my 3rd @OpenAIDevs Day in a row, and this one includes a few questions from me to @sama and @gdb from an exclusive fireside chat + a full breakdown of what they ⛴️, interview with @pvncher, Samsung's 7M TRM that beats the giants & building agents with AgentKit,
0
0
1
Seats are very limited. Register now to save your spot:
wandb.ai
In this session, we'll explore the fastest and most reliable ways to make LLMs solve real-world business problems. We'll dive deep into LLM orchestration, agent frameworks, and other similar systems...
0
0
0
RL X-mas came early. 🎄 For too long, building powerful AI agents with Reinforcement Learning has been blocked by GPU scarcity and complex infrastructure. That ends today. Introducing Serverless RL from wandb, powered by @CoreWeave! We're making RL accessible to all.
9
17
153
Want to experiment with the top open-source models? We're giving away $50 in inference credits! To get them, just comment "RAG" below our first tweet. See all of our available models here:
docs.wandb.ai
Browse the foundation models available through W&B Inference
We all know RAG is powerful, but how do retrieval depth and model choice really interact? Does retrieving more documents always improve accuracy, or does it just introduce noise and inflate costs? We ran the experiments on @weave_wb to find the precise trade-offs. 🧪
0
0
1
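If you want to try those models from code, W&B Inference speaks the OpenAI-compatible chat API. Here's a minimal sketch; the base URL, project name, and model ID are illustrative, so verify them against docs.wandb.ai before use.

```python
from openai import OpenAI

# W&B Inference exposes an OpenAI-compatible endpoint; the base URL,
# project, and model ID below are illustrative placeholders.
client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",
    api_key="<your-wandb-api-key>",
    project="my-team/my-project",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",  # placeholder model ID
    messages=[{"role": "user", "content": "Explain RAG in one sentence."}],
)
print(resp.choices[0].message.content)
```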
See our full case study with details here:
wandb.ai
Publish your model insights with interactive plots for performance metrics, predictions, and hyperparameters. Made by Brett Young using Weights & Biases
1
0
2
The key takeaway: Optimizing a RAG pipeline is a balancing act. You have to co-design your retrieval strategy and generation model. Using W&B Weave is crucial for visualizing these trade-offs and finding the most efficient configuration for your use case.
1
0
2
This is where @weave_wb was critical. It was our complete evaluation toolkit. It gave us a unified dashboard to compare experiments, let us drill down into individual predictions to debug errors, and made the complex trade-offs between cost, latency, and accuracy clear.
1
0
2
The results were fascinating. The DeepSeek model achieved its highest correctness (~77%) with just 5 retrieved passages. The GLM-4.5 model required 10 passages to reach that same score. → This shows optimal context size is model-specific; more isn't always better.
1
0
2
For generation, we systematically tested popular open-source models using our W&B Inference service. A separate judge model evaluated correctness, while W&B Weave tracked accuracy, cost, and latency for every single run.
1
0
3
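As a rough sketch of that setup with Weave's Evaluation API: the dataset, model stub, and scorer below are invented for illustration, and the LLM judge is replaced by an exact-match check to keep the example self-contained and offline.

```python
import asyncio
import weave

weave.init("rag-eval-demo")  # hypothetical project name

# Tiny stand-in dataset; the real runs covered far more examples.
dataset = [
    {"question": "What does RAG stand for?",
     "expected": "Retrieval-Augmented Generation"},
]

@weave.op
def rag_model(question: str) -> str:
    # Stand-in for retrieval + generation; Weave logs each call.
    return "Retrieval-Augmented Generation"

@weave.op
def correctness(expected: str, output: str) -> dict:
    # The thread used a separate judge model here; exact match keeps
    # this sketch runnable without an API key.
    return {"correct": output.strip().lower() == expected.strip().lower()}

evaluation = weave.Evaluation(dataset=dataset, scorers=[correctness])
asyncio.run(evaluation.evaluate(rag_model))
```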
For a quick refresher: RAG (Retrieval-Augmented Generation) fights LLM limits like outdated knowledge & hallucinations. It first retrieves relevant info from a knowledge base, then uses that context to generate a grounded, accurate, and cost-effective answer.
1
1
3
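That two-step flow can be sketched in a few lines. This toy version uses keyword-overlap retrieval and a stubbed generator; everything here is illustrative rather than a production pipeline.

```python
# Minimal RAG sketch (toy retrieval, stubbed generator) -- illustrative only.
KNOWLEDGE_BASE = [
    "W&B Weave traces and evaluates LLM applications.",
    "Retrieval-augmented generation grounds answers in retrieved context.",
    "PPO is a common policy-gradient algorithm in RL.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # Toy lexical scoring: rank passages by word overlap with the query.
    q = set(query.lower().split())
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda p: -len(q & set(p.lower().split())))
    return ranked[:top_k]

def generate(query: str, passages: list[str]) -> str:
    # In a real pipeline this would be an LLM call with the passages
    # injected into the prompt; here we just assemble that prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer '{query}' using only:\n{context}"

print(generate("What does Weave do?", retrieve("What does Weave do?")))
```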
We all know RAG is powerful, but how do retrieval depth and model choice really interact? Does retrieving more documents always improve accuracy, or does it just introduce noise and inflate costs? We ran the experiments on @weave_wb to find the precise trade-offs. 🧪
8
39
540