Shawn Lewis Profile
Shawn Lewis

@shawnup

Followers
2K
Following
1K
Media
61
Statuses
494

Founder & CTO @weights_biases. Building tools for AI.

Joined March 2011
@shawnup
Shawn Lewis
4 months
My o1-based AI programming agent is now state of the art on SWE-Bench Verified! It resolves 64.6% of issues. This is the first fully o1-driven agent we know of. And we learned a ton building it.
52
162
1K
@shawnup
Shawn Lewis
4 months
Our SWE-Bench submission has been accepted and is officially SOTA! Thanks SWE-Bench team for making such an important benchmark.
12
18
272
@shawnup
Shawn Lewis
4 months
How it works:
• o1 with reasoning_mode high for all agent step and editing logic
• a gpt4o based memory component that compresses the agent’s step history
• a custom built python code editor toolset designed to efficiently use model context
• the ability to register.
7
5
161
@shawnup
Shawn Lewis
5 months
New result for my pure o1-based agent: 57.4% pass@1 on SWEBench-Verified! Avg cost: $7.5 per instance. Avg time: 13.5 minutes per instance. Pass@3 is 67.8%. Now I'm working on "test time compute scaling", i.e. combining/choosing the best trajectories, to push closer to this mark.
5
16
142
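For readers unfamiliar with the pass@1 and pass@3 figures in the post above, here is a minimal sketch of one common way to compute such a metric: an instance counts as passed at k if at least one of its first k rollouts resolved it. The function name, the instance names, and the outcomes are all hypothetical, and the post does not say exactly how its numbers were computed.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of instances resolved by at least one of the first k rollouts."""
    if not results:
        raise ValueError("results must contain at least one instance")
    solved = sum(any(attempts[:k]) for attempts in results.values())
    return solved / len(results)


# Hypothetical rollout outcomes (True = issue resolved), not real benchmark data.
results = {
    "instance-a": [True, True, False],
    "instance-b": [False, True, False],
    "instance-c": [False, False, False],
    "instance-d": [False, False, True],
}

print(pass_at_k(results, 1))  # 0.25
print(pass_at_k(results, 3))  # 0.75
```

Under this definition pass@k can only rise as k grows, which is consistent with the gap between the pass@1 and pass@3 numbers quoted in the post.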
@shawnup
Shawn Lewis
1 year
I'm very excited to announce Weave, our new tools to track and evaluate your LLM apps. Use Weave to:
🍩 log and version LLM interactions and surrounding data, from development to production
🍩 experiment with prompting techniques, model changes, and parameters
🍩 evaluate your
4
35
135
@shawnup
Shawn Lewis
3 years
Coming soon to a notebook near you: This is our Table visualizer, powered by a new technology for building composable applications called Weave.
3
18
128
@shawnup
Shawn Lewis
4 months
Here's my writeup on the solution and how we did it: Read for o1 tips and lots of other nuggets.
3
11
121
@shawnup
Shawn Lewis
10 months
@iruletheworldmo You said: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
6
1
113
@shawnup
Shawn Lewis
4 months
o1 is a different beast. It's better at doing exactly what you say. It's better at solving hard coding problems. And the advice others have given to specify the outcome you want and give it room to operate is spot on.
3
3
108
@shawnup
Shawn Lewis
10 months
@iruletheworldmo For readers, there were just more than 2000 people in a Twitter space for 1 hour, with @iruletheworldmo promising to speak, many well-respected folks in the space. 🍓 did not speak. Conclusion: do not waste your time.
1
1
78
@shawnup
Shawn Lewis
4 years
We’ve been hard at work building Tables, a new way to organize, understand and improve your data. Today we're opening it up to everyone! Try it here:
3
26
76
@shawnup
Shawn Lewis
4 months
And I built a new typescript-based agent framework called phaseshift that's deeply integrated with Weave. I'm excited to polish it up and release it to the world!
5
2
74
@shawnup
Shawn Lewis
10 months
@iruletheworldmo @ChatGPTapp here's your receipt: "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins".
1
0
67
@shawnup
Shawn Lewis
3 months
I’m incredibly proud of everything our team at @weights_biases has accomplished, and excited to keep building with the amazing folks from @CoreWeave!
@weights_biases
Weights & Biases
3 months
Today we announced that we are being acquired by @CoreWeave, the AI Hyperscaler. 🪄🐝 We could not be prouder or more excited to join forces with this team. Our CEO, @l2k, wrote a blog post with more details:
6
4
71
@shawnup
Shawn Lewis
4 months
I built a new "Eval Studio" along the way that I can't live without now. The concepts from this will make their way into Weave
2
4
57
@shawnup
Shawn Lewis
10 months
@iruletheworldmo "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins" Tap this sign.
0
0
52
@shawnup
Shawn Lewis
10 months
@iruletheworldmo @elonmusk uh-huh "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
1
0
56
@shawnup
Shawn Lewis
4 months
Every time my tools improved, my progress accelerated. Our Weave toolkit got better every week as I did this. In particular the new playground with first-class support for running multiple trials was killer.
2
0
56
@shawnup
Shawn Lewis
6 months
Wow! Google's new Willow quantum chip: "Second, Willow performed a standard benchmark computation in under five minutes that would take one of today’s fastest supercomputers 10 septillion (that is, 10^25) years, a number that vastly exceeds the age of the Universe."
6
11
52
@shawnup
Shawn Lewis
4 months
This is our first step into the world of AI programming tools, and we can't wait to help our customers build more, faster, with these capabilities. Follow along for updates!
3
0
51
@shawnup
Shawn Lewis
2 years
I'll be on stage at Fully Connected tomorrow, to launch something we've been working on for a very long time. Here's a little peek. Join us!
0
14
46
@shawnup
Shawn Lewis
5 months
Work smarter not harder. Subset of SWEBench-Verified, with my little agent framework:
o1, reasoning_effort=high. 57/100 resolved. $1438.
o1, reasoning_effort=medium. 46/100 resolved. $1518.
6
5
46
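To make the reasoning_effort comparison above concrete, here is a small worked calculation of cost per resolved issue. It assumes the dollar figures in the post are total costs for each 100-instance run, which the post does not state explicitly; the function name is illustrative.

```python
# Figures quoted in the post: (issues resolved out of 100, run cost in USD).
# Assumption: each dollar amount is the total cost of the whole 100-instance run.
runs = {
    "o1, reasoning_effort=high": (57, 1438.0),
    "o1, reasoning_effort=medium": (46, 1518.0),
}


def cost_per_resolved(resolved: int, total_cost: float) -> float:
    """Total spend divided by the number of issues actually resolved."""
    return total_cost / resolved


for name, (resolved, total_cost) in runs.items():
    per_issue = cost_per_resolved(resolved, total_cost)
    print(f"{name}: ${per_issue:.2f} per resolved issue")
```

On these numbers the high-effort run comes out at roughly $25 per resolved issue versus $33 for medium, so it is both cheaper overall and cheaper per success.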
@shawnup
Shawn Lewis
2 years
It's 2023, and you’re still talking about no-code? Weave is the world’s first “pro-code” toolkit, a UI for programmers.
1
15
43
@shawnup
Shawn Lewis
10 months
@iruletheworldmo set this straight: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
0
0
36
@shawnup
Shawn Lewis
4 months
It looks like there's a new SOTA submission for SWE-Bench Verified from isoform, with 70.6% resolved. That is a very impressive result. There's no information about how it works. What is isoform? The website is minimal and cryptic. @bozhao tell us more!
2
0
36
@shawnup
Shawn Lewis
1 year
gpt-4o gets slightly better accuracy than the most recent gpt-4 on HumanEval, but 4x faster. Real-time voice is next level, but going this fast on just text is amazing by itself. 3 HumanEval trials. gpt-4-2024-04-09 -> gpt-4o-2024-05-13
87.8% -> 89.2% accuracy
9.37s -> 2.39s
1
5
35
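The "4x faster" claim in the HumanEval post above can be checked directly from the two latency figures it quotes. This is only arithmetic on the numbers in the post; the variable names are illustrative, and the post does not say what the latencies were measured over.

```python
# Numbers quoted in the post for gpt-4-2024-04-09 -> gpt-4o-2024-05-13.
old_latency_s, new_latency_s = 9.37, 2.39
old_accuracy_pct, new_accuracy_pct = 87.8, 89.2

speedup = old_latency_s / new_latency_s
accuracy_gain_points = new_accuracy_pct - old_accuracy_pct

print(f"speedup: {speedup:.2f}x")
print(f"accuracy gain: {accuracy_gain_points:.1f} percentage points")
```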
@shawnup
Shawn Lewis
10 months
I've been coding with my own command-line agent every day for the last year, and now you can too: `pip install programmer` 🧵
1
11
35
@shawnup
Shawn Lewis
2 years
@zebulgar Go ahead and build OpenAI-style talent density in Miami, lol.
2
0
28
@shawnup
Shawn Lewis
4 months
Aha! Just cracked the code for getting o3-mini to work as a programming agent.
10
0
31
@shawnup
Shawn Lewis
3 months
@MartinShkreli @sesame @brendaniribe This is an amazing new medium of entertainment.
1
0
30
@shawnup
Shawn Lewis
5 months
Next result for my pure o1-based agent: 61.0% on SWEBench-Verified! Avg cost: $15 per instance. Avg time: ~18 minutes per instance. My top-secret (I hope to share in a blogpost soon) cross-check algorithm automatically chooses the best of two parallel runs. Next up, 3 runs.
2
1
28
@shawnup
Shawn Lewis
2 months
Claude Code + WebVR + Apple Vision Pro = Holodeck!!! Here's a 7 minute video of me vibe coding with this setup. There's so much potential here and this is just the start.
4
3
28
@shawnup
Shawn Lewis
2 months
Why am I debugging AI’s code instead of AI debugging my code?
5
3
26
@shawnup
Shawn Lewis
4 months
@zied_houidi Nice to see this systematically confirmed. I wrote about this issue here: Did you try any prompting techniques to fix it?
2
0
22
@shawnup
Shawn Lewis
7 months
Instant charts have been achieved internally.
1
9
23
@shawnup
Shawn Lewis
3 years
Our new Weave expression editor. Powerful, pluggable, auto-completion + type-aware UI for your data. Live now at and coming soon to a notebook near you.
0
4
20
@shawnup
Shawn Lewis
11 months
🧵 Weave team is cranking! New eval comparison UI just landed. Get a beautiful high level summary of how your evals stack up, and then drill down to look at actual data in areas of disagreement. This is a beta feature you can use now. We’d love feedback.
1
9
21
@shawnup
Shawn Lewis
4 months
There is no cap on the rate at which the world can use more inference. How does exponential inference help? Example: simulate every possible cancer drug, at some point cancer is gone. Deepseek just implies more inference.
2
0
19
@shawnup
Shawn Lewis
3 years
We're hosting the next version of @borisdayma's DALL-E mini, now called "Craiyon". Come give it a try!
0
7
19
@shawnup
Shawn Lewis
4 months
Fixing this is the key to making reasoning agents work. I think better prompting should go a long way.
@zied_houidi
Zied Ben Houidi
4 months
1/12 We just found something unsettling: Today's most advanced AI models, including the latest powerhouse reasoning models, can't keep track of what actually happened. Even in a simple conversation. Our ICLR'25 paper reveals why this matters 🧵
1
0
19
@shawnup
Shawn Lewis
2 months
Hmm… Should I do a swebench run with this?
@benhylak
ben
2 months
o1-pro now available in API. it's @openai's most expensive model ever. $150/1m input tokens. $600/1m output tokens.
5
1
19
@shawnup
Shawn Lewis
11 months
We're on a roll adding new Weave auto-logging integrations. DSPy has a really interesting model for programming with LLMs. Tracing it with Weave (just call weave.init()!) will help you build an intuitive sense for how it works.
@soumikRakshit96
GeekyRakshit (e/mad)
11 months
📣 I am happy to announce that @weights_biases Weave is now integrated with DSPy. 🧶 Weave will automatically capture traces for DSPy. To start tracking, call `weave.init()` and use the library as normal. 👉 Learn more at
0
3
18
@shawnup
Shawn Lewis
2 months
This is the improvement Claude Code needed to be great. Just keep going! I don’t care how you do it or what the context looks like. Looking forward to trying it.
@_catwu
cat
2 months
Last up: auto-compact for context management. Claude Code now automatically compacts conversation history when you approach context limits, and it does a better job preserving important info while reducing token usage.
1
0
19
@shawnup
Shawn Lewis
3 years
@jmdagdelen Great tips! We’ve built tools to help with most of these at Weights & Biases. We’d love your thoughts if you ever take a look.
2
0
19
@shawnup
Shawn Lewis
5 months
The full o1 model is amazing. With reasoning_effort=high, it trounces o1-preview on swebench using my little agent framework. This is pass@1 on a shuffled sample of 100.
o1-high: 57/100 resolved, 76m tokens
o1-preview: 44/100 resolved, 94m tokens
4
4
19
@shawnup
Shawn Lewis
11 months
We just shipped the top requested Weave feature, the ability to add human feedback to calls!
1
4
18
@shawnup
Shawn Lewis
4 months
@mckaywrigley This is “developer unhobbling”, unaccounted for by Leopold, yet another OOM.
2
1
17
@shawnup
Shawn Lewis
3 months
A bunch of billion-$ ideas that could be shipped in a weekend are currently being over-engineered by startups all over SF.
0
2
18
@shawnup
Shawn Lewis
9 months
We just released the biggest update ever to the wandb SDK with v0.18.0. It's now significantly improved on all three quality axes: performance, robustness, efficiency. The secret? The all-new "core" logging service, written from the ground up in golang.
1
6
18
@shawnup
Shawn Lewis
10 months
@iruletheworldmo You also literally said this: "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins".
0
1
16
@shawnup
Shawn Lewis
11 months
Weave auto-logging for the amazing @llama_index is live! Just add these two lines to your Python LlamaIndex programs:

```
import weave
weave.init('llamaindex-project')
```

and get instant tracing, debugging, and evaluations.
1
6
17
@shawnup
Shawn Lewis
4 months
Being able to see the chain-of-thought is game changing.
3
2
15
@shawnup
Shawn Lewis
2 years
Last week we announced the new version of Weave: more powerful, flexible, and best of all open source under Apache2!
0
1
16
@shawnup
Shawn Lewis
3 years
@paulg Send a pack of Marlboros with the real cigarettes replaced with candy ones.
0
0
14
@shawnup
Shawn Lewis
2 years
Had a fun conversation with Sean Kerner about Weave and what it means for LLMOps and Model Monitoring earlier this week.
@VentureBeat
VentureBeat
2 years
Weights & Biases weaves new LLMOps capabilities for AI development and model monitoring
0
4
14
@shawnup
Shawn Lewis
3 years
@sergeykarayev The corpus of written human language is a giant pre-labeled dataset that encompasses all of humanity’s knowledge to date.
0
1
15
@shawnup
Shawn Lewis
3 months
Enjoyed some precious time with my fam on pat leave. Very excited to get back to work now. Feeling proud of W&B and all we’ve built.
0
0
13
@shawnup
Shawn Lewis
6 months
o1-preview is a significantly better model than it gets credit for here on x.
2
0
13
@shawnup
Shawn Lewis
10 months
@iruletheworldmo "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins" promise does not hold.
0
0
14
@shawnup
Shawn Lewis
6 months
@joshm I think @LondonBreed deserves a lot of credit for this.
0
0
7
@shawnup
Shawn Lewis
1 year
The most important principle when building applications with Generative AI: Make sure you log everything to a central system. Weave makes this a no-brainer. It took @vanpelt 20 minutes to integrate Weave into his recently launched OpenUI project:
@vanpelt
Chris Van Pelt (CVP)
1 year
If you think OpenAI is cool, you’re gonna love my latest side project OpenUI. Tired of writing HTML by hand and remembering tailwind classes? Let OpenUI do it for you:
1
1
13
@shawnup
Shawn Lewis
1 year
@OpenAI's new gpt-4-turbo-2024-04-09 is out and initial eval reports look very good! I've been poking it with HumanEval, which is a standard coding benchmark, and our new Weave Evaluation toolkit. Today the new model looks slightly worse on HumanEval than some prior models. Do
1
6
12
@shawnup
Shawn Lewis
10 months
@iruletheworldmo in 3 months: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
0
0
12
@shawnup
Shawn Lewis
1 year
Check out the docs to get started: or read our blog post here:
1
1
11
@shawnup
Shawn Lewis
4 months
@asankhaya @npew Oh, and if you're curious: pass@5 for my result is 71.2%.
0
2
11
@shawnup
Shawn Lewis
3 months
(sound on) The vibes are good with @AnthropicAI's Claude Code! Simple web app that listens to the microphone to visualize music. Built in like 10 minutes. I may have said things like "yo claude wassup" and "make it doper" in the prompts.
1
0
11
@shawnup
Shawn Lewis
8 months
Cofounder mode: go founder mode on your cofounders until they enter founder mode.
2
0
11
@shawnup
Shawn Lewis
3 years
@Bancor Congrats team! You are builders, through thick and thin.
0
0
10
@shawnup
Shawn Lewis
9 months
Top AI use case
1
1
11
@shawnup
Shawn Lewis
9 months
Logs
0
4
10
@shawnup
Shawn Lewis
10 months
Head-to-head competition on novel problems is the future of LLM evals. A very cool start here!
@weights_biases
Weights & Biases
10 months
Introducing Eris v0.1: LLM evaluation framework using debate simulations. Developed with OpenRouter and W&B Weave, Eris assesses models on reasoning, knowledge, and communication through structured debates. See how it performs and future plans. Read more:
0
1
10
@shawnup
Shawn Lewis
6 months
I’ve spent a lot of time (and $) evaluating o1-preview on swebench-Verified over the last 2 months. Here are a few of my learnings 🧵
2
1
10
@shawnup
Shawn Lewis
5 months
pytest-dev__pytest-10356 is a good example of an impossible task in SWEBench-Verified (which is supposed to have been validated by professional software devs as tractable) 🧵
1
0
10
@shawnup
Shawn Lewis
2 years
The Weave engine “weaves a compute graph through the UI”
1
0
10
@shawnup
Shawn Lewis
10 months
@iruletheworldmo and this: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
0
0
8
@shawnup
Shawn Lewis
2 years
Our customers have been using Weave to understand their models in context for a couple years:
1
0
9
@shawnup
Shawn Lewis
9 months
First o1-preview call logged with Weave. The model gets the correct answer to this MATH dataset problem that other models usually get wrong. "129"!
3
2
9
@shawnup
Shawn Lewis
1 year
Next, build up a centralized suite of Evaluations. Like unit tests in software, these ensure you don't regress. But with AI they are even more important: Evaluations are the only way to know if you are actually making progress. Weave Evaluations make this easy:
1
0
9
@shawnup
Shawn Lewis
10 months
@iruletheworldmo @flowersslop @tszzl @flyerthenag6 seriously though. tuesday. huge. "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
0
1
9
@shawnup
Shawn Lewis
9 months
How I use programmer to program programmer:
2
4
9
@shawnup
Shawn Lewis
4 months
Thanks for chatting! For the intrepid: buried in here is a description of how the crosscheck algorithm works.
@latentspacepod
Latent.Space 🔜 @aiDotEngineer
4 months
🆕 short pod - how @shawnup got SOTA SWE-Bench Verified. with @openai o1. by building his own tools (on @weights_biases Weave) to look at data!
1
1
9
@shawnup
Shawn Lewis
5 months
@Justin_Halford_ o1's baseline score is 48.9%, published by OpenAI (it was o1-preview at 41%). I could be wrong, but I don't think sonnet3.5 would beat o1 in my framework. The o1 solution seems more general because it actually does reason through what to do, and relies a lot less on its built-in.
2
0
9
@shawnup
Shawn Lewis
10 months
@iruletheworldmo Will you speak now?
1
0
7
@shawnup
Shawn Lewis
5 months
My first full run on SWEBench-Verified with o1 beats OpenAI's published result by 3.9%. Uses my little (as yet unpublished) agent framework. This is pass@1. I think there's plenty of room to go up from here.
4
0
8
@shawnup
Shawn Lewis
6 months
@peterrhague This is trolling and harassment though:
@DrAllyLouks
Dr Ally Louks
6 months
To be clear, this is where I draw the line. This is abhorrent and illegal and no one should ever have to deal with this.
1
0
8
@shawnup
Shawn Lewis
2 years
You can add new ops and panels in just a few lines of Python
1
0
8
@shawnup
Shawn Lewis
2 years
And Weave’s type system lets us suggest the right operations and visualizations as you need them.
1
0
8
@shawnup
Shawn Lewis
10 months
@iruletheworldmo you promise: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
2
0
8
@shawnup
Shawn Lewis
1 year
At @weights_biases, we try to build tools that seamlessly fit into your workflow, with minimal abstractions. Weave Tracking follows this principle. Simply decorate Python functions with `@weave.op()` to get automatic code and data versioning, and tracing to a central system.
1
0
8
@shawnup
Shawn Lewis
10 months
@iruletheworldmo troll: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
0
0
8
@shawnup
Shawn Lewis
4 months
@hsu_steve You can use more cheap inference to get better results. Demand is unbounded.
0
0
7
@shawnup
Shawn Lewis
4 months
@gneubig For a single rollout, avg is $7.50 per dataset instance (per swebench problem). For the crosscheck5 solution it's more like $7.50*5+$5.
1
0
8
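The cost estimate in the reply above follows a simple formula: N rollouts at the single-rollout average, plus a roughly fixed overhead for the crosscheck step. A minimal sketch of that arithmetic, using the approximate figures from the post (the constant and function names are illustrative, and the $5 overhead is the post's own rough estimate):

```python
SINGLE_ROLLOUT_USD = 7.50       # average cost per SWE-Bench instance, per the post
CROSSCHECK_OVERHEAD_USD = 5.00  # rough overhead of the crosscheck step, per the post


def crosscheck_cost(n_rollouts: int) -> float:
    """Approximate per-instance cost of a crosscheck over n_rollouts rollouts."""
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")
    return SINGLE_ROLLOUT_USD * n_rollouts + CROSSCHECK_OVERHEAD_USD


print(crosscheck_cost(5))  # 42.5, i.e. the "crosscheck5" estimate
```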
@shawnup
Shawn Lewis
1 year
We're just getting started, we've got a ton more exciting stuff in store. And we'd love to hear from you. Listening to users is the best way to make great tools, so please don't hesitate to get in touch!
0
0
8
@shawnup
Shawn Lewis
4 months
@seventhmeal $7.50 for a single rollout. $7.50*N for a crosscheck one + like $5.
0
0
7
@shawnup
Shawn Lewis
10 months
@iruletheworldmo you're silly: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".
0
0
7
@shawnup
Shawn Lewis
9 months
Just pushed a new release of programmer, my command-line AI programming tool. `pip install --upgrade programmer`. This release includes trajectory capture and "programmer ui" to browse trajectories.
1
1
7
@shawnup
Shawn Lewis
4 months
@mariofilhoml I’ll probably give it a shot when I can. But all models are different. Any new one will require some serious experimentation and tuning.
1
0
7
@shawnup
Shawn Lewis
8 months
I use Weave for hours every day. Each time the team adds a feature, I accelerate!
@_ScottCondron
Scott Condron
8 months
Recent updates to @weights_biases for LLM app development:
- Custom Usage and Cost Tracking, alongside automatically tracking most LLMs
- Chat view
- Evaluation dashboard
- Image support
- Programmatic exports
- Logging performance
This is alongside all of the improvements for
0
0
7
@shawnup
Shawn Lewis
3 months
Ha, thanks gpt4.5
0
0
7
@shawnup
Shawn Lewis
4 months
@StringChaos Between 55.8% and 57.4% for a single rollout.
1
0
6