Shawn Lewis @shawnup profile

Shawn Lewis

@shawnup

Followers

2K

Following

1K

Media

61

Statuses

494

Founder & CTO @weights_biases. Building tools for AI.

Joined March 2011

Shawn Lewis

@shawnup

4 months

My o1-based AI programming agent is now state of the art on SWE-Bench Verified! It resolves 64.6% of issues. This is the first fully o1-driven agent we know of. And we learned a ton building it.

52

162

1K

Shawn Lewis

@shawnup

4 months

Our SWE-Bench submission has been accepted and is officially SOTA! Thanks SWE-Bench team for making such an important benchmark.

12

18

272

Shawn Lewis

@shawnup

4 months

How it works:
• o1 with reasoning_mode high for all agent step and editing logic
• a gpt4o based memory component that compresses the agent’s step history
• a custom built python code editor toolset designed to efficiently use model context
• the ability to register.

7

5

161

Shawn Lewis

@shawnup

5 months

New result for my pure o1-based agent: 57.4% pass@1 on SWEBench-Verified!
Avg cost: $7.5 per instance.
Avg time: 13.5 minutes per instance.
Pass@3 is 67.8%. Now I'm working on "test time compute scaling", i.e. combining/choosing the best trajectories, to push closer to this mark.

5

16

142
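The post above does not say how pass@1 and pass@3 were computed. One common approach is the unbiased estimator used with HumanEval-style benchmarks: with n attempts per problem, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes that estimator; the function names and the per-problem counts are hypothetical and are not taken from the author's framework.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n attempts, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts has no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three attempts on each of four problems.
example = [(3, 3), (3, 1), (3, 0), (3, 2)]
print(f"pass@1 = {mean_pass_at_k(example, 1):.3f}")
print(f"pass@3 = {mean_pass_at_k(example, 3):.3f}")
```

With three attempts per problem, pass@3 reduces to "at least one attempt resolved the issue", which is why it sits above pass@1.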

Shawn Lewis

@shawnup

1 year

I'm very excited to announce Weave, our new tools to track and evaluate your LLM apps. Use Weave to:
🍩 log and version LLM interactions and surrounding data, from development to production
🍩 experiment with prompting techniques, model changes, and parameters
🍩 evaluate your

4

35

135

Shawn Lewis

@shawnup

3 years

Coming soon to a notebook near you: This is our Table visualizer, powered by a new technology for building composable applications called Weave.

3

18

128

Shawn Lewis

@shawnup

4 months

Here's my writeup on the solution and how we did it: Read for o1 tips and lots of other nuggets.

3

11

121

Shawn Lewis

@shawnup

10 months

@iruletheworldmo You said: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

6

1

113

Shawn Lewis

@shawnup

4 months

o1 is a different beast. It's better at doing exactly what you say. It's better at solving hard coding problems. And the advice others have given to specify the outcome you want and give it room to operate is spot on.

3

108

Shawn Lewis

@shawnup

10 months

@iruletheworldmo For readers, there were just more than 2000 people in a Twitter space for 1 hour, with @iruletheworldmo promising to speak, many well-respected folks in the space. 🍓 did not speak. Conclusion: do not waste your time.

1

78

Shawn Lewis

@shawnup

4 years

We’ve been hard at work building Tables, a new way to organize, understand and improve your data. Today we're opening it up to everyone! Try it here:

3

26

76

Shawn Lewis

@shawnup

4 months

And I built a new typescript-based agent framework called phaseshift that's deeply integrated with Weave. I'm excited to polish it up and release it to the world!

5

2

74

Shawn Lewis

@shawnup

10 months

@iruletheworldmo @ChatGPTapp here's your receipt: "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins".

1

0

67

Shawn Lewis

@shawnup

3 months

I’m incredibly proud of everything our team at @weights_biases has accomplished, and excited to keep building with the amazing folks from @CoreWeave!

Weights & Biases

@weights_biases

3 months

Today we announced that we are being acquired by @CoreWeave, the AI Hyperscaler. 🪄🐝 We could not be prouder or more excited to join forces with this team. Our CEO, @l2k, wrote a blog post with more details:

6

4

71

Shawn Lewis

@shawnup

4 months

I built a new "Eval Studio" along the way that I can't live without now. The concepts from this will make their way into Weave

2

4

57

Shawn Lewis

@shawnup

10 months

@iruletheworldmo "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins" Tap this sign.

0

52

Shawn Lewis

@shawnup

10 months

@iruletheworldmo @elonmusk uh-huh "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

1

0

56

Shawn Lewis

@shawnup

4 months

Every time my tools improved, my progress accelerated. Our Weave toolkit got better every week as I did this. In particular the new playground with first-class support for running multiple trials was killer.

2

0

56

Shawn Lewis

@shawnup

6 months

Wow! Google's new Willow quantum chip: "Second, Willow performed a standard benchmark computation in under five minutes that would take one of today’s fastest supercomputers 10 septillion (that is, 10^25) years — a number that vastly exceeds the age of the Universe."

6

11

52

Shawn Lewis

@shawnup

4 months

This is our first step into the world of AI programming tools, and we can't wait to help our customers build more, faster, with these capabilities. Follow along for updates!

3

0

51

Shawn Lewis

@shawnup

2 years

I'll be on stage at Fully Connected tomorrow, to launch something we've been working on for a very long time. Here's a little peek. Join us!

0

14

46

Shawn Lewis

@shawnup

5 months

Work smarter not harder. Subset of SWEBench-Verified, with my little agent framework:
o1, reasoning_effort=high. 57/100 resolved. $1438.
o1, reasoning_effort=medium. 46/100 resolved. $1518.

6

5

46
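The two runs above can be compared on cost per resolved issue, which makes the "work smarter" point concrete. This is a minimal sketch using only the totals quoted in the post; the helper name `cost_per_resolved` is made up for illustration.

```python
def cost_per_resolved(total_cost_usd: float, resolved: int) -> float:
    """Total spend divided by the number of issues the agent resolved."""
    if resolved <= 0:
        raise ValueError("resolved must be positive")
    return total_cost_usd / resolved


# (total cost in USD, issues resolved out of 100), as quoted in the post.
runs = {
    "reasoning_effort=high": (1438, 57),
    "reasoning_effort=medium": (1518, 46),
}

for name, (cost, resolved) in runs.items():
    print(f"{name}: ${cost_per_resolved(cost, resolved):.2f} per resolved issue")
```

On these numbers the high-effort run is cheaper in total and resolves more issues, so it comes out at roughly $25 per resolved issue against $33 for medium effort.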

Shawn Lewis

@shawnup

2 years

It's 2023, and you’re still talking about no-code? Weave is the world’s first “pro-code” toolkit, a UI for programmers.

1

15

43

Shawn Lewis

@shawnup

10 months

@iruletheworldmo set this straight: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

0

36

Shawn Lewis

@shawnup

4 months

It looks like there's a new SOTA submission for SWE-Bench Verified from isoform, with 70.6% resolved. That is a very impressive result. There's no information about how it works. What is isoform? The website is minimal and cryptic. @bozhao tell us more!

2

0

36

Shawn Lewis

@shawnup

1 year

gpt-4o gets slightly better accuracy than the most recent gpt-4 on HumanEval, but 4x faster. Real-time voice is next level, but going this fast on just text is amazing by itself.
3 HumanEval trials. gpt-4-2024-04-09 -> gpt-4o-2024-05-13
87.8% -> 89.2% accuracy
9.37s -> 2.39s

1

5

35

Shawn Lewis

@shawnup

10 months

I've been coding with my own command-line agent every day for the last year, and now you can too: `pip install programmer` 🧵

1

11

35

Shawn Lewis

@shawnup

2 years

@zebulgar Go ahead and build OpenAI-style talent density in Miami, lol.

2

0

28

Shawn Lewis

@shawnup

4 months

Aha! Just cracked the code for getting o3-mini to work as a programming agent.

10

0

31

Shawn Lewis

@shawnup

3 months

@MartinShkreli @sesame @brendaniribe This is an amazing new medium of entertainment.

1

0

30

Shawn Lewis

@shawnup

5 months

Next result for my pure o1-based agent: 61.0% on SWEBench-Verified!
Avg cost: $15 per instance.
Avg time: ~18 minutes per instance.
My top-secret (I hope to share in a blogpost soon) cross-check algorithm automatically chooses the best of two parallel runs. Next up, 3 runs.

2

1

28

Shawn Lewis

@shawnup

2 months

Claude Code + WebVR + Apple Vision Pro = Holodeck!!! Here's a 7 minute video of me vibe coding with this setup. There's so much potential here and this is just the start.

4

3

28

Shawn Lewis

@shawnup

2 months

Why am I debugging AI’s code instead of AI debugging my code?

5

3

26

Shawn Lewis

@shawnup

4 months

@zied_houidi Nice to see this systematically confirmed. I wrote about this issue here: Did you try any prompting techniques to fix it?

2

0

22

Shawn Lewis

@shawnup

7 months

Instant charts have been achieved internally.

1

9

23

Shawn Lewis

@shawnup

3 years

Our new Weave expression editor. Powerful, pluggable, auto-completion + type-aware UI for your data. Live now at and coming soon to a notebook near you.

0

4

20

Shawn Lewis

@shawnup

11 months

🧵 Weave team is cranking! New eval comparison UI just landed. Get a beautiful high level summary of how your evals stack up, and then drill down to look at actual data in areas of disagreement. This is a beta feature you can use now. We’d love feedback.

1

9

21

Shawn Lewis

@shawnup

4 months

There is no cap on the rate at which the world can use more inference. How does exponential inference help? Example: simulate every possible cancer drug, at some point cancer is gone. Deepseek just implies more inference.

2

0

19

Shawn Lewis

@shawnup

3 years

We're hosting the next version of @borisdayma's DALL-E mini, now called "Craiyon". Come give it a try!

0

7

19

Shawn Lewis

@shawnup

4 months

Fixing this is the key to making reasoning agents work. I think better prompting should go a long way.

Zied Ben Houidi

@zied_houidi

4 months

1/12 We just found something unsettling: Today's most advanced AI models - including the latest powerhouse reasoning models - can't keep track of what actually happened. Even in a simple conversation. Our ICLR'25 paper reveals why this matters 🧵.

1

0

19

Shawn Lewis

@shawnup

2 months

Hmm… Should I do a swebench run with this?

ben

@benhylak

2 months

o1-pro now available in API. it's @openai's most expensive model ever. $150/1m input tokens. $600/1m output tokens.

5

1

19

Shawn Lewis

@shawnup

11 months

We're on a roll adding new Weave auto-logging integrations. DSPy has a really interesting model for programming with LLMs. Tracing it with Weave (just call weave.init()!) will help you build an intuitive sense for how it works.

GeekyRakshit (e/mad)

@soumikRakshit96

11 months

📣 I am happy to announce that @weights_biases Weave is now integrated with DSPy. 🧶 Weave will automatically capture traces for DSPy. To start tracking, call `weave.init()` and use the library as normal. 👉 Learn more at

0

3

18

Shawn Lewis

@shawnup

2 months

This is the improvement Claude Code needed to be great. Just keep going! I don’t care how you do it or what the context looks like. Looking forward to trying it.

cat

@_catwu

2 months

Last up: auto-compact for context management. Claude Code now automatically compacts conversation history when you approach context limits, and it does a better job preserving important info while reducing token usage.

1

0

19

Shawn Lewis

@shawnup

3 years

@jmdagdelen Great tips! We’ve built tools to help with most of these at Weights & Biases. We’d love your thoughts if you ever take a look.

2

0

19

Shawn Lewis

@shawnup

5 months

The full o1 model is amazing. With reasoning_effort=high, it trounces o1-preview on swebench using my little agent framework. This is pass@1 on a shuffled sample of 100.
o1-high: 57/100 resolved, 76m tokens
o1-preview: 44/100 resolved, 94m tokens

4

19

Shawn Lewis

@shawnup

11 months

We just shipped the top requested Weave feature, the ability to add human feedback to calls!

1

4

18

Shawn Lewis

@shawnup

4 months

@mckaywrigley This is “developer unhobbling”, unaccounted for by Leopold, yet another OOM.

2

1

17

Shawn Lewis

@shawnup

3 months

A bunch of billion-$ ideas that could be shipped in a weekend are currently being over-engineered by startups all over SF.

0

2

18

Shawn Lewis

@shawnup

9 months

We just released the biggest update ever to the wandb SDK with v0.18.0. It's now significantly improved on all three quality axes: performance, robustness, efficiency. The secret? The all new "core" logging service, written from the ground up in golang.

1

6

18

Shawn Lewis

@shawnup

10 months

@iruletheworldmo You also literally said this: "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins".

0

1

16

Shawn Lewis

@shawnup

11 months

Weave auto-logging for the amazing @llama_index is live! Just add these two lines to your Python Llamaindex programs:

```
import weave
weave.init('llamaindex-project')
```

and get instant tracing, debugging, and evaluations.

1

6

17

Shawn Lewis

@shawnup

4 months

Being able to see the chain-of-thought is game changing.

3

2

15

Shawn Lewis

@shawnup

2 years

Last week we announced the new version of Weave: more powerful, flexible, and best of all open source under Apache2!

0

1

16

Shawn Lewis

@shawnup

3 years

@paulg Send a pack of Marlboros with the real cigarettes replaced with candy ones.

0

14

Shawn Lewis

@shawnup

2 years

Had a fun conversation with Sean Kerner about Weave and what it means for LLMOps and Model Monitoring earlier this week.

VentureBeat

@VentureBeat

2 years

Weights & Biases weaves new LLMOps capabilities for AI development and model monitoring

0

4

14

Shawn Lewis

@shawnup

3 years

@sergeykarayev The corpus of written human language is a giant pre-labeled dataset that encompasses all of humanity’s knowledge to date.

0

1

15

Shawn Lewis

@shawnup

3 months

Enjoyed some precious time with my fam on pat leave. Very excited to get back to work now. Feeling proud of W&B and all we’ve built.

0

13

Shawn Lewis

@shawnup

6 months

o1-preview is a significantly better model than it gets credit for here on x.

2

0

13

Shawn Lewis

@shawnup

10 months

@iruletheworldmo "attention isn't all you need new architecture announcement august 13th @ 10am pt the singularity begins" promise does not hold.

0

14

Shawn Lewis

@shawnup

6 months

@joshm I think @LondonBreed deserves a lot of credit for this.

0

7

Shawn Lewis

@shawnup

1 year

The most important principle when building applications with Generative AI: Make sure you log everything to a central system. Weave makes this a no-brainer. It took @vanpelt 20 minutes to integrate Weave into his recently launched OpenUI project:

Chris Van Pelt (CVP)

@vanpelt

1 year

If you think OpenAI is cool, you’re gonna love my latest side project OpenUI. Tired of writing HTML by hand and remembering tailwind classes? Let OpenUI do it for you:

1

13

Shawn Lewis

@shawnup

1 year

@OpenAI's new gpt-4-turbo-2024-04-09 is out and initial eval reports look very good! I've been poking it with HumanEval, which is a standard coding benchmark, and our new Weave Evaluation toolkit. Today the new model looks slightly worse on HumanEval than some prior models. Do

1

6

12

Shawn Lewis

@shawnup

10 months

@iruletheworldmo in 3 months: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

0

12

Shawn Lewis

@shawnup

1 year

Check out the docs to get started: or read our blog post here:

1

11

Shawn Lewis

@shawnup

4 months

@asankhaya @npew Oh and if you're curious. pass@5 for my result is 71.2%.

0

2

11

Shawn Lewis

@shawnup

3 months

(sound on) The vibes are good with @AnthropicAI's Claude Code! Simple web app that listens to the microphone to visualize music. Built in like 10 minutes. I may have said things like "yo claude wassup" and "make it doper" in the prompts.

1

0

11

Shawn Lewis

@shawnup

8 months

Cofounder mode: go founder mode on your cofounders until they enter founder mode.

2

0

11

Shawn Lewis

@shawnup

3 years

@Bancor Congrats team! You are builders, through thick and thin.

0

10

Shawn Lewis

@shawnup

9 months

Top AI use case

1

11

Shawn Lewis

@shawnup

9 months

Logs

0

4

10

Shawn Lewis

@shawnup

10 months

Head-to-head competition on novel problems is the future of LLM evals. A very cool start here!

Weights & Biases

@weights_biases

10 months

Introducing Eris v0.1: LLM evaluation framework using debate simulations. Developed with OpenRouter and W&B Weave, Eris assesses models on reasoning, knowledge, and communication through structured debates. See how it performs and future plans:
Read more:

0

1

10

Shawn Lewis

@shawnup

6 months

I’ve spent a lot of time (and $) evaluating o1-preview on swebench-Verified over the last 2 months. Here are a few of my learnings 🧵

2

1

10

Shawn Lewis

@shawnup

5 months

pytest-dev__pytest-10356 is a good example of an impossible task in SWEBench-Verified (which is supposed to have been validated by professional software devs as tractable) 🧵

1

0

10

Shawn Lewis

@shawnup

2 years

The Weave engine “weaves a compute graph through the UI”

1

0

10

Shawn Lewis

@shawnup

10 months

@iruletheworldmo and this: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

0

8

Shawn Lewis

@shawnup

2 years

Our customers have been using Weave to understand their models in context for a couple years:

1

0

9

Shawn Lewis

@shawnup

9 months

First o1-preview call logged with Weave. The model gets the correct answer to this MATH dataset problem that other models usually get wrong. "129"!

3

2

9

Shawn Lewis

@shawnup

1 year

Next, build up a centralized suite of Evaluations. Like unit tests in software, these ensure you don't regress. But with AI they are even more important: Evaluations are the only way to know if you are actually making progress. Weave Evaluations make this easy:

1

0

9

Shawn Lewis

@shawnup

10 months

@iruletheworldmo @flowersslop @tszzl @flyerthenag6 seriously though. tuesday. huge. "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

0

1

9

Shawn Lewis

@shawnup

9 months

How I use programmer to program programmer:

2

4

9

Shawn Lewis

@shawnup

4 months

Thanks for chatting! For the intrepid: buried in here is a description of how the crosscheck algorithm works.

Latent.Space 🔜 @aiDotEngineer

@latentspacepod

4 months

🆕 short pod - how @shawnup got SOTA SWE-Bench Verified with @openai o1 by building his own tools (on @weights_biases Weave) to look at data!

1

9

Shawn Lewis

@shawnup

5 months

@Justin_Halford_ o1's baseline score is 48.9%, published by OpenAI (it was o1-preview at 41%). I could be wrong, but I don't think sonnet3.5 would beat o1 in my framework. The o1 solution seems more general because it actually does reason through what to do, and relies a lot less on its built-in.

2

0

9

Shawn Lewis

@shawnup

10 months

@iruletheworldmo Will you speak now?

1

0

7

Shawn Lewis

@shawnup

5 months

My first full run on SWEBench-Verified with o1 beats OpenAI's published result by 3.9%. Uses my little (as yet unpublished) agent framework. This is pass@1. I think there's plenty of room to go up from here.

4

0

8

Shawn Lewis

@shawnup

6 months

@peterrhague This is trolling and harassment though:

Dr Ally Louks

@DrAllyLouks

6 months

To be clear, this is where I draw the line. This is abhorrent and illegal and no one should ever have to deal with this.

1

0

8

Shawn Lewis

@shawnup

2 years

You can add new ops and panels in just a few lines of Python

1

0

8

Shawn Lewis

@shawnup

2 years

And Weave’s type system lets us suggest the right operations and visualizations as you need them.

1

0

8

Shawn Lewis

@shawnup

10 months

@iruletheworldmo you promise: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

2

0

8

Shawn Lewis

@shawnup

1 year

At @weights_biases, we try to build tools that seamlessly fit into your workflow, with minimal abstractions. Weave Tracking follows this principle. Simply decorate Python functions with `@weave.op()` to get automatic code and data versioning, and tracing to a central system.

1

0

8

Shawn Lewis

@shawnup

10 months

@iruletheworldmo troll: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

0

8

Shawn Lewis

@shawnup

4 months

@hsu_steve You can use more cheap inference to get better results. Demand is unbounded.

0

7

Shawn Lewis

@shawnup

4 months

@gneubig For a single rollout, avg is $7.50 per dataset instance (per swebench problem). For the crosscheck5 solution it's more like $7.50*5+$5.

1

0

8
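The cost figure in the reply above works out as simple arithmetic: N rollouts at about $7.50 each plus roughly $5 for the crosscheck step. A minimal sketch follows, with the function name and parameter names invented for illustration and the dollar amounts taken as the approximate averages quoted in the post.

```python
def crosscheck_cost(
    n_rollouts: int,
    rollout_cost_usd: float = 7.50,   # approximate average per rollout, from the post
    crosscheck_overhead_usd: float = 5.0,  # approximate cost of the selection step
) -> float:
    """Approximate per-instance cost of running N rollouts and cross-checking them."""
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")
    return n_rollouts * rollout_cost_usd + crosscheck_overhead_usd


# crosscheck5: five rollouts per SWE-Bench problem plus the selection overhead.
print(f"${crosscheck_cost(5):.2f} per instance")
```

For crosscheck5 this gives about $42.50 per instance, compared with about $7.50 for a single rollout.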

Shawn Lewis

@shawnup

1 year

We're just getting started, we've got a ton more exciting stuff in store. And we'd love to hear from you. Listening to users is the best way to make great tools, so please don't hesitate to get in touch!

0

8

Shawn Lewis

@shawnup

4 months

@seventhmeal $7.50 for a single rollout. $7.50*N for a crosscheck one + like $5.

0

7

Shawn Lewis

@shawnup

10 months

@iruletheworldmo you're silly: "attention isn't all you need, new architecture announcement, august 13th @ 10am pt the singularity begins".

0

7

Shawn Lewis

@shawnup

9 months

Just pushed a new release of programmer, my command-line AI programming tool. `pip install --upgrade programmer`. This release includes trajectory capture and "programmer ui" to browse trajectories.

1

7

Shawn Lewis

@shawnup

4 months

@mariofilhoml I’ll probably give it a shot when I can. But all models are different. Any new one will require some serious experimentation and tuning.

1

0

7

Shawn Lewis

@shawnup

8 months

I use Weave for hours every day. Each time the team adds a feature, I accelerate!

Scott Condron

@_ScottCondron

8 months

Recent updates to @weights_biases for LLM app development:
- Custom Usage and Cost Tracking, alongside automatically tracking most LLMs
- Chat view
- Evaluation dashboard
- Image support
- Programmatic exports
- Logging performance

This is alongside all of the improvements for

0

7

Shawn Lewis

@shawnup

3 months

Ha, thanks gpt4.5

0

7

Shawn Lewis

@shawnup

4 months

@StringChaos Between 55.8% and 57.4% for a single rollout.

1

0

6