Gideon M Profile
Gideon M

@gidim

Followers
518
Following
1K
Media
25
Statuses
225

CEO & Co-founder @cometml, creators of Opik

New York, USA
Joined April 2009
@gidim
Gideon M
1 year
LLMs and RAG pipelines are powerful, but without a strong evaluation pipeline, you're flying blind. Here's how to systematically test and improve your GenAI apps. 🧵 (1/6)
5
65
545
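A minimal sketch of such an evaluation pipeline with Opik's Python SDK; the dataset name, the placeholder app, and the metric choice below are illustrative, not taken from the thread:

```python
# A minimal Opik evaluation pipeline (a sketch, assuming the `opik`
# Python SDK); dataset name, app, and metric choices are illustrative.
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

def my_rag_app(question: str) -> str:
    # Placeholder for the real RAG / LLM app under test.
    return "stub answer"

def task(item: dict) -> dict:
    # Map one dataset item through the app; returned keys feed the metrics.
    return {"input": item["input"], "output": my_rag_app(item["input"])}

client = Opik()
dataset = client.get_dataset(name="support-questions")  # hypothetical dataset

evaluate(
    dataset=dataset,
    task=task,
    scoring_metrics=[Hallucination()],  # LLM-as-judge hallucination check
    experiment_name="baseline-v1",
)
```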
@gidim
Gideon M
6 days
We have seen the Opik meme coin popping up here. It gave me a laugh, but just to be clear, we have nothing to do with it and are not involved in any way 😄 No fees are being redirected to “my” wallet as they claim
26
4
27
@gidim
Gideon M
14 days
If you’re still manually editing prompts — or your LLM apps are underperforming — it’s time to treat agents like first-class systems. Try the Opik Agent Optimizer and see what happens 👇 https://t.co/a2S5vnQ46X I can’t wait to see what you build.
comet.com
Debug, evaluate, and monitor your LLM applications with comprehensive tracing and automated evaluations.
4
3
12
@gidim
Gideon M
14 days
And it doesn’t just drive novel research results. The Agent Optimizer helped @LangChainAI take a prompt with a 12% success rate and improve it to 97% in just 4 minutes.
1
3
10
@gidim
Gideon M
14 days
That’s why we built Opik Agent Optimizer — an open-source framework for agentic optimization. The workflow is simple:
1. Set up your dataset + evals
2. Choose an optimization algorithm
3. Watch your agent evolve in real time in the Opik UI
1
1
8
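A sketch of that three-step workflow using the opik-optimizer package; the model id, dataset name, toy metric, and choice of MetaPromptOptimizer are assumptions for illustration, not the author's setup:

```python
# The three steps above as code (a sketch, assuming the `opik-optimizer`
# package; model id, dataset name, and metric are illustrative).
from opik import Opik
from opik_optimizer import ChatPrompt, MetaPromptOptimizer

# 1. Set up your dataset + evals
client = Opik()
dataset = client.get_dataset(name="agent-tasks")  # hypothetical dataset

def exact_match(dataset_item: dict, llm_output: str) -> float:
    # Toy eval: 1.0 when the agent's answer matches the labeled output.
    return 1.0 if llm_output.strip() == dataset_item["expected_output"] else 0.0

# 2. Choose an optimization algorithm
prompt = ChatPrompt(messages=[
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "{input}"},
])
optimizer = MetaPromptOptimizer(model="openai/gpt-4o")

# 3. Run it and watch trials stream into the Opik UI
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=exact_match)
print(result.prompt)  # best prompt found (attribute names may vary by SDK version)
```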
@gidim
Gideon M
14 days
Just like model training or hyperparameter tuning, agentic optimization isn’t something you do manually. You need:
• datasets
• evals
• optimization algorithms
• iteration at scale
And you need a framework to enable it.
1
1
8
@gidim
Gideon M
14 days
If you “optimize” your agent by looking at your prompts, writing a few tweaks, and vibe-checking a couple of samples, you’re hurting your agent’s performance. This is like updating a model by guessing the weights.
1
1
7
@gidim
Gideon M
14 days
There is clearly a new phase of development emerging, centered around optimizing agents via their prompts instead of their underlying model. The problem is that most people treat this as a manual, ad-hoc process.
1
1
10
@gidim
Gideon M
14 days
Let’s look at the AI ecosystem over the last 18 months 👇
• @arcprize ARC-AGI is topped by self-evolving agentic systems
• Algorithms like GEPA went from research → standard practice
• More papers ask “How do we prompt this model for X?” instead of “How do we train a new model?”
1
2
11
@gidim
Gideon M
14 days
Switching to GPT-5.2 won’t fix your broken agent. Neither will switching to Gemini 3, Claude 4.5, or Kimi-V100.72-Deep-Thinking-Pro-Flash-Nano. And the reason is simple: You don’t need OpenAI to train a better model. You need to train a better agent. 🧵
20
27
82
@gidim
Gideon M
24 days
6/ Opik's Open Source Agent Optimization SDK lets you:
- Define metrics
- Plug in datasets
- Use naive chain-of-thought LLMs, genetic, hierarchical, or Bayesian optimizers
- Watch your agents self-tune in the loop
Think: automatic prompt engineering. Dive in 👇
comet.com
Debug, evaluate, and monitor your LLM applications with comprehensive tracing and automated evaluations.
1
0
4
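On "Define metrics": a sketch of a custom metric following Opik's documented BaseMetric pattern; the keyword-overlap heuristic itself is hypothetical, not a built-in Opik metric:

```python
# Custom metric sketch: subclass BaseMetric and return a ScoreResult.
# The keyword-overlap heuristic is a made-up example for illustration.
from opik.evaluation.metrics import base_metric, score_result

class KeywordOverlap(base_metric.BaseMetric):
    def __init__(self, name: str = "keyword_overlap"):
        super().__init__(name=name)

    def score(self, output: str, reference: str, **kwargs) -> score_result.ScoreResult:
        # Fraction of reference words that appear in the model output.
        out_words = set(output.lower().split())
        ref_words = set(reference.lower().split())
        overlap = len(out_words & ref_words) / max(len(ref_words), 1)
        return score_result.ScoreResult(value=overlap, name=self.name)
```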
@gidim
Gideon M
24 days
5/ The result?
✅ Accuracy jumped from 12% to 98%
✅ Same model (GPT-4 Turbo)
✅ Same infra
✅ Just a better system prompt, found by code, not copywriting.
Merged into LangChain's core repository from our optimizer.
1
0
5
@gidim
Gideon M
24 days
4/ We ran it through our agent optimizer:
📦 Dataset: 100 diverse JSON schemas
🧪 Metric: validates_json_schema
🧠 Optimizer: Hierarchical Reflective (HAPO)
🔁 Loop: Rewrite → Score → Improve → Repeat
1
0
1
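A plausible shape for the validates_json_schema metric, sketched here with the jsonschema package; the metric actually used in this run may be implemented differently:

```python
# What a validates_json_schema-style metric could look like; a sketch,
# not necessarily the implementation used in the run described above.
import json
import jsonschema

def validates_json_schema(llm_output: str, schema: dict) -> float:
    """1.0 if the output parses as JSON and satisfies the item's schema."""
    try:
        jsonschema.validate(instance=json.loads(llm_output), schema=schema)
        return 1.0
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0.0

# One of the "100 diverse JSON schemas" might look like this:
schema = {"type": "object",
          "properties": {"name": {"type": "string"}},
          "required": ["name"]}
print(validates_json_schema('{"name": "Ada"}', schema))  # 1.0
print(validates_json_schema("not json", schema))         # 0.0
```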
@gidim
Gideon M
24 days
3/ Here's what real prompt and agent optimization looks like:
Task: Enforce JSON output in LangChain
Problem: The default system prompt failed often
Solution: We treated it like an optimization problem.
1
0
1
@gidim
Gideon M
24 days
2/ If you’re still tuning prompts by "feel," you’re basically "training" a transformer by guessing the weights. Prompts are parameters. Metrics are gradients. Optimization is the missing loop.
1
0
3
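That analogy, made concrete as a minimal loop; rewrite and score are hypothetical stubs standing in for an LLM-driven rewriter and a real eval metric:

```python
# The parameters/gradients analogy as a bare-bones optimization loop.
def rewrite(prompt: str) -> str:
    # Stub: in practice, ask an LLM to critique the prompt and propose a better one.
    return prompt + " Respond only with valid JSON."

def score(prompt: str, dataset: list[dict]) -> float:
    # Stub: in practice, run the prompt over the dataset and average a metric.
    return 0.0

def optimize(prompt: str, dataset: list[dict], steps: int = 10) -> str:
    best_prompt, best_score = prompt, score(prompt, dataset)
    for _ in range(steps):
        candidate = rewrite(best_prompt)             # "gradient step" via an LLM
        candidate_score = score(candidate, dataset)  # metric as the signal
        if candidate_score > best_score:             # keep only improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt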
@gidim
Gideon M
24 days
Everyone’s building eval dashboards. But here’s the dirty secret: Most teams stare at those dashboards… and then go back to guessing prompts by hand. → That’s not engineering. That’s vibes 🧵
2
9
25
@Cometml
Comet
2 months
Opik just surpassed 15,000 stars on GitHub⭐ Guess we're not stopping anytime soon. 💻 https://t.co/yJqXZpp71M
3
4
12
@JacquesVerre
Jacques Verre
2 months
A few weeks ago I shipped dark mode in Opik in less than a day; it worked, but it wasn't perfect. Now Olesya, Opik's designer, has taken it to the next level thanks to Cursor. Having designers who can just ship features has been incredible!
0
1
3
@JacquesVerre
Jacques Verre
3 months
I've been working on agent optimization for real-world prompts (the prompt is ~10k tokens) and our new algorithm is already up 17%! Seeing some interesting differences between benchmarks and real-world performance; more to come soon
1
3
6
@JacquesVerre
Jacques Verre
3 months
Do people really try to one-shot features with Claude Code? I shipped Dark Mode for Opik in less than a day, but it took no less than 3 iterations before getting to something that was ready to be merged. A thread on how I use Claude Code 🧵
2
2
6