Gideon M Profile
Gideon M

@gidim

Followers
518
Following
1K
Media
25
Statuses
225

CEO & Co-founder @cometml, creators of Opik

New York, USA
Joined April 2009
@gidim
Gideon M
1 year
LLMs and RAG pipelines are powerful, but without a strong evaluation pipeline, you're flying blind. Here's how to systematically test and improve your GenAI apps. 🧵 (1/6)
5
65
545
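A minimal sketch of such an evaluation pipeline with Opik's Python SDK; the dataset name, the placeholder app, and the metric choice below are illustrative, not taken from the thread:

```python
# A minimal Opik evaluation pipeline (a sketch, assuming the `opik`
# Python SDK); dataset name, app, and metric choices are illustrative.
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination

def my_rag_app(question: str) -> str:
    # Placeholder for the real RAG / LLM app under test.
    return "stub answer"

def task(item: dict) -> dict:
    # Map one dataset item through the app; returned keys feed the metrics.
    return {"input": item["input"], "output": my_rag_app(item["input"])}

client = Opik()
dataset = client.get_dataset(name="support-questions")  # hypothetical dataset

evaluate(
    dataset=dataset,
    task=task,
    scoring_metrics=[Hallucination()],  # LLM-as-judge hallucination check
    experiment_name="baseline-v1",
)
```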
@gidim
Gideon M
6 days
We have seen the Opik meme coin popping up here. It gave me a laugh, but just to be clear, we have nothing to do with it and are not involved in any way 😄 No fees are being redirected to “my” wallet as they claim
26
4
27
@gidim
Gideon M
14 days
If you’re still manually editing prompts — or your LLM apps are underperforming — it’s time to treat agents like first-class systems. Try the Opik Agent Optimizer and see what happens 👇 https://t.co/a2S5vnQ46X I can’t wait to see what you build.
comet.com
Debug, evaluate, and monitor your LLM applications with comprehensive tracing and automated evaluations.
4
3
12
@gidim
Gideon M
14 days
And it doesn’t just drive novel research results. The Agent Optimizer helped @LangChainAI take a prompt with a 12% success rate and improve it to 97% in just 4 minutes.
1
3
10
@gidim
Gideon M
14 days
That’s why we built Opik Agent Optimizer — an open-source framework for agentic optimization. The workflow is simple:
1. Set up your dataset + evals
2. Choose an optimization algorithm
3. Watch your agent evolve in real time in the Opik UI
1
1
8
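A sketch of that three-step workflow using the opik-optimizer package; the model id, dataset name, toy metric, and choice of MetaPromptOptimizer are assumptions for illustration, not the author's setup:

```python
# The three steps above as code (a sketch, assuming the `opik-optimizer`
# package; model id, dataset name, and metric are illustrative).
from opik import Opik
from opik_optimizer import ChatPrompt, MetaPromptOptimizer

# 1. Set up your dataset + evals
client = Opik()
dataset = client.get_dataset(name="agent-tasks")  # hypothetical dataset

def exact_match(dataset_item: dict, llm_output: str) -> float:
    # Toy eval: 1.0 when the agent's answer matches the labeled output.
    return 1.0 if llm_output.strip() == dataset_item["expected_output"] else 0.0

# 2. Choose an optimization algorithm
prompt = ChatPrompt(messages=[
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "{input}"},
])
optimizer = MetaPromptOptimizer(model="openai/gpt-4o")

# 3. Run it and watch trials stream into the Opik UI
result = optimizer.optimize_prompt(prompt=prompt, dataset=dataset, metric=exact_match)
print(result.prompt)  # best prompt found (attribute names may vary by SDK version)
```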
@gidim
Gideon M
14 days
Just like model training or hyperparameter tuning, agentic optimization isn’t something you do manually. You need:
• datasets
• evals
• optimization algorithms
• iteration at scale
And you need a framework to enable it.
1
1
8
@gidim
Gideon M
14 days
If you “optimize” your agent by looking at your prompts, writing a few tweaks, and vibe-checking a couple of samples, you’re hurting your agent’s performance. This is like updating a model by guessing the weights.
1
1
7
@gidim
Gideon M
14 days
There is clearly a new phase of development emerging, centered around optimizing agents via their prompts instead of their underlying model. The problem is that most people treat this as a manual, ad-hoc process.
1
1
10
@gidim
Gideon M
14 days
Let’s look at the AI ecosystem over the last 18 months 👇
• @arcprize ARC-AGI is topped by self-evolving agentic systems
• Algorithms like GEPA went from research → standard practice
• More papers ask “How do we prompt this model for X?” instead of “How do we train a new model?”
1
2
11
@gidim
Gideon M
14 days
Switching to GPT-5.2 won’t fix your broken agent. Neither will switching to Gemini 3, Claude 4.5, or Kimi-V100.72-Deep-Thinking-Pro-Flash-Nano. And the reason is simple: You don’t need OpenAI to train a better model. You need to train a better agent. 🧵
20
27
82
@gidim
Gideon M
24 days
6/ Opik's Open Source Agent Optimization SDK lets you:
- Define metrics
- Plug in datasets
- Use naive chain-of-thought LLMs, genetic, hierarchical, or Bayesian optimizers
- Watch your agents self-tune in the loop
Think: automatic prompt engineering. Dive in 👇
comet.com
Debug, evaluate, and monitor your LLM applications with comprehensive tracing and automated evaluations.
1
0
4
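On "Define metrics": a sketch of a custom metric following Opik's documented BaseMetric pattern; the keyword-overlap heuristic itself is hypothetical, not a built-in Opik metric:

```python
# Custom metric sketch: subclass BaseMetric and return a ScoreResult.
# The keyword-overlap heuristic is a made-up example for illustration.
from opik.evaluation.metrics import base_metric, score_result

class KeywordOverlap(base_metric.BaseMetric):
    def __init__(self, name: str = "keyword_overlap"):
        super().__init__(name=name)

    def score(self, output: str, reference: str, **kwargs) -> score_result.ScoreResult:
        # Fraction of reference words that appear in the model output.
        out_words = set(output.lower().split())
        ref_words = set(reference.lower().split())
        overlap = len(out_words & ref_words) / max(len(ref_words), 1)
        return score_result.ScoreResult(value=overlap, name=self.name)
```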
@gidim
Gideon M
24 days
5/ The result?
✅ Accuracy jumped from 12% to 98%
✅ Same model (GPT-4 Turbo)
✅ Same infra
✅ Just a better system prompt, found by code, not copywriting.
Merged into LangChain's core repository from our optimizer.
1
0
5
@gidim
Gideon M
24 days
4/ We ran it through our agent optimizer:
📦 Dataset: 100 diverse JSON schemas
🧪 Metric: validates_json_schema
🧠 Optimizer: Hierarchical Reflective (HAPO)
🔁 Loop: Rewrite → Score → Improve → Repeat
1
0
1
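A plausible shape for the validates_json_schema metric, sketched here with the jsonschema package; the metric actually used in this run may be implemented differently:

```python
# What a validates_json_schema-style metric could look like; a sketch,
# not necessarily the implementation used in the run described above.
import json
import jsonschema

def validates_json_schema(llm_output: str, schema: dict) -> float:
    """1.0 if the output parses as JSON and satisfies the item's schema."""
    try:
        jsonschema.validate(instance=json.loads(llm_output), schema=schema)
        return 1.0
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0.0

# One of the "100 diverse JSON schemas" might look like this:
schema = {"type": "object",
          "properties": {"name": {"type": "string"}},
          "required": ["name"]}
print(validates_json_schema('{"name": "Ada"}', schema))  # 1.0
print(validates_json_schema("not json", schema))         # 0.0
```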
@gidim
Gideon M
24 days
3/ Here's what real prompt and agent optimization looks like:
Task: Enforce JSON output in LangChain
Problem: The default system prompt failed often
Solution: We treated it like an optimization problem.
1
0
1
@gidim
Gideon M
24 days
2/ If you’re still tuning prompts by "feel," you’re basically "training" a transformer by guessing the weights. Prompts are parameters. Metrics are gradients. Optimization is the missing loop.
1
0
3
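That analogy, made concrete as a minimal loop; rewrite and score are hypothetical stubs standing in for an LLM-driven rewriter and a real eval metric:

```python
# The parameters/gradients analogy as a bare-bones optimization loop.
def rewrite(prompt: str) -> str:
    # Stub: in practice, ask an LLM to critique the prompt and propose a better one.
    return prompt + " Respond only with valid JSON."

def score(prompt: str, dataset: list[dict]) -> float:
    # Stub: in practice, run the prompt over the dataset and average a metric.
    return 0.0

def optimize(prompt: str, dataset: list[dict], steps: int = 10) -> str:
    best_prompt, best_score = prompt, score(prompt, dataset)
    for _ in range(steps):
        candidate = rewrite(best_prompt)             # "gradient step" via an LLM
        candidate_score = score(candidate, dataset)  # metric as the signal
        if candidate_score > best_score:             # keep only improvements
            best_prompt, best_score = candidate, candidate_score
    return best_prompt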
@gidim
Gideon M
24 days
Everyone’s building eval dashboards. But here’s the dirty secret: Most teams stare at those dashboards… and then go back to guessing prompts by hand. → That’s not engineering. That’s vibes 🧵
2
9
25
@Cometml
Comet
2 months
Opik just surpassed 15,000 stars on GitHub⭐ Guess we're not stopping anytime soon. 💻 https://t.co/yJqXZpp71M
3
4
12
@JacquesVerre
Jacques Verre
2 months
A few weeks ago I shipped dark mode in Opik in less than a day; it worked, but it wasn't perfect. Now Olesya, Opik's designer, has taken it to the next level thanks to Cursor. Having designers who can just ship features has been incredible!
0
1
3
@JacquesVerre
Jacques Verre
3 months
I've been working on agent optimization for real-world prompts (the prompt is ~10k tokens) and our new algorithm is already up 17%! Seeing some interesting differences between benchmarks and real-world performance; more to come soon
1
3
6
@JacquesVerre
Jacques Verre
3 months
Do people really try to one-shot features with Claude Code? I shipped Dark Mode for Opik in less than a day, but it took no less than 3 iterations before getting to something that was ready to be merged. A thread on how I use Claude Code 🧵
2
2
6