Tomas Hernando Kofman (@tomas_hk) · 2K Followers · 686 Following · 94 Media · 551 Statuses
Today we’re launching Prompt Adaptation, a state-of-the-art agentic system that automatically adapts prompts across LLMs. Prompt Adaptation outperforms all other methods and significantly improves accuracy over manual prompt engineering, saving you thousands of hours per year.
(if you want to try out our prompt optimizer, dm me and I can whitelist you)
Kimi K2 Thinking might be the world's most powerful model right now—but I will bet $100 that you're not using it correctly. Most people are just beginning to understand the fragile nature of prompts and how they drift over time. 30-60% of prompts will degrade when switching
Excited to share that today we are deepening our partnership with SAP. Prompt optimization is now available through SAP’s Generative AI Hub, enabling developers building in SAP’s ecosystem to automatically optimize AI prompts across different models, dramatically enhancing
I had fully bought into GEPA's Pareto-frontier framing for prompts, but the ACE paper changed my thinking in two important ways. 🧵
Rootly used Not Diamond to optimize their prompts on SRE tasks and *doubled* performance on Sonnet and nearly maxxed out GPT-5 performance (91.3% -> 97.4%). Hell yeah ♥️
While Sonnet-4.5 remains a popular choice among developers, our benchmarks show it underperforms GPT-5 on SRE-related tasks when both are run with default parameters. However, using the @notdiamond_ai prompt adaptation platform, Sonnet-4.5 achieved up to a 2x performance
Optimized prompts let smaller models deliver stronger results while reducing cost and latency. This matters even more in multi-prompt applications, where latency compounds at every step. dm me for access if you want to try it out 🤍
With Prompt Adaptation, in ~30 minutes of background processing, we automatically generate and test many prompt variations and find the best-performing one. The resulting Gemini 2.5 Flash prompt scored 97.5%, outperforming the stronger Pro baseline by 4.5%.
Clinc150 is a dataset for intent classification in conversational assistants. A prompt written for Gemini 2.5 Pro scored 93% accuracy. On Gemini 2.5 Flash (a faster, cheaper model), the same prompt scored 86.75%.
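To make the comparison concrete, here is a minimal sketch of how one might quantify that kind of cross-model regression. The helper and the hand-written predictions are illustrative stand-ins (this is not Not Diamond's API; a real run would send the same prompt to each model over a labeled eval set):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Illustrative intent-classification outputs: the same prompt, run on two
# models, can disagree on identical Clinc150-style inputs.
labels            = ["transfer", "balance", "freeze_account", "pin_change"]
pro_predictions   = ["transfer", "balance", "freeze_account", "pin_change"]
flash_predictions = ["transfer", "balance", "card_declined",  "pin_change"]

drop = accuracy(pro_predictions, labels) - accuracy(flash_predictions, labels)
print(f"regression from Pro to Flash: {drop:.2%}")
```

The interesting number is the delta, not either absolute score: a prompt is "portable" only if that delta stays near zero when you swap models.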
Weaker models → stronger results
Strong vs weak model comparisons are normally a tradeoff between performance and cost/latency. But we can level the field with prompt adaptation.
Full blog post with original and adapted prompts:
notdiamond.ai
This post is part of an ongoing series where we share examples of Prompt Adaptation in practice. The goal is to highlight real scenarios, real results, and the insights that emerge from them.
tldr: newer models don’t automatically guarantee better results. Without adaptation, migrations often lead to regressions, technical debt, and last-minute scrambles when models are deprecated. With Prompt Adaptation, prompts are automatically optimized so teams can improve
The adapted prompt for Sonnet 4 reached 89% accuracy, not only reversing the regression but also surpassing both GPT-4o and Sonnet 4 with the original prompt.
With Prompt Adaptation, the process is automated. In ~30 minutes of background processing, the system generates many prompt variations and identifies the best-performing one.
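The generate-and-select loop described above can be sketched as a simple hill climb. Everything here is a toy stand-in for the real system: `toy_mutate` and `toy_evaluate` are hypothetical placeholders for an LLM-driven rewriter and a scored eval set, and the function names are mine, not Not Diamond's:

```python
import random

def best_prompt(base_prompt, evaluate, mutate, n_variants=20, seed=0):
    """Greedily generate prompt variants, keeping whichever scores highest
    on a held-out evaluation set."""
    rng = random.Random(seed)
    best, best_score = base_prompt, evaluate(base_prompt)
    for _ in range(n_variants):
        candidate = mutate(best, rng)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy stand-ins: a real system would call an LLM to rewrite the prompt
# and score each variant against labeled examples.
SUFFIXES = [" Think step by step.", " Answer with one label.", " Be concise."]

def toy_mutate(prompt, rng):
    return prompt + rng.choice(SUFFIXES)

def toy_evaluate(prompt):
    # Pretend longer, more structured prompts score a bit better, capped at 1.0.
    return min(1.0, 0.8 + 0.05 * prompt.count("."))

best, score = best_prompt("Classify the user's intent.", toy_evaluate, toy_mutate)
```

The real system presumably searches a much richer variant space, but the contract is the same: candidates in, scores out, best one kept.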
Even though Sonnet 4 is stronger on benchmarks, it underperformed GPT-4o without adapting the prompt. Traditionally, fixing this requires a ton of manual trial and error. Many customers have estimated up to 40hrs of engineering work to rewrite and test prompts for new models.
As an example: a prompt originally written for GPT-4o (released Nov ’24, now ~2 generations old) to perform intent classification on Banking77. The original prompt scored 82.5% accuracy on GPT-4o. Running the same prompt on Sonnet 4 (released May ’25) dropped accuracy to 80%.
When teams migrate to a newer model with the same prompt, performance often regresses. Prompts aren’t portable; each model version interprets instructions differently.
Better models → worse results: why prompt adaptation matters
When a new model is released, the expectation is simple: stronger benchmarks should translate into stronger real-world performance. But we generally see the opposite.
This debate has really captured the timeline. Sadly, most folks discussing it are missing the nuance. I think Swyx understands this a lot more deeply than the folks discussing it elsewhere, so I recommend his thread here over a lot of the branched ones. As one of the
Claude Code: no evals [well known code agent company]: no evals [well known code agent company 2]: kinda halfassed evals [leading vibe coding company]: no evals [ceo of company selling you evals]: mmmmm yess all my top customers do evals, you should do evals [vc's in love
Sam is talking about personality here... but personality isn’t just what intelligence sounds like—it is *part* of intelligence. It’s the organizing system that filters, prioritizes, and directs cognition. So his quote is a pretty different position from just a few years ago.