Björn Plüster
@bjoern_pl
603 Followers · 448 Following · 21 Media · 178 Statuses
Founder and CTO of ellamind. LLM and open-source enthusiast. @ellamindAI, @DiscoResearchAI
Joined September 2023
Evals and test cases for specific tools and tasks are exactly what's needed to engineer reliable agents. Not necessary for a first PoC, but once you want to iterate on prompts and tool definitions, combine multiple tasks, or choose the right model, they become essential.
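A minimal sketch of what tool-level eval cases can look like (all names here are hypothetical, not from the thread): each case pairs an input with a check on the tool's output, so changes to prompts or tool definitions can be re-scored automatically.

```python
def search_docs(query: str) -> list[str]:
    # Stand-in for a real agent tool; returns matching doc titles.
    corpus = ["Getting started", "API reference", "Billing FAQ"]
    return [d for d in corpus if query.lower() in d.lower()]

# Each eval case: an input plus a check on the tool's output.
EVAL_CASES = [
    {"input": "api", "check": lambda out: "API reference" in out},
    {"input": "billing", "check": lambda out: len(out) == 1},
    {"input": "zzz", "check": lambda out: out == []},  # graceful empty result
]

def run_evals() -> float:
    # Pass-rate across all cases; re-run after every tool/prompt change.
    passed = sum(case["check"](search_docs(case["input"])) for case in EVAL_CASES)
    return passed / len(EVAL_CASES)

print(run_evals())
```

Once the tool definition changes, the same cases catch regressions immediately.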
New on the Anthropic Engineering blog: writing effective tools for LLM agents. AI agents are only as powerful as the tools we give them. So how do we make those tools more effective? We share our best tips for developers:
📢Thrilled to introduce ATLAS 🗺️: scaling laws beyond English, for pretraining, finetuning, and the curse of multilinguality. The largest public multilingual scaling study to date: we ran 774 exps (10M-8B params, 400+ languages) to answer: 🌍Are scaling laws different by
If you're in Germany and you can ship - DM me
Learn to ship. Shipping is a skill distinct from coding. Shipping is designing, coding, QAing, story-telling, teaching, marketing, selling, pivoting, iterating… It used to be that coding dominated in importance because of coding ability scarcity. AI will push you to go further.
New variant of gpt-5-thinking-mini? Note I did not ask for any specific model in my prompt, so maybe it's from the system prompt?
OpenAI accidentally leaking access to new alpha model in chatgpt? 🤔 "chatgpt_alpha_model_external_access_reserved_gate_13" Anything special I should test?
Prompting and evals will be everything
Prompting is agency
Evals is taste
entropix but with benchmarks to back it up 👀
We found a new way to get language models to reason. 🤯 No RL, no training, no verifiers, no prompting. ❌ With better sampling, base models can achieve single-shot reasoning on par with (or better than!) GRPO while avoiding its characteristic loss in generation diversity.
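The quoted thread doesn't include code, but the entropix-style idea of gating decoding on the model's uncertainty can be sketched roughly like this (a toy illustration, assuming entropy-gated token selection; threshold and names are invented): decode greedily when the next-token distribution is confident, and sample when it is uncertain.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy (nats) of the next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_token(logits, entropy_threshold=1.0, rng=random.Random(0)):
    # Low entropy -> model is confident -> take the argmax.
    # High entropy -> model is uncertain -> sample from the distribution.
    probs = softmax(logits)
    if entropy(probs) < entropy_threshold:
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(sample_token([10.0, 0.0, 0.0]))  # confident: greedy argmax -> 0
```

This is only the gating skeleton; the actual method in the quoted work may differ substantially.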
Come find out more about the tech behind why @bfl_ml is targeting a $4B valuation 👇
AIDev 5, featuring talks by @Nils_Reimers (@cohere; on the future of search for AI/agents), Stephen Batifol of @bfl_ml (on FLUX's image editing capabilities), and others, will take place on Wed, Oct 29th in Frankfurt - register now, it's free 👇
me when prompting claude
@OpenAI Also they released a subset on huggingface: https://t.co/FNwABxtbNS Very cool 😎
Kudos to @OpenAI again for releasing the results of interesting and valuable benchmarks even if their models don't top the leaderboard. Love the integrity!
Claude Opus outperforms GPT-5 high by a significant margin on diverse, economically valuable tasks. Honestly, the size of the gap surprises me, but it also shows how Claude's more nuanced, less robotic personality is so important for many tasks.
Evals don't have to slow you down - they can drastically improve pace, give devs a hill to climb, and reduce failures in production.
Our MCP and the agent itself will be key components enabling you to ship quality AI products. Building good evals at the speed of your AI development is key to reducing drag while maintaining user trust.
We dog-food our platform to improve the agent, and we can use this to improve our product on both dimensions: evaluating complex agents and building the agent that helps you throughout the process.
By running evals with realistic test cases on reproducible environments, we can find issues with our tools, reduce failures, and ship with confidence, even in a rapidly developing product with many moving parts.
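One way to sketch the "reproducible environments" idea (a hypothetical illustration, not the platform's actual design): give every eval run a fresh working directory and a fixed seed, so the same case replays identically across runs.

```python
import pathlib
import random
import tempfile

def make_env(seed: int) -> dict:
    # Fresh working dir + fixed RNG = a reproducible run (names invented).
    workdir = pathlib.Path(tempfile.mkdtemp(prefix=f"eval-{seed}-"))
    return {"rng": random.Random(seed), "workdir": workdir}

def run_case(env: dict, case_id: str) -> str:
    # Stand-in for an agent run; depends only on the env's seeded state.
    token = env["rng"].randint(0, 9999)
    out = env["workdir"] / f"{case_id}.txt"
    out.write_text(f"result-{token}")
    return out.read_text()

# Same seed -> identical output, even though each env is isolated.
a = run_case(make_env(42), "case1")
b = run_case(make_env(42), "case1")
assert a == b
```

With determinism in place, a failing case is a bug report you can replay, not a flake.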
While working on the agent that will help users navigate our platform, build reliable evals, and analyze experiments, we got to a cool first version super quickly, but getting it to work reliably across iteration cycles of the platform and the agent has proven difficult.
All of these behaviors can be explained as subtle artifacts of imperfect rewards during RL training 🔎 Inline imports: likely a scaffold thing (files are read in chunks, so edits are made where the model has read the file), but probably also a form of turn-reduction. If you can
‼️PSA on common modes of bad code that codex / claude code produce that I've come across. Keep an eye out for these patterns to avoid getting shamed in code review.
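The "inline imports" pattern mentioned in the thread looks roughly like this (a hypothetical illustration; the function names are invented):

```python
# Pattern agent-generated edits often produce: the import lands inside
# the edited function instead of at the top of the module.
def parse_timestamp(raw: str):
    from datetime import datetime  # inline import, added at the edit site
    return datetime.fromisoformat(raw)

# Conventional form: imports live at module top level.
from datetime import datetime

def parse_timestamp_clean(raw: str):
    return datetime.fromisoformat(raw)
```

Both run identically; the inline version just accumulates duplicated imports scattered through the file as edits pile up.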
Have you also come across these? Are there any other recurring failure modes you've seen?