Björn Plüster

@bjoern_pl

Followers
603
Following
448
Media
21
Statuses
178

Founder and CTO of ellamind. LLM and open-source enthusiast. @ellamindAI, @DiscoResearchAI

Joined September 2023
@bjoern_pl
Björn Plüster
2 months
Evals and test cases for specific tools and tasks are exactly what is needed to engineer reliable agents. Not necessary for a first PoC, but once you want to iterate on prompts and tool definitions, combine multiple tasks, or choose the right model, they become essential.
@AnthropicAI
Anthropic
2 months
New on the Anthropic Engineering blog: writing effective tools for LLM agents. AI agents are only as powerful as the tools we give them. So how do we make those tools more effective? We share our best tips for developers:
1
0
2
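A minimal sketch of the kind of per-tool eval case described in the tweet above. This is my illustration, not anything from the thread: `run_agent` and the `search_docs` tool are hypothetical stand-ins for your own agent loop and tool definitions.

```python
from dataclasses import dataclass

@dataclass
class ToolEvalCase:
    prompt: str                # task handed to the agent
    expected_tool: str         # tool the agent should end up calling
    expected_substring: str    # must appear in the final answer

CASES = [
    ToolEvalCase(
        prompt="Where is rate limiting documented?",
        expected_tool="search_docs",
        expected_substring="rate limit",
    ),
]

def run_eval(run_agent) -> float:
    """Pass rate over all cases. `run_agent(prompt)` is assumed to return
    (final_answer: str, tool_calls: list[str])."""
    passed = 0
    for case in CASES:
        answer, tool_calls = run_agent(case.prompt)
        if case.expected_tool in tool_calls and case.expected_substring in answer.lower():
            passed += 1
    return passed / len(CASES)
```

A pass rate like this gives you a single number to compare prompt variants, tool descriptions, or models against each other.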
@ShayneRedford
Shayne Longpre
9 days
📢Thrilled to introduce ATLAS 🗺️: scaling laws beyond English, for pretraining, finetuning, and the curse of multilinguality. The largest public, multilingual scaling study to date—we ran 774 exps (10M-8B params, 400+ languages) to answer: 🌍Are scaling laws different by
6
37
132
@bjoern_pl
Björn Plüster
10 days
If you're in Germany and you can ship - dm me
@rauchg
Guillermo Rauch
11 days
Learn to ship. Shipping is a skill distinct from coding. Shipping is designing, coding, QAing, story-telling, teaching, marketing, selling, pivoting, iterating… It used to be that coding dominated in importance because of coding ability scarcity. AI will push you to go further.
0
0
0
@bjoern_pl
Björn Plüster
11 days
New variant of gpt-5-thinking-mini? Note I did not ask for any specific model in my prompt, so maybe it's from the system prompt?
1
0
0
@bjoern_pl
Björn Plüster
11 days
OpenAI accidentally leaking access to new alpha model in chatgpt? 🤔 "chatgpt_alpha_model_external_access_reserved_gate_13" Anything special I should test?
1
0
0
@bjoern_pl
Björn Plüster
15 days
If you are in Germany and have a taste for evals - dm me!
@garrytan
Garry Tan
16 days
Prompting and evals will be everything
Prompting is agency
Evals is taste
0
0
2
@garrytan
Garry Tan
16 days
Prompting and evals will be everything
Prompting is agency
Evals is taste
222
166
2K
@bjoern_pl
Björn Plüster
20 days
entropix but with benchmarks to back it up 👀
@aakaran31
Aayush Karan
20 days
We found a new way to get language models to reason. 🤯 No RL, no training, no verifiers, no prompting. ❌ With better sampling, base models can achieve single-shot reasoning on par with (or better than!) GRPO while avoiding its characteristic loss in generation diversity.
0
0
2
@bjoern_pl
Björn Plüster
24 days
Come find out more about the tech behind why @bfl_ml is targeting a $4B valuation 👇
@jphme
Jan P. Harries
24 days
AIDev 5, featuring talks by @Nils_Reimers (@cohere, on the future of search for AI/agents), Stephen Batifol of @bfl_ml (on FLUX's image editing capabilities) and others, will take place on Wed, Oct. 29th in Frankfurt - register now, it's free 👇
0
0
1
@bjoern_pl
Björn Plüster
1 month
me when prompting claude
@wilsonhou
wilson hou
1 month
details are care @pacocoursey @raunofreiberg
0
0
1
@bjoern_pl
Björn Plüster
1 month
@OpenAI Also they released a subset on huggingface: https://t.co/FNwABxtbNS Very cool 😎
0
0
0
@bjoern_pl
Björn Plüster
1 month
Kudos to @OpenAI again for releasing the results of interesting and valuable benchmarks even if their models don't top the leaderboard. Love the integrity!
1
0
0
@bjoern_pl
Björn Plüster
1 month
Claude Opus outperforms GPT-5 high by a significant margin on diverse economically valuable tasks. Honestly the gap between them comes as a surprise to me, but it also shows how important Claude's more nuanced, less robotic personality is for many tasks.
1
0
2
@bjoern_pl
Björn Plüster
2 months
Evals don't have to slow you down - they can drastically improve pace, give devs a hill to climb and reduce failures in production.
0
0
1
@bjoern_pl
Björn Plüster
2 months
Our MCP and the agent itself will be key components enabling you to ship quality AI products. Building good evals at the speed of your AI development is key to reducing drag while maintaining user trust.
1
0
0
@bjoern_pl
Björn Plüster
2 months
We dog-food our platform to improve the agent, and can use this to improve our product on both dimensions - evaluating complex agents and building the agent that helps you throughout the process.
1
0
0
@bjoern_pl
Björn Plüster
2 months
By running evals with realistic test cases on reproducible environments, we can find issues with our tools, reduce failures and ship with confidence, even in a rapidly developing product with many moving parts.
1
0
0
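One way to read "realistic test cases on reproducible environments" - this is my sketch, not ellamind's implementation - is to pin the tool backend and decoding settings so a failing case replays identically. The `run_agent(prompt, tools=..., temperature=...)` signature and the `search_docs` fixture are assumptions for illustration.

```python
import random

def make_fixture_backend(seed: int = 0):
    """Deterministic stand-in for the real search backend, so failures
    reproduce identically from run to run."""
    rng = random.Random(seed)
    docs = ["rate limits: 100 requests/min", "auth: use bearer tokens"]

    def search_docs(query: str) -> str:
        # Toy keyword retrieval; falls back deterministically via the seeded RNG.
        hits = [d for d in docs if any(w in d for w in query.lower().split())]
        return hits[0] if hits else rng.choice(docs)

    return search_docs

def run_reproducible_eval(run_agent, cases, seed: int = 0):
    """Assumes run_agent(prompt, tools=..., temperature=...) -> (answer, tool_calls)."""
    search_docs = make_fixture_backend(seed)
    results = []
    for case in cases:
        answer, tool_calls = run_agent(
            case.prompt,
            tools={"search_docs": search_docs},
            temperature=0.0,  # greedy decoding for repeatability
        )
        results.append((case.prompt, answer, tool_calls))
    return results
```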
@bjoern_pl
Björn Plüster
2 months
While working on the agent that will help users navigate our platform, build reliable evals and analyze experiments, we got to a cool first version super quickly, but getting it to work reliably across iteration cycles of the platform and the agent has proven difficult.
1
0
0
@bjoern_pl
Björn Plüster
2 months
All of these behaviors can be explained as subtle artifacts of imperfect rewards during RL training 🔎 Inline imports: likely a scaffold thing (files are read in chunks so edits are done where the model has read the file) but probably also a form of turn-reduction. If you can
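A toy illustration of the inline-import pattern mentioned above (my example, not from the thread): the coding agent edits the one function it has just read and drops the import there, instead of hoisting it to module level.

```python
# Pattern the agent tends to produce: the import lives inside the
# function it just edited.
def parse_timestamp(raw: str):
    from datetime import datetime  # inline import added along with the edit
    return datetime.fromisoformat(raw)

# What a reviewer usually expects: the import hoisted to the top of the module.
from datetime import datetime

def parse_timestamp_hoisted(raw: str) -> datetime:
    return datetime.fromisoformat(raw)
```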
@bjoern_pl
Björn Plüster
2 months
‼️PSA on common bad-code patterns I've come across from codex / claude code. Keep an eye out for these to avoid getting shamed in code review.
0
2
5
@bjoern_pl
Björn Plüster
2 months
Have you also come across these? Are there any other recurring failure modes you've seen?
1
1
4