Björn Plüster
@bjoern_pl
603 Followers · 448 Following · 21 Media · 178 Statuses
Founder and CTO of ellamind. LLM and open-source enthusiast. @ellamindAI, @DiscoResearchAI
Joined September 2023
Evals and test cases for specific tools and tasks are exactly what's needed to engineer reliable agents. Not necessary for a first PoC, but once you want to iterate on prompts and tool definitions, combine multiple tasks, or choose the right model, they become essential.
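A minimal sketch of what tool-level eval cases can look like (all names here are hypothetical, not from the thread): each case pairs an input with a check on the tool's output, so changes to prompts or tool definitions can be re-scored automatically.

```python
def search_docs(query: str) -> list[str]:
    # Stand-in for a real agent tool; returns matching doc titles.
    corpus = ["Getting started", "API reference", "Billing FAQ"]
    return [d for d in corpus if query.lower() in d.lower()]

# Each eval case: an input plus a check on the tool's output.
EVAL_CASES = [
    {"input": "api", "check": lambda out: "API reference" in out},
    {"input": "billing", "check": lambda out: len(out) == 1},
    {"input": "zzz", "check": lambda out: out == []},  # graceful empty result
]

def run_evals() -> float:
    # Pass-rate across all cases; re-run after every tool/prompt change.
    passed = sum(case["check"](search_docs(case["input"])) for case in EVAL_CASES)
    return passed / len(EVAL_CASES)

print(run_evals())
```

Once the tool definition changes, the same cases catch regressions immediately.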
New on the Anthropic Engineering blog: writing effective tools for LLM agents. AI agents are only as powerful as the tools we give them. So how do we make those tools more effective? We share our best tips for developers:
📢Thrilled to introduce ATLAS 🗺️: scaling laws beyond English, for pretraining, finetuning, and the curse of multilinguality. The largest public multilingual scaling study to date: we ran 774 exps (10M-8B params, 400+ languages) to answer: 🌍Are scaling laws different by
If you're in Germany and you can ship - DM me
Learn to ship. Shipping is a skill distinct from coding. Shipping is designing, coding, QAing, story-telling, teaching, marketing, selling, pivoting, iterating… It used to be that coding dominated in importance because of coding ability scarcity. AI will push you to go further.
New variant of gpt-5-thinking-mini? Note I did not ask for any specific model in my prompt, so maybe it's from the system prompt?
OpenAI accidentally leaking access to new alpha model in chatgpt? 🤔 "chatgpt_alpha_model_external_access_reserved_gate_13" Anything special I should test?
Prompting and evals will be everything
Prompting is agency
Evals is taste
entropix but with benchmarks to back it up 👀
We found a new way to get language models to reason. 🤯 No RL, no training, no verifiers, no prompting. ❌ With better sampling, base models can achieve single-shot reasoning on par with (or better than!) GRPO while avoiding its characteristic loss in generation diversity.
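The quoted thread doesn't include code, but the entropix-style idea of gating decoding on the model's uncertainty can be sketched roughly like this (a toy illustration, assuming entropy-gated token selection; threshold and names are invented): decode greedily when the next-token distribution is confident, and sample when it is uncertain.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy (nats) of the next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def sample_token(logits, entropy_threshold=1.0, rng=random.Random(0)):
    # Low entropy -> model is confident -> take the argmax.
    # High entropy -> model is uncertain -> sample from the distribution.
    probs = softmax(logits)
    if entropy(probs) < entropy_threshold:
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

print(sample_token([10.0, 0.0, 0.0]))  # confident: greedy argmax -> 0
```

This is only the gating skeleton; the actual method in the quoted work may differ substantially.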
Come find out more about the tech behind why @bfl_ml is targeting a $4B valuation 👇
AIDev 5, featuring talks by @Nils_Reimers (@cohere; on the future of search for AI/agents), Stephen Batifol of @bfl_ml (on FLUX's image editing capabilities), and others, will take place on Wed, Oct 29th in Frankfurt - register now, it's free 👇
me when prompting claude
@OpenAI Also they released a subset on huggingface: https://t.co/FNwABxtbNS Very cool 😎
Kudos to @OpenAI again for releasing the results of interesting and valuable benchmarks even if their models don't top the leaderboard. Love the integrity!
Claude Opus outperforms GPT-5 high by a significant margin on diverse, economically valuable tasks. Honestly, the size of the gap surprises me, but it also shows how Claude's more nuanced, less robotic personality is so important for many tasks.
Evals don't have to slow you down - they can drastically improve pace, give devs a hill to climb, and reduce failures in production.
Our MCP and the agent itself will be key components enabling you to ship quality AI products. Building good evals at the speed of your AI development is key to reducing drag while maintaining user trust.
We dog-food our platform to improve the agent, and we can use this to improve our product on both dimensions: evaluating complex agents and building the agent that helps you throughout the process.
By running evals with realistic test cases on reproducible environments, we can find issues with our tools, reduce failures, and ship with confidence, even in a rapidly developing product with many moving parts.
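One way to sketch the "reproducible environments" idea (a hypothetical illustration, not the platform's actual design): give every eval run a fresh working directory and a fixed seed, so the same case replays identically across runs.

```python
import pathlib
import random
import tempfile

def make_env(seed: int) -> dict:
    # Fresh working dir + fixed RNG = a reproducible run (names invented).
    workdir = pathlib.Path(tempfile.mkdtemp(prefix=f"eval-{seed}-"))
    return {"rng": random.Random(seed), "workdir": workdir}

def run_case(env: dict, case_id: str) -> str:
    # Stand-in for an agent run; depends only on the env's seeded state.
    token = env["rng"].randint(0, 9999)
    out = env["workdir"] / f"{case_id}.txt"
    out.write_text(f"result-{token}")
    return out.read_text()

# Same seed -> identical output, even though each env is isolated.
a = run_case(make_env(42), "case1")
b = run_case(make_env(42), "case1")
assert a == b
```

With determinism in place, a failing case is a bug report you can replay, not a flake.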
While working on the agent that will help users navigate our platform, build reliable evals, and analyze experiments, we got to a cool first version super quickly, but getting it to work reliably across iteration cycles of the platform and the agent has proven difficult.
All of these behaviors can be explained as subtle artifacts of imperfect rewards during RL training 🔎 Inline imports: likely a scaffold thing (files are read in chunks, so edits are made where the model has read the file), but probably also a form of turn-reduction. If you can
‼️PSA on common modes of bad code that codex / claude code produce that I've come across. Keep an eye out for these patterns to avoid getting shamed in code review.
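The "inline imports" pattern mentioned in the thread looks roughly like this (a hypothetical illustration; the function names are invented):

```python
# Pattern agent-generated edits often produce: the import lands inside
# the edited function instead of at the top of the module.
def parse_timestamp(raw: str):
    from datetime import datetime  # inline import, added at the edit site
    return datetime.fromisoformat(raw)

# Conventional form: imports live at module top level.
from datetime import datetime

def parse_timestamp_clean(raw: str):
    return datetime.fromisoformat(raw)
```

Both run identically; the inline version just accumulates duplicated imports scattered through the file as edits pile up.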
Have you also come across these? Are there any other recurring failure modes you've seen?