Björn Plüster

@bjoern_pl

Followers
565
Following
360
Media
8
Statuses
122

Founder and CTO of ellamind. LLM and open-source enthusiast. @ellamindAI, @DiscoResearchAI

Joined September 2023
@bjoern_pl
Björn Plüster
21 days
gpt-oss 120B is very blatantly incapable of producing linguistically correct German text. 🧵
79
33
743
@bjoern_pl
Björn Plüster
6 days
Here's the blog post for anyone who's interested. Also the link to the article by @bmftr_bund:
laion.ai
We proudly introduce LeoLM (Linguistically Enhanced Open Language Mod...
0
0
2
@bjoern_pl
Björn Plüster
6 days
Proud to have been able to contribute to bringing more GPUs to the EU 🇪🇺. If you're interested in putting these GPUs to good work, consider applying as a research engineer at @ellamindAI:
ellamind.jobs.personio.de
At ellamind, we don't just develop AI – we actively shape the future of Artificial Intelligence. As a cash-flow-positive startup, we have already earned the trust of renowned clients such as Deutsche...
1
0
3
@bjoern_pl
Björn Plüster
6 days
Its open-source release also happened to be what got me in contact with my amazing co-founder @jphme, who was at the time working on very similar ideas with his EM German models.
1
0
2
@bjoern_pl
Björn Plüster
6 days
LeoLM has since been an inspiration for many other projects (like our DiscoLM 8b, the @occiglot models, and more) and serves as a conceptual baseline for some ideas within the @OpenEuroLLM project to bring strong LLMs to all European languages.
1
0
1
@bjoern_pl
Björn Plüster
6 days
When I worked on this project, I had no references or publicly visible achievements and was still in my master's studies. Yet, @ChristophSchuh6 from @laion_ai had the intuition to trust in my idea and gave me the freedom and resources to pursue it, setting me up with.
1
0
1
@bjoern_pl
Björn Plüster
6 days
LeoLM is a project I am still super proud of to this day as one of the first successful attempts to scale up continued pre-training for language acquisition - a method widely used today for hundreds of other models and languages.
1
0
2
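For context, a minimal sketch (Python, Hugging Face transformers) of what continued pre-training for language acquisition can look like. This is illustrative only; the base checkpoint, dataset, and hyperparameters below are placeholders, not the actual LeoLM recipe:

```python
# Illustrative continued-pre-training sketch (NOT the LeoLM recipe).
# Assumptions: a Llama-style base checkpoint and German Wikipedia as a
# stand-in corpus; all names and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base checkpoint

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# German text corpus used as a stand-in for a large web-scale dataset.
raw = load_dataset("wikimedia/wikipedia", "20231101.de", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="german-cpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    learning_rate=2e-5,  # low LR: acquire German without forgetting the base model's abilities
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The core idea is simply to keep optimizing the causal-LM objective on target-language text at a low learning rate, instead of training a new model from scratch.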
@bjoern_pl
Björn Plüster
6 days
Nearly two years after release, my project LeoLM is being used as a strong justification for the expansion of federal compute funding in Germany. Goes to show how much impact open-source projects can have. Hell yeah @bmftr_bund - thanks for making projects like this possible! 🚀
1
0
6
@bjoern_pl
Björn Plüster
19 days
RT @jphme: GPT-5 is worse than GPT-4o 😳 ... at least for some writing tasks in German (and probably also other languages) 👇 https://t…
0
11
0
@bjoern_pl
Björn Plüster
20 days
I've said it before - Anthropic is the only frontier AI company whose models are trained and maintained with love. I wish there was more information on the evals/testing/decision process behind these kinds of changes. We all could learn a lot.
@AmandaAskell
Amanda Askell
20 days
We made some updates to Claude’s system prompt recently (developed in collaboration with Claude, of course). They aren’t set in stone and may be updated, but I’ll go through the current version of each and the reason behind it in this thread 🧵.
0
0
3
@bjoern_pl
Björn Plüster
21 days
RT @nikhilchandak29: New open-weights models came out from @OpenAI! On GPQA-Diamond, it shows strong performance but is not better than o…
0
14
0
@bjoern_pl
Björn Plüster
21 days
PS: All evals and vibe-checks were done on our LLM evaluation platform elluminate - more info on this soon. If evals, looking at the data, and analysis beyond generic benchmarks are something you're interested in working on, we're hiring (Germany-based):
2
1
18
@bjoern_pl
Björn Plüster
21 days
In summary, I see this as an exceptional release highlighting OpenAI's willingness to contribute to the open models space and showing how strong they actually are at model training. But it is also very clearly a model not up to their usual standards wrt. multilinguality or output.
1
2
27
@bjoern_pl
Björn Plüster
21 days
Overall, gpt-oss seems to perform exceptionally well on more classic benchmarks. More testing to come, but I can definitely see where the strong benchmark scores from their release come from.
1
1
24
@bjoern_pl
Björn Plüster
21 days
One highlight from this eval shows something I've seen in a bunch of responses: the model tends to be very brief, responding with single words or short sentences instead of explaining itself. Here, without any explanation whatsoever, it responds with a single word after 20k
1
1
20
@bjoern_pl
Björn Plüster
21 days
As to things the model is trained to do: on a generative variant of GPQA Diamond we're working on, gpt-oss 120B (high reasoning effort) outperforms Kimi K2 and Gemini 2.5 Pro while using ~8k tokens per response. Need to investigate this further.
3
1
25
@bjoern_pl
Björn Plüster
21 days
Aside from the grammatical errors, the actual content of the responses is bad. An example of the weirdness: "ich kann's kaum glauben: Mein neuestes Quantum‑Jump‑Device (aka „der Flitzer“, den ich eigentlich nur zum schnellen Kaffee holen benutzen wollte) hat mich mal wieder über den.
1
1
24
@bjoern_pl
Björn Plüster
21 days
For comparison, Kimi K2, DeepSeek V3, and Llama 3.3 70B all perform better, with ~60% of responses fully grammatically correct. Still far from perfect, but worth mentioning.
1
1
48
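elluminate isn't public, so purely as an illustration: a judge-based grammaticality check along these lines could produce a "fraction fully grammatically correct" number like the ~60% above. The judge model, prompt, and verdict parsing here are assumptions for the sketch, not the actual elluminate pipeline:

```python
# Illustrative LLM-as-judge grammaticality check (NOT the elluminate pipeline).
# Assumptions: the openai client with OPENAI_API_KEY set; judge model and
# prompt are placeholders chosen for this example.
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "You are a strict German copy editor. Reply with the single word CORRECT "
    "if the text is fully grammatically and orthographically correct German; "
    "otherwise reply INCORRECT followed by a list of the errors."
)

def is_fully_correct(text: str, judge_model: str = "gpt-4o") -> bool:
    """Ask the judge model whether a response is fully grammatically correct."""
    verdict = client.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("CORRECT")

def fraction_fully_correct(responses: list[str]) -> float:
    """Share of responses judged fully correct (cf. the ~60% figure above)."""
    return sum(is_fully_correct(r) for r in responses) / len(responses)
```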
@bjoern_pl
Björn Plüster
21 days
Here, the errors are less obvious but nonetheless bad, devolving into partial gibberish later on. An analysis from our evaluator: "The rhyme list includes “Schag” instead of “Schlag.” In section 6, phrases like “das gesamte Geschell geihnher” are nonsensical and appear to be
1
1
32
@bjoern_pl
Björn Plüster
21 days
In this example the first two sentences are already utterly terrible. Couldn't be bothered to read the rest after that. "Die Nacht war ungewöhnlich still, als der Bibliothekar Sebastian Krause im schummrigen Lesesaal das Licht auswählte. Ein einziges Lampensymbol flackerte; die
2
2
60
@bjoern_pl
Björn Plüster
21 days
The model not only fails to produce anything worth reading, but also makes egregious spelling errors. Examples follow. The only viable conclusion (if the inference code isn't buggy), imo, is that the model was explicitly not trained on German - likely only English.
2
1
85