
Jan P. Harries
@jphme
Followers: 1K · Following: 4K · Media: 168 · Statuses: 1K
Co-Founder & CEO @ ellamind / #DiscoResearch / Retweets & favs are stuff I find interesting, not endorsements
Düsseldorf, Germany
Joined March 2009
Live-tweeting the most interesting insights from @Meta's new Llama 3 paper. 1. How did they arrive at a 405B model trained with ~15T tokens? "Extrapolation of the resulting scaling law to 3.8 × 10²⁵ FLOPs suggests training a 402B parameter model on 16.55T tokens." 👇🧵
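For intuition, a back-of-the-envelope version of that extrapolation, using the common C ≈ 6·N·D approximation for training FLOPs. The paper fits its own scaling law, so this heuristic won't reproduce its numbers exactly; this is just a rough sanity check:

```python
# Rough check of the Llama 3 compute budget via the common heuristic
# C ≈ 6 * N * D (training FLOPs ≈ 6 x parameters x tokens).
# This is NOT the paper's fitted scaling law, just an approximation.

C = 3.8e25         # compute budget in FLOPs (from the paper)
N = 402e9          # compute-optimal parameter count suggested by the fit

D = C / (6 * N)    # implied number of training tokens
print(f"~{D / 1e12:.2f}T tokens")  # ≈ 15.75T, in the ballpark of the paper's 16.55T
```

The gap between ~15.75T and 16.55T comes from the paper using its own fitted law rather than the 6ND rule of thumb; the shipped model then rounded to 405B parameters and ~15T tokens.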
Is embedding-based RAG (and vector DBs) becoming outdated technology as quickly as it gained popularity? 🤔
We (1) show this theoretically, AND (2) show this holds empirically for vectors that are directly optimized on the test set (freely parameterized). This means it doesn't matter what your training data/model is: embeddings can't solve these tasks without a large enough embedding dimension!
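A minimal sketch of what "freely parameterized" means in point (2): skip the encoder entirely, optimize query/document vectors directly against an arbitrary binary relevance pattern, and watch how well they can fit it as the embedding dimension grows. The hinge loss and all names here are my own illustration, not the paper's code:

```python
import torch

def fit_free_embeddings(rel: torch.Tensor, dim: int,
                        steps: int = 2000, lr: float = 0.05) -> float:
    """Optimize free query/doc vectors (no encoder, no training data)
    to reproduce a binary relevance matrix via dot-product scores."""
    nq, nd = rel.shape
    Q = torch.randn(nq, dim, requires_grad=True)
    D = torch.randn(nd, dim, requires_grad=True)
    opt = torch.optim.Adam([Q, D], lr=lr)
    target = rel.float() * 2 - 1              # relevant -> +1, irrelevant -> -1
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.relu(1 - target * (Q @ D.T)).mean()  # hinge loss
        loss.backward()
        opt.step()
    with torch.no_grad():                     # fraction of pairs scored correctly
        return ((Q @ D.T > 0) == rel).float().mean().item()

torch.manual_seed(0)
rel = torch.rand(64, 256) < 0.1               # an arbitrary relevance pattern
for dim in (4, 16, 64, 256):
    print(f"dim={dim:4d}  fit accuracy={fit_free_embeddings(rel, dim):.3f}")
```

Because the vectors are optimized directly on the test pattern, any remaining failure to fit can only be blamed on the embedding dimension, which is exactly the point of the empirical argument.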
Models know they're living in a simulation nowadays.
@woj_zaremba @AnthropicAI Really cool to see this! I hope this becomes standard practice. I feel like the main update for me from this research so far is that situational awareness seems to be reaching an important threshold:
Based on the cool post by @arithmoquine on LLMs drawing maps, I vibecoded a related eval. This is frontier stuff! 🤯 GPT-5 is leading the pack with 29% (5M reasoning tokens), Sonnet gets 5% (with 12k tokens) and Mistral Small 24B gets a full 1% 🧐. More details to follow.
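Details of the eval aren't out yet, so purely as an illustration of what a map-drawing eval could look like: ask the model to place cities as lat/lon pairs and score the fraction placed within a distance tolerance. The prompt, city list, tolerance, and scoring rule below are all my assumptions, not the actual eval:

```python
import json, math

# Hypothetical harness, not the real eval: placement accuracy for cities.
GROUND_TRUTH = {                      # illustrative subset
    "Berlin": (52.52, 13.405),
    "Madrid": (40.417, -3.704),
    "Warsaw": (52.230, 21.011),
}
PROMPT = ("Return JSON mapping each city to [latitude, longitude]: "
          + ", ".join(GROUND_TRUTH))

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def score(model_json: str, tol_km: float = 100.0) -> float:
    """Fraction of cities placed within tol_km of their true location."""
    placed = json.loads(model_json)
    hits = sum(haversine_km(tuple(placed[c]), GROUND_TRUTH[c]) <= tol_km
               for c in GROUND_TRUTH if c in placed)
    return hits / len(GROUND_TRUTH)

# score(call_model(PROMPT))  # call_model is whatever client you use
```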
RT @jphme: GPT-5 is worse than GPT-4o 😳, at least for some writing tasks in German (and probably also other languages) 👇 https://t…
@ellamindAI what are your first impressions? @scaling01 @natolambert @bjoern_pl @WolframRvnwlf @rasdani_ @xlr8harder @JagersbergKnut
Disclaimer: These are very early impressions on small sample sizes. All tests were run via @ellamindAI's Elluminate platform 😊.
We ran some of the same vibechecks as with the gpt-oss releases a few days ago (see @bjoern_pl's thread). GPT-5 makes similar errors to its open-source cousin and sometimes outputs worse German than its predecessors. Does this show the limitations of synthetic pre-training data?
Results for our MMLU-ProX-DE-openended variant (free-form answering) are quite impressive though 👇 My current verdict: gpt-oss is great at benchmarks and STEM tasks (e.g. raw reasoning power) but also hallucinates, writes badly, and can't write good non-English.
gpt-oss 120B is very blatantly incapable of producing linguistically correct German text. 🧵
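For context on what an "openended variant" of a multiple-choice benchmark means in practice: roughly, the answer options are withheld and the free-form response is graded against the gold answer, e.g. with an LLM judge. A minimal sketch, where the function names and judge prompt are my assumptions, not the actual Elluminate pipeline:

```python
# Sketch: turn a multiple-choice item into a free-form one and grade it
# with an LLM judge. ask_model / ask_judge are placeholders for whatever
# model clients are used; this is not the actual evaluation code.

def to_open_ended(item: dict) -> str:
    # Drop the answer options, keep only the question stem.
    return item["question"]

JUDGE_TEMPLATE = (
    "Question: {q}\nReference answer: {gold}\nCandidate answer: {cand}\n"
    "Does the candidate answer match the reference in substance? "
    "Reply with exactly YES or NO."
)

def grade(item: dict, ask_model, ask_judge) -> bool:
    cand = ask_model(to_open_ended(item))
    verdict = ask_judge(JUDGE_TEMPLATE.format(
        q=item["question"],
        gold=item["options"][item["answer_idx"]],
        cand=cand))
    return verdict.strip().upper().startswith("YES")
```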
More details on the disappointing multilingual performance of gpt-oss. Worse than Llama 3.3 70B! Inference bugs, or a spiky, almost monolingual model? 🤔👇
The model not only fails to produce anything worth reading, but also makes egregious spelling errors. Examples follow. The only viable conclusion (if the inference code isn't buggy) imo is that the model was explicitly not trained on German, likely only on English.
RT @jphme: @flozi00 @ellamindAI On one of our German writing benchmarks, much worse than Kimi 😕 (cc @bjoern_pl).
Hopefully the wait was worth it, lfg 🙌 (and I still believe it's something personally important to @sama, having heard him talk about it).
this'll be a big week for American momentum with open models; multiple things are going to start to fall into place.
But no clear pattern with regard to task difficulty, which was estimated as P4 < P2 < P5 < P3 < P1 ≪ P6 (by @rfurmaniak); both submissions are extremely strong and it's hard to judge quality differences.