Jan P. Harries

@jphme

Followers 1K · Following 4K · Media 168 · Statuses 1K

Co-Founder & CEO @ ellamind / #DiscoResearch / Retweets & favs are stuff I find interesting, not endorsements

Düsseldorf, Germany
Joined March 2009
@jphme
Jan P. Harries
1 year
Live-tweeting the most interesting insights from @Meta's new Llama 3 paper. 1. How did they arrive at a 405B model trained with ~15T tokens? "Extrapolation of the resulting scaling law to 3.8 × 10^25 FLOPs suggests training a 402B parameter model on 16.55T tokens." 👇🧵
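As a quick plausibility check, those numbers line up with the common C ≈ 6ND approximation for dense-transformer training compute (a rule of thumb, not Meta's exact accounting):

```python
# Rule-of-thumb training compute: C ~= 6 * N (params) * D (tokens).
N = 402e9      # parameters from the quoted extrapolation
D = 16.55e12   # training tokens
C = 6 * N * D
print(f"{C:.2e} FLOPs")  # ~3.99e25, the same order as the 3.8e25 budget
```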
@jphme
Jan P. Harries
3 days
Is embedding-based RAG (and are vector DBs) becoming outdated technology as quickly as they gained popularity? 🤔
@orionweller
Orion Weller
4 days
We (1) show this theoretically AND (2) show this holds empirically for vectors that are directly optimized on the test set (freely parameterized). This means your training data/model doesn't matter -- embeddings can't solve these tasks without a large enough embedding dim!
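For intuition on point (2), here is a minimal sketch (my illustration, not the paper's code) of what "freely parameterized" means: optimize query/document vectors directly against a target top-k relevance pattern and watch small embedding dims fail to realize it.

```python
# Free embeddings fitted straight to the "test set": each pair of docs must
# be the top-2 result for one query. Low dims can't realize every pattern.
import itertools
import torch

n_docs = 8
pairs = list(itertools.combinations(range(n_docs), 2))  # 28 queries
target = torch.zeros(len(pairs), n_docs)
for q, (i, j) in enumerate(pairs):
    target[q, i] = target[q, j] = 1.0  # relevant docs score 1, rest 0

def top2_accuracy(dim, steps=2000):
    """Fraction of queries whose fitted top-2 docs match the target."""
    Q = torch.randn(len(pairs), dim, requires_grad=True)
    D = torch.randn(n_docs, dim, requires_grad=True)
    opt = torch.optim.Adam([Q, D], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        ((Q @ D.T - target) ** 2).mean().backward()
        opt.step()
    top2 = (Q @ D.T).topk(2, dim=1).indices
    return sum(set(top2[q].tolist()) == set(p)
               for q, p in enumerate(pairs)) / len(pairs)

for dim in (2, 4, 16):
    print(dim, top2_accuracy(dim))  # accuracy climbs with embedding dim
```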
@jphme
Jan P. Harries
5 days
Models know they're living in a simulation nowadays.
@DKokotajlo
Daniel Kokotajlo
6 days
@woj_zaremba @AnthropicAI Really cool to see this! I hope this becomes standard practice. I feel like the main update for me from this research so far is that situational awareness seems to be reaching an important threshold.
@jphme
Jan P. Harries
21 days
Based on the cool post by @arithmoquine on LLMs drawing maps, I vibecoded a related eval. This is frontier stuff! 🤯 GPT-5 is leading the pack with 29% (5M reasoning tokens), Sonnet gets 5% (with 12k tokens), and Mistral Small 24B gets a full 1% 🧐. More details to follow.
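The thread doesn't spell out the scoring, so the following is a purely hypothetical sketch of an eval in this spirit: ask a model for city coordinates and count how many land within a distance tolerance (the city list and the 100 km threshold are my assumptions, not the actual eval):

```python
# Hypothetical geography eval: score model-predicted (lat, lon) placements
# against ground truth using great-circle distance.
import math

REFERENCE = {  # ground-truth coordinates (lat, lon)
    "Berlin": (52.52, 13.40),
    "Madrid": (40.42, -3.70),
    "Warsaw": (52.23, 21.01),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def score(predictions, tolerance_km=100):
    """Fraction of reference cities placed within tolerance_km."""
    hits = sum(haversine_km(REFERENCE[c], p) <= tolerance_km
               for c, p in predictions.items() if c in REFERENCE)
    return hits / len(REFERENCE)

print(score({"Berlin": (52.4, 13.5), "Madrid": (41.0, -3.9), "Warsaw": (50.0, 19.9)}))
```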
@jphme
Jan P. Harries
26 days
RT @jphme: GPT-5 is worse than GPT-4o 😳 ... at least for some writing tasks in German (and probably also other languages) 👇 https://t…
@jphme
Jan P. Harries
26 days
Disclaimer: these are very early impressions on small sample sizes. All tests were run via @ellamindAI's Elluminate platform 😊
@jphme
Jan P. Harries
26 days
However, similar to gpt-oss 120B, it performs excellently in "its domain", namely STEM-related tasks - even when challenged in a multilingual, open-ended format 👍
@jphme
Jan P. Harries
26 days
We ran some of the same vibe checks as with the gpt-oss releases a few days ago (see @bjoern_pl's thread). GPT-5 makes similar errors to its open-source cousin and partially outputs worse German than its predecessors. Does this show the limitations of synthetic pre-training data?
@jphme
Jan P. Harries
26 days
GPT-5 is worse than GPT-4o 😳 ... at least for some writing tasks in German (and probably also other languages) 👇
@jphme
Jan P. Harries
28 days
Results for our MMLU-ProX-DE open-ended variant (free-form answering) are quite impressive though 👇 My current verdict: gpt-oss is great at benchmarks and STEM tasks (e.g. raw reasoning power) but also hallucinates, writes badly, and can't write good non-English.
@bjoern_pl
Björn Plüster
28 days
gpt-oss 120B is very blatantly incapable of producing linguistically correct German text. 🧵
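For context, grading a free-form ("open-ended") variant of a multiple-choice benchmark typically works roughly like this sketch (an assumption about the general approach, not ellamind's exact pipeline): hide the answer options, collect a free-form answer, and let a judge model decide whether it matches the gold option.

```python
# Sketch of LLM-judged grading for open-ended benchmark variants.
JUDGE_PROMPT = """Question: {question}
Reference answer: {gold}
Candidate answer: {candidate}
Does the candidate answer express the same result as the reference?
Reply with exactly YES or NO."""

def grade(question, gold, candidate, judge_llm):
    """Return 1 if the judge accepts the free-form answer, else 0.
    `judge_llm` is any callable mapping a prompt string to a completion."""
    verdict = judge_llm(JUDGE_PROMPT.format(
        question=question, gold=gold, candidate=candidate))
    return int(verdict.strip().upper().startswith("YES"))
```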
@jphme
Jan P. Harries
28 days
More details on the disappointing multilingual performance of gpt-oss. Worse than Llama 3.3 70B! Inference bugs, or a spiky, almost monolingual model? 🤔👇
@bjoern_pl
Björn Plüster
28 days
The model not only fails to produce anything worth reading, but also makes egregious spelling errors. Examples follow. The only viable conclusion (if the inference code isn't buggy) imo is that the model was explicitly not trained on German - likely only on English.
@jphme
Jan P. Harries
28 days
RT @jphme: @flozi00 @ellamindAI On one of our German writing benchmarks it's much worse than Kimi 😕 (cc @bjoern_pl)
@jphme
Jan P. Harries
28 days
RT @ellamindAI: gpt-oss already being tested on Elluminate 👀🔥
@jphme
Jan P. Harries
28 days
Multilingual capabilities also seem to be lagging somewhat behind the proprietary models (although MMMLU is not very relevant imho), with even the 120B being worse than o3-mini on low reasoning.
@jphme
Jan P. Harries
28 days
The log-linear scaling between reasoning effort and GPQA results in @OpenAI's new OSS models looks almost too good to be true (and way better than e.g. for Qwen). They still have some secret sauce 🔥
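"Log-linear" here means accuracy rising roughly linearly in the log of reasoning tokens. A quick way to check how closely a series of runs tracks that line (the data points below are placeholders, not OpenAI's numbers):

```python
# Fit acc ~= a + b * log(tokens) and inspect the residuals.
import numpy as np

tokens = np.array([1e3, 4e3, 16e3, 64e3])    # reasoning tokens per question
acc    = np.array([0.45, 0.55, 0.65, 0.75])  # GPQA accuracy (illustrative)

b, a = np.polyfit(np.log(tokens), acc, 1)    # least-squares line in log-space
resid = acc - (a + b * np.log(tokens))
print(f"slope per log-token: {b:.3f}, max residual: {abs(resid).max():.3f}")
```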
@jphme
Jan P. Harries
28 days
Apache 2.0 and competitive performance ❤️ Knew it, thanks @sama & team!
@jphme
Jan P. Harries
1 month
Hopefully the wait was worth it, lfg 🙌 (and I still believe it's something personally important to @sama, having heard him talk about it).
@jphme
Jan P. Harries
1 month
Hopefully the wait was worth it, lfg 🙌 (and I still believe it's something personally important to @sama, having heard him talk about it).
@natolambert
Nathan Lambert
1 month
this'll be a big week for American momentum with open models, multiple things are going to start to fall into place.
@jphme
Jan P. Harries
1 month
But there's no clear pattern with regard to task difficulty, which was estimated (by @rfurmaniak) to be P4 < P2 < P5 < P3 < P1 ≪ P6. Both submissions are extremely strong, and quality differences are hard to judge.
@jphme
Jan P. Harries
1 month
Interesting pattern 👀: when models agreed, they genuinely recognized superior solutions regardless of origin. But on problems 1 & 5 we see classic "my work is better" bias - even in AI systems! This shows the importance of neutral third-party evaluation in AI benchmarks.
@jphme
Jan P. Harries
1 month
I asked them to focus on Explanatory Power, Elementarity and Creativity (as well as other relevant categories, if applicable) and to disregard formatting/(writing-)style differences. (Example verdict by Gemini 2.5 Pro for Problem 5; Solution 1 is OpenAI's) 👇
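A minimal sketch of this kind of origin-blind pairwise judging (my reconstruction of the general setup, not the exact prompts used): shuffle the two solutions so the judge never knows which system wrote which.

```python
# Origin-blind pairwise judging with randomized A/B order.
import random

RUBRIC = ("Compare the two solutions on explanatory power, elementarity and "
          "creativity; ignore formatting and writing-style differences. "
          "Answer with exactly A or B.")

def judge_pair(problem, sol_openai, sol_other, judge_llm, rng=random):
    """Return which system won; the judge only ever sees labels A and B."""
    flipped = rng.random() < 0.5
    first, second = (sol_other, sol_openai) if flipped else (sol_openai, sol_other)
    prompt = (f"{RUBRIC}\n\nProblem:\n{problem}\n\n"
              f"Solution A:\n{first}\n\nSolution B:\n{second}")
    picked_a = judge_llm(prompt).strip().upper().startswith("A")
    return "openai" if picked_a != flipped else "other"
```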