
Jan P. Harries
@jphme
Followers: 1K · Following: 4K · Media: 168 · Statuses: 1K
Co-Founder & CEO @ ellamind / #DiscoResearch / Retweets & favs are stuff I find interesting, not endorsements
Düsseldorf, Germany
Joined March 2009
Live-tweeting the most interesting insights from @Meta's new Llama 3 paper. 1. How did they arrive at a 405B model trained with ~15T tokens? "Extrapolation of the resulting scaling law to 3.8 × 10²⁵ FLOPs suggests training a 402B parameter model on 16.55T tokens." 👇🧵
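For intuition, a back-of-the-envelope version of that extrapolation, using the common C ≈ 6·N·D approximation for training FLOPs. The paper fits its own scaling law, so this heuristic won't reproduce its numbers exactly; this is just a rough sanity check:

```python
# Rough check of the Llama 3 compute budget via the common heuristic
# C ≈ 6 * N * D (training FLOPs ≈ 6 x parameters x tokens).
# This is NOT the paper's fitted scaling law, just an approximation.

C = 3.8e25         # compute budget in FLOPs (from the paper)
N = 402e9          # compute-optimal parameter count suggested by the fit

D = C / (6 * N)    # implied number of training tokens
print(f"~{D / 1e12:.2f}T tokens")  # ≈ 15.75T, in the ballpark of the paper's 16.55T
```

The gap between ~15.75T and 16.55T comes from the paper using its own fitted law rather than the 6ND rule of thumb; the shipped model then rounded to 405B parameters and ~15T tokens.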
Is embedding-based RAG (and vector DBs) becoming outdated technology as quickly as it gained popularity? 🤔
We (1) show this theoretically, AND (2) show this holds empirically for vectors that are directly optimized on the test set (freely parameterized). This means it doesn't matter what your training data/model is: embeddings can't solve these tasks without a large enough embedding dimension!
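A minimal sketch of what "freely parameterized" means in point (2): skip the encoder entirely, optimize query/document vectors directly against an arbitrary binary relevance pattern, and watch how well they can fit it as the embedding dimension grows. The hinge loss and all names here are my own illustration, not the paper's code:

```python
import torch

def fit_free_embeddings(rel: torch.Tensor, dim: int,
                        steps: int = 2000, lr: float = 0.05) -> float:
    """Optimize free query/doc vectors (no encoder, no training data)
    to reproduce a binary relevance matrix via dot-product scores."""
    nq, nd = rel.shape
    Q = torch.randn(nq, dim, requires_grad=True)
    D = torch.randn(nd, dim, requires_grad=True)
    opt = torch.optim.Adam([Q, D], lr=lr)
    target = rel.float() * 2 - 1              # relevant -> +1, irrelevant -> -1
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.relu(1 - target * (Q @ D.T)).mean()  # hinge loss
        loss.backward()
        opt.step()
    with torch.no_grad():                     # fraction of pairs scored correctly
        return ((Q @ D.T > 0) == rel).float().mean().item()

torch.manual_seed(0)
rel = torch.rand(64, 256) < 0.1               # an arbitrary relevance pattern
for dim in (4, 16, 64, 256):
    print(f"dim={dim:4d}  fit accuracy={fit_free_embeddings(rel, dim):.3f}")
```

Because the vectors are optimized directly on the test pattern, any remaining failure to fit can only be blamed on the embedding dimension, which is exactly the point of the empirical argument.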
Models know they're living in a simulation nowadays.
@woj_zaremba @AnthropicAI Really cool to see this! I hope this becomes standard practice. I feel like the main update for me from this research so far is that situational awareness seems to be reaching an important threshold:
Based on the cool post by @arithmoquine on LLMs drawing maps, I vibecoded a related eval. This is frontier stuff! 🤯 GPT-5 is leading the pack with 29% (5M reasoning tokens), Sonnet gets 5% (with 12k tokens) and Mistral Small 24B gets a full 1% 🧐. More details to follow.
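Details of the eval aren't out yet, so purely as an illustration of what a map-drawing eval could look like: ask the model to place cities as lat/lon pairs and score the fraction placed within a distance tolerance. The prompt, city list, tolerance, and scoring rule below are all my assumptions, not the actual eval:

```python
import json, math

# Hypothetical harness, not the real eval: placement accuracy for cities.
GROUND_TRUTH = {                      # illustrative subset
    "Berlin": (52.52, 13.405),
    "Madrid": (40.417, -3.704),
    "Warsaw": (52.230, 21.011),
}
PROMPT = ("Return JSON mapping each city to [latitude, longitude]: "
          + ", ".join(GROUND_TRUTH))

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def score(model_json: str, tol_km: float = 100.0) -> float:
    """Fraction of cities placed within tol_km of their true location."""
    placed = json.loads(model_json)
    hits = sum(haversine_km(tuple(placed[c]), GROUND_TRUTH[c]) <= tol_km
               for c in GROUND_TRUTH if c in placed)
    return hits / len(GROUND_TRUTH)

# score(call_model(PROMPT))  # call_model is whatever client you use
```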
RT @jphme: GPT-5 is worse than GPT-4o 😳, at least for some writing tasks in German (and probably also other languages) 👇 https://t…
@ellamindAI what are your first impressions? @scaling01 @natolambert @bjoern_pl @WolframRvnwlf @rasdani_ @xlr8harder @JagersbergKnut
Disclaimer: These are very early impressions on small sample sizes. All tests were run via @ellamindAI's Elluminate platform 😊.
We ran some of the same vibechecks as with the gpt-oss releases a few days ago (see @bjoern_pl's thread). GPT-5 makes similar errors to its open-source cousin and sometimes outputs worse German than its predecessors. Does this show the limitations of synthetic pre-training data?
Results for our MMLU-ProX-DE-openended variant (free-form answering) are quite impressive though 👇 My current verdict: gpt-oss is great at benchmarks and STEM tasks (e.g. raw reasoning power) but also hallucinates, writes badly, and can't write good non-English.
gpt-oss 120B is very blatantly incapable of producing linguistically correct German text. 🧵
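For context on what an "openended variant" of a multiple-choice benchmark means in practice: roughly, the answer options are withheld and the free-form response is graded against the gold answer, e.g. with an LLM judge. A minimal sketch, where the function names and judge prompt are my assumptions, not the actual Elluminate pipeline:

```python
# Sketch: turn a multiple-choice item into a free-form one and grade it
# with an LLM judge. ask_model / ask_judge are placeholders for whatever
# model clients are used; this is not the actual evaluation code.

def to_open_ended(item: dict) -> str:
    # Drop the answer options, keep only the question stem.
    return item["question"]

JUDGE_TEMPLATE = (
    "Question: {q}\nReference answer: {gold}\nCandidate answer: {cand}\n"
    "Does the candidate answer match the reference in substance? "
    "Reply with exactly YES or NO."
)

def grade(item: dict, ask_model, ask_judge) -> bool:
    cand = ask_model(to_open_ended(item))
    verdict = ask_judge(JUDGE_TEMPLATE.format(
        q=item["question"],
        gold=item["options"][item["answer_idx"]],
        cand=cand))
    return verdict.strip().upper().startswith("YES")
```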
More details on the disappointing multilingual performance of gpt-oss. Worse than Llama 3.3 70B! Inference bugs, or a spiky, almost monolingual model? 🤔👇
The model not only fails to produce anything worth reading, but also makes egregious spelling errors. Examples follow. The only viable conclusion (if the inference code isn't buggy) imo is that the model was explicitly not trained on German, likely only on English.
RT @jphme: @flozi00 @ellamindAI On one of our German writing benchmarks, much worse than Kimi 😕 (cc @bjoern_pl).
Hopefully the wait was worth it, lfg 🙌 (and I still believe it's something personally important to @sama, having heard him talk about it).
this'll be a big week for American momentum with open models; multiple things are going to start to fall into place.
But no clear pattern with regard to task difficulty, which was estimated as P4 < P2 < P5 < P3 < P1 ≪ P6 (by @rfurmaniak); both submissions are extremely strong and it's hard to judge quality differences.