Håvard Ihle @htihle X Profile

Håvard Ihle

@htihle

Followers

707

Following

77K

Media

84

Statuses

357

AI researcher, former cosmologist

https://t.co/2B3G37mZb9

Joined December 2010

Håvard Ihle

@htihle

5 months

WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track API costs and other metadata, which give more insight into the different models. The new results are shown in these two

Håvard Ihle

@htihle

10 months

Excited to share the results from WeirdML - a benchmark testing LLMs' ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learning from feedback.

8

12

96

Chase Brower

@ChaseBrowe32432

21 hours

@htihle In some sense it is precisely this (learning/improving from exploration efficiently) that really matters https://t.co/CFR572aadW

1

7

Håvard Ihle

@htihle

23 hours

One of the most striking things about gemini-3-pro is how much better it is with several iterations. It makes better use of the information from the previous iterations than other models. After one iteration it is barely better than gpt-5.1, while after 5 it is almost 10pp ahead.

Håvard Ihle

@htihle

23 hours

gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability. Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.

4

8

130

Håvard Ihle

@htihle

23 hours

Important info about WeirdML! While looking at gemini's results I figured out that there was a clearly false statement in the prompt to my two shuffle tasks. I did not notice it earlier since many models did well on the tasks by using approaches that did not rely on this

4

56

Håvard Ihle

@htihle

23 hours

gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability. Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.

5

10

145

Håvard Ihle

@htihle

2 days

gpt-5.1 is the new leader on WeirdML with 56.9%, beating out gpt-5 (at 56.3%). It uses a few more tokens and does a bit better, but the difference is not significant on these tasks.

1

2

21

Peter Barnett

@peterbarnett_

2 days

We at the MIRI Technical Governance Team just put out a report describing an example international agreement to prevent the creation of superintelligence. 🧵

8

16

103

Chase Brower

@ChaseBrowe32432

2 days

Gemini 3 Pro (preview) scores 91% on VPCT (spatial reasoning) Uhhhh jesus christ

66

122

2K

Håvard Ihle

@htihle

8 days

LLMs can score at least 85% on WeirdML, while gpt-5 only gets an average score of 56.3%! The rest is a question of reliably matching the best LLM performances. Using the best individual score on each task, we find a lower bound on the maximum possible score on WeirdML, 85.2%. I

0

1

26
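
The lower bound described in the post above (take the best individual score on each task, then average over tasks) can be sketched in a few lines. This is only an illustration under stated assumptions: the model names, task names, and scores below are made up, the function name `best_per_task_bound` is hypothetical, and the actual benchmark may aggregate runs differently.

```python
def mean_score(per_task):
    """Average score of one model over its tasks (scores in percent)."""
    return sum(per_task.values()) / len(per_task)


def best_per_task_bound(scores):
    """Lower bound on the achievable average: best score per task, then mean.

    scores: {model_name: {task_name: score_in_percent}}
    """
    tasks = {task for per_task in scores.values() for task in per_task}
    best = {
        task: max(per_task[task] for per_task in scores.values() if task in per_task)
        for task in tasks
    }
    return sum(best.values()) / len(best)


# Hypothetical data, not real WeirdML results.
example_scores = {
    "model_a": {"task_1": 90, "task_2": 20},
    "model_b": {"task_1": 40, "task_2": 70},
}

averages = {name: mean_score(per_task) for name, per_task in example_scores.items()}
bound = best_per_task_bound(example_scores)
print(averages, bound)
```

In this toy example each model averages 55, while the per-task best gives 80, which mirrors the gap between a single model's average and the combined bound.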

Håvard Ihle

@htihle

9 days

I ran the model through novita on openrouter, as that was one of the providers with a good score on their vendor overview.

2

0

9

Håvard Ihle

@htihle

9 days

Kimi-k2-thinking is the best Chinese model on WeirdML, scoring 42.1% and matching Opus 4, 4.1 and grok-4. The results were very variable, with many mediocre scores on hard tasks and occasional really great scores. It also has a high failure rate on individual iterations,

6

11

151

Rob Wiblin

@robertwiblin

23 days

OK the MOU between OpenAI and the California AG adds a lot of info here on what was likely pushed on them (a lot of good things): https://t.co/c4m7m6lghh "Mission-Only Fiduciary Duty (#8) "The PBC Certificate of Incorporation contains a provision consistent with Section 141(a)

Rob Wiblin

@robertwiblin

23 days

My questions (all not clear from their blog post): • Have the attorneys general approved this plan? • In what sense will the foundation 'remain in control' of the Public Benefit Corporation, other than the ability to hire and fire PBC directors? • What will the PBC do to

4

12

105

Zvi Mowshowitz

@TheZvi

23 days

For those asking 'why is it theft' I think this is the best single explanation, except OpenAI is now worth $500b+ so the case is even starker and bigger:

1

3

105

Zvi Mowshowitz

@TheZvi

23 days

They're moving to complete the greatest theft in human history (or perhaps second biggest, if you count what happened around the dissolution of the USSR). Are we going to let them get away with this?

OpenAI

@OpenAI

23 days

We completed our recapitalization. The non-profit, the OpenAI Foundation, is now one of the best resourced philanthropies ever, with equity valued at ~$130B. It continues to control the OpenAI for-profit, which is now a public benefit corporation. https://t.co/TevJDA3QwB

45

65

1K

Eliezer Yudkowsky

@allTheYud

25 days

Several commenters said that this was the interview of mine they found most helpful. (Chris Williamson let me give multi-minute answers to the questions that required them.)

27

20

292

Håvard Ihle

@htihle

28 days

I signed this statement! We should not build AI much smarter than ourselves before we understand what we are doing well enough to be very confident that the resulting ASI is aligned with our values/will create a future that is good. The fact that it is not really well defined

Max Tegmark

@tegmark

29 days

A stunningly broad coalition has come out against Skynet: AI researchers, faith leaders, business pioneers, policymakers, NatSec folks and actors stand together, from Bannon & Beck to Hinton, Wozniak & Prince Harry. We stand together because we want a human future.

0

2

Håvard Ihle

@htihle

1 month

Claude haiku 4.5 (non-thinking) scores 43.6% on WeirdML, beating all other small-tier models except ones from OpenAI. Looking at the bunching up of OpenAI models and Anthropic models, it seems that which training stack you are on is more important than the model size, which is

2

3

56

Håvard Ihle

@htihle

1 month

Full blog post: https://t.co/judgnrdthM Here are some of the responses that made me laugh out loud.

0

2

Håvard Ihle

@htihle

1 month

Can LLMs coordinate? I instructed 5 leading LLMs to produce responses to 75 prompts that would exactly match the responses of the other models to the same prompts. The results were funny and interesting! Link to full blog post and some funny responses in the thread.

1

0

4
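
One way to put a number on the coordination experiment described above is the fraction of model pairs that gave exactly the same response to the same prompt. The sketch below is an assumption, not the scoring used in the blog post: the function name `pairwise_match_rate`, the model names, and the responses are all hypothetical.

```python
from itertools import combinations


def pairwise_match_rate(responses):
    """Fraction of (model pair, prompt) combinations with identical responses.

    responses: {model_name: [response to prompt 0, response to prompt 1, ...]}
    All lists are assumed to have the same length and prompt order.
    """
    models = list(responses)
    n_prompts = len(responses[models[0]])
    matches = 0
    total = 0
    for a, b in combinations(models, 2):
        for i in range(n_prompts):
            total += 1
            # Strict string equality: "exactly match" is taken literally here.
            if responses[a][i] == responses[b][i]:
                matches += 1
    return matches / total


# Hypothetical responses from three models to two prompts.
example_responses = {
    "model_a": ["blue", "7"],
    "model_b": ["blue", "42"],
    "model_c": ["red", "7"],
}

rate = pairwise_match_rate(example_responses)
print(f"{rate:.3f}")
```

Strict equality is the simplest reading of "exactly match"; normalizing case or whitespace before comparing would be a separate design choice.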