Håvard Ihle Profile
Håvard Ihle

@htihle

Followers
707
Following
77K
Media
84
Statuses
357

AI researcher, former cosmologist

Joined December 2010
@htihle
Håvard Ihle
5 months
WeirdML v2 is now out! The update includes a bunch of new tasks (now 19 tasks total, up from 6), and results from all the latest models. We now also track API costs and other metadata, which give more insight into the different models. The new results are shown in these two
@htihle
Håvard Ihle
10 months
Excited to share the results from WeirdML - a benchmark testing LLMs' ability to solve weird and unusual machine learning tasks by writing working PyTorch code and iteratively learning from feedback.
8
12
96
@ChaseBrowe32432
Chase Brower
21 hours
@htihle In some sense it is precisely this (learning/improving from exploration efficiently) that really matters https://t.co/CFR572aadW
1
1
7
@htihle
Håvard Ihle
23 hours
One of the most striking things about gemini-3-pro is how much better it is with several iterations. It makes better use of the information from the previous iterations than other models. After one iteration it is barely better than gpt-5.1, while after 5 it is almost 10pp ahead.
@htihle
Håvard Ihle
23 hours
gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability. Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.
4
8
130
@htihle
Håvard Ihle
23 hours
Important info about WeirdML! While looking at Gemini's results I figured out that there was a clearly false statement in the prompt to my two shuffle tasks. I did not notice it earlier since many models did well on the tasks by using approaches that did not rely on this
4
4
56
@htihle
Håvard Ihle
23 hours
gemini-3-pro takes a clear lead in WeirdML with 69.9%, achieving a new best individual score on 7 of the 17 tasks, and showing a clear step up in capability. Although there is still quite a way to go, models are now starting to reliably score well even on the difficult tasks.
5
10
145
@htihle
Håvard Ihle
2 days
gpt-5.1 is the new leader on WeirdML with 56.9%, beating out gpt-5 (at 56.3%). It uses slightly more tokens and does a bit better, but the difference is not significant on these tasks.
1
2
21
@peterbarnett_
Peter Barnett
2 days
We at the MIRI Technical Governance Team just put out a report describing an example international agreement to prevent the creation of superintelligence. 🧵
8
16
103
@ChaseBrowe32432
Chase Brower
2 days
Gemini 3 Pro (preview) scores 91% on VPCT (spatial reasoning) Uhhhh jesus christ
66
122
2K
@htihle
Håvard Ihle
8 days
LLMs can score at least 85% on WeirdML, while gpt-5 only gets an average score of 56.3%! The rest is a question of reliably matching the best LLM performances. Using the best individual score on each task, we find a lower bound on the maximum possible score on WeirdML, 85.2%. I
0
1
26
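The lower-bound idea in the post above (take the best score achieved on each task, then average over tasks) can be sketched as follows. This is only an illustration, not the benchmark's actual code: the function name `best_per_task_bound`, the nested-dict layout, and all numbers are hypothetical, and it assumes "best individual score" means the highest score any model reached on a task.

```python
from statistics import mean


def best_per_task_bound(scores):
    """Mean over tasks of the best score any model achieved on that task.

    scores: {model_name: {task_name: score in [0, 1]}} (hypothetical layout).
    """
    # Collect every task that at least one model was scored on.
    tasks = set().union(*(per_task.keys() for per_task in scores.values()))
    # For each task take the maximum over models, then average across tasks.
    return mean(
        max(per_task[t] for per_task in scores.values() if t in per_task)
        for t in tasks
    )


# Made-up scores for illustration only, not WeirdML results.
scores = {
    "model_a": {"task_1": 0.9, "task_2": 0.2},
    "model_b": {"task_1": 0.5, "task_2": 0.6},
}

bound = best_per_task_bound(scores)
averages = {model: mean(per_task.values()) for model, per_task in scores.items()}
```

By construction the bound can never be below the best single-model average, which is why it reads as a floor on what is achievable rather than a score any one model has reached.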
@htihle
Håvard Ihle
9 days
I ran the model through Novita on OpenRouter, as that was one of the providers with a good score on their vendor overview.
2
0
9
@htihle
Håvard Ihle
9 days
Kimi-k2-thinking is the best Chinese model on WeirdML, scoring 42.1% and matching Opus 4, 4.1 and grok-4. The results were very variable, with many mediocre scores on hard tasks and occasional really great scores. It also has a high failure rate on individual iterations,
6
11
151
@robertwiblin
Rob Wiblin
23 days
OK the MOU between OpenAI and the California AG adds a lot of info here on what was likely pushed on them (a lot of good things): https://t.co/c4m7m6lghh "Mission-Only Fiduciary Duty (#8) "The PBC Certificate of Incorporation contains a provision consistent with Section 141(a)
@robertwiblin
Rob Wiblin
23 days
My questions (all not clear from their blog post): • Have the attorneys general approved this plan? • In what sense will the foundation 'remain in control' of the Public Benefit Corporation, other than the ability to hire and fire PBC directors? • What will the PBC do to
4
12
105
@TheZvi
Zvi Mowshowitz
23 days
For those asking 'why is it theft' I think this is the best single explanation, except OpenAI is now worth $500b+ so the case is even starker and bigger:
1
3
105
@TheZvi
Zvi Mowshowitz
23 days
They're moving to complete the greatest theft in human history (or perhaps second biggest, if you count what happened around the dissolution of the USSR). Are we going to let them get away with this?
@OpenAI
OpenAI
23 days
We completed our recapitalization. The non-profit, the OpenAI Foundation, is now one of the best resourced philanthropies ever, with equity valued at ~$130B. It continues to control the OpenAI for-profit, which is now a public benefit corporation. https://t.co/TevJDA3QwB
45
65
1K
@allTheYud
Eliezer Yudkowsky
25 days
Several commenters said that this was the interview of mine they found most helpful. (Chris Williamson let me give multi-minute answers to the questions that required them.)
27
20
292
@htihle
Håvard Ihle
28 days
I signed this statement! We should not build AI much smarter than ourselves before we understand what we are doing well enough to be very confident that the resulting ASI is aligned with our values/will create a future that is good. The fact that it is not really well defined
@tegmark
Max Tegmark
29 days
A stunningly broad coalition has come out against Skynet: AI researchers, faith leaders, business pioneers, policymakers, NatSec folks and actors stand together, from Bannon & Beck to Hinton, Wozniak & Prince Harry. We stand together because we want a human future.
0
0
2
@htihle
Håvard Ihle
1 month
Claude Haiku 4.5 (non-thinking) scores 43.6% on WeirdML, beating all other small-tier models except ones from OpenAI. Looking at the bunching up of OpenAI models and Anthropic models, it seems that which training stack you are on is more important than the model size, which is
2
3
56
@htihle
Håvard Ihle
1 month
Full blog post: https://t.co/judgnrdthM Here are some of the responses that made me laugh out loud.
0
0
2
@htihle
Håvard Ihle
1 month
Can LLMs coordinate? I instructed 5 leading LLMs to produce responses to 75 prompts that would exactly match the responses of the other models to the same prompts. The results were funny and interesting! Link to full blog post and some funny responses in the thread.
1
0
4