Philipp Schoenegger (@SchoeneggerPhil)
Followers 2K · Following 17K · Media 583 · Statuses 3K
@Microsoft AI Futures · London, England · Joined April 2014
In a @metaculus forecasting tournament, ten human pros and 96 bots participated. Each human beat all the bots: the big drop-off in the chart below is between the last human and the first bot.
7 replies · 23 reposts · 208 likes
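For context on how forecasters get ranked in tournaments like this: entrants submit probabilities, and resolved questions are scored with a proper scoring rule. A minimal sketch using the Brier score (a standard metric; Metaculus's actual tournament scoring is more involved) on hypothetical forecasts:

```python
def brier(prob: float, outcome: int) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (prob - outcome) ** 2

# Hypothetical forecasts: {forecaster: [(probability, resolved outcome), ...]}
forecasts = {
    "human_pro": [(0.9, 1), (0.2, 0), (0.7, 1)],
    "bot": [(0.6, 1), (0.5, 0), (0.5, 1)],
}

def mean_brier(preds):
    return sum(brier(p, o) for p, o in preds) / len(preds)

# Rank ascending: lower mean Brier score = better forecaster.
for name in sorted(forecasts, key=lambda n: mean_brier(forecasts[n])):
    print(f"{name}: mean Brier = {mean_brier(forecasts[name]):.3f}")
```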
All of today's @Copilot announcements boil down to one core idea: we're betting on humanist AI. An AI that always puts humans first.
- Copilot Groups
- AI browser
- our new character Mico
- memory updates
- Copilot for health
+ more in this morning's event https://t.co/GNtBAC8Nh6
128 replies · 178 reposts · 1K likes
Meet our third @MicrosoftAI model: MAI-Image-1
#9 on LMArena, striking an impressive balance of generation speed and quality.
Excited to keep refining + climbing the leaderboard from here! We're just getting started. https://t.co/33BiNfIjPg
36 replies · 82 reposts · 510 likes
Is AI on track to match top human forecasters at predicting the future? Today, FRI is releasing an update to ForecastBench, our benchmark that tracks how accurate LLMs are at forecasting real-world events. A trend extrapolation of our results suggests LLMs will reach …
8 replies · 29 reposts · 118 likes
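The "trend extrapolation" here is a simple fit-and-project exercise. A minimal sketch with entirely hypothetical scores and a hypothetical human baseline, not FRI's actual numbers:

```python
import numpy as np

years = np.array([2023.5, 2024.0, 2024.5, 2025.0])  # hypothetical eval dates
scores = np.array([0.62, 0.68, 0.73, 0.77])         # hypothetical accuracy
human_baseline = 0.85                                # hypothetical expert level

# Fit a line to score-over-time, then solve for the crossover date.
slope, intercept = np.polyfit(years, scores, deg=1)
crossover = (human_baseline - intercept) / slope
print(f"Projected to reach the human baseline around {crossover:.1f}")
```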
Some personal news! Next month I will be joining Microsoft AI, working on the economic effects of advanced AI. After an amazing time at LSE, I'm really excited to contribute to this important area of research at Microsoft during such a pivotal moment for AI!
5 replies · 1 repost · 72 likes
Prompt engineering has negligible and sometimes negative effects on models' ability to forecast. I feel like this reflects the decreasing benefit of prompt engineering as models get more sophisticated, but it could simply mean that we haven't yet discovered a good forecasting prompt.
0 replies · 1 repost · 5 likes
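The underlying evaluation pattern is straightforward: run the same resolved questions under each prompt and compare mean Brier scores against a plain baseline. A hedged sketch; the prompt wordings and the `ask_model` client are hypothetical stand-ins, not the paper's materials:

```python
def brier(prob: float, outcome: int) -> float:
    return (prob - outcome) ** 2

# Hypothetical paraphrases of prompt-technique styles, not the paper's prompts.
PROMPTS = {
    "baseline": "What is the probability that: {q}? Answer with a number in [0, 1].",
    "base_rate_first": "First state the base rate, then adjust. {q}? Answer in [0, 1].",
    "cot": "Think step by step, then give a probability. {q}? Answer in [0, 1].",
}

def evaluate(questions, ask_model):
    """questions: [(text, resolved 0/1 outcome)]; ask_model: prompt -> probability.

    Returns the mean Brier score per prompt technique (lower is better).
    """
    return {
        name: sum(brier(ask_model(tpl.format(q=q)), y) for q, y in questions)
        / len(questions)
        for name, tpl in PROMPTS.items()
    }
```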
Coupled with other results, like @SchoeneggerPhil's recent paper on the effectiveness of RL for improving forecasting, it seems like general prompting might not be the most promising place to push atm.
arxiv.org
Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR...
1 reply · 1 repost · 4 likes
Really enjoyed working on this with @SchoeneggerPhil @PTetlock and Barbara Mellers. We tried ~50 prompt techniques (including AI classics and more theory-motivated ones) on 100 forecasting questions across 6 LLMs. No prompt showed robust improvements! https://t.co/NTnkgrWhPO
arxiv.org
Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt...
1 reply · 8 reposts · 16 likes
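With ~50 prompts tested on only 100 questions, apparent gains can easily be noise, so robustness checks matter. One common approach (illustrative only; the paper's preregistered analysis may differ) is a bootstrap confidence interval on per-question score differences:

```python
import random

def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05):
    """95% CI for the mean of per-question score differences."""
    means = []
    for _ in range(n_boot):
        sample = random.choices(diffs, k=len(diffs))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Hypothetical per-question differences (prompt Brier minus baseline Brier;
# negative means the prompt helped on that question):
diffs = [random.gauss(-0.005, 0.05) for _ in range(100)]
lo, hi = bootstrap_ci(diffs)
print(f"Mean difference 95% CI: [{lo:.3f}, {hi:.3f}]")
# Only if the interval excludes 0 would the prompt's effect look robust.
```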
Overall, our preregistered analyses suggest that low-to-medium effort prompt engineering is unlikely to be an especially effective way of improving the forecasting performance of current LLMs. Full paper, with all prompts in the appendix:
0 replies · 0 reposts · 4 likes
A few specific findings:
1) Frequency-based reasoning, base-rate-first, and CoT prompts do improve accuracy (slightly).
2) Simple Bayesian reasoning prompts surprisingly backfire.
3) More elaborate prompts have similarly small effects, though some still backfire.
1 reply · 1 repost · 4 likes
New preprint with @CamrobJones @PTetlock and Mellers! We test how much prompt engineering can impact LLM forecasting capabilities of o1, o1-mini, 4o, Sonnet, Haiku & Llama, finding that simple-to-moderate prompt engineering has little or no effect, with some prompts backfiring!
3 replies · 10 reposts · 36 likes
This study is important because it showed that AI outperforms incentivized humans when persuading people of both true and false claims.
Persuasion has always been an extremely valuable skill in business, politics, and most other competitive domains. https://t.co/0jLClKXrdr
arxiv.org
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz...
2 replies · 4 reposts · 23 likes
💰💰💰 Prediction markets are going to get weird. Now we have a smallish open-source LLM (14B) that can be trained to predict messy real-world outcomes better than OpenAI's o1. ~thread~
1 reply · 1 repost · 2 likes
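The core of RLVR in this setting is that the only training signal is how the model's stated probability scores against the event's eventual 0/1 resolution. A minimal sketch of such an outcome-only reward (a negative Brier score here; the paper's actual ReMax/GRPO setup adds more machinery):

```python
def outcome_reward(predicted_prob: float, resolved_outcome: int) -> float:
    """Reward in [-1, 0]: closer to 0 means a more confident correct forecast."""
    return -((predicted_prob - resolved_outcome) ** 2)

# Suppose the event resolves YES (1):
print(outcome_reward(0.9, 1))  # -0.01  (good forecast, high reward)
print(outcome_reward(0.3, 1))  # -0.49  (poor forecast, low reward)
```

Because the reward is a proper scoring rule, the model maximizes it by reporting its honest probability rather than hedging at 0.5 or overclaiming certainty.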
Outcome-Based Reinforcement Learning to Predict the Future https://t.co/P62etMNyBe (https://t.co/kVbttuoSy6)
0 replies · 1 repost · 1 like
Interesting work here on a 14B LLM for prediction-market forecasting 👇 Check the adapted RL in particular. Nice results on calibration error. This opens the door for production tools in the domain, imo. Bravo to the team! 👏 Second interesting work in a few months on this underrated topic.
0 replies · 1 repost · 3 likes
Co-authored with @BTurtel, @DanTheManDev, Kris Skotheim, and @lukebeehewitt
Paper: https://t.co/tOCNT1Wo9f
0 replies · 0 reposts · 1 like
New preprint! We trained a 14B LLM on 110k yes/no events with outcome-only RL (ReMax + GRPO tweaks), matching frontier model o1 on accuracy and halving its calibration error, yielding a hypothetical $127 vs $92 profit (+$35).
6 replies · 6 reposts · 23 likes
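On "halving its calibration error": one common way to measure calibration is expected calibration error (ECE), the gap between stated probabilities and observed outcome frequencies, averaged over probability bins. Whether the paper uses exactly this metric is an assumption; a minimal sketch:

```python
def ece(probs, outcomes, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)   # mean stated probability
        freq = sum(y for _, y in b) / len(b)    # observed frequency of YES
        err += len(b) / total * abs(avg_p - freq)
    return err

# Hypothetical forecasts and resolutions:
print(ece([0.9, 0.8, 0.1, 0.3], [1, 1, 0, 1]))
```

A well-calibrated forecaster's "70%" events resolve YES about 70% of the time, so ECE approaches 0.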
🚨 New preprint from @lightningrodai, in collaboration with @SchoeneggerPhil & @lukebeehewitt 🚨
We trained a compact reasoning model that's state-of-the-art at predicting the future. We massively outperform frontier models at prediction market betting, despite being a fraction …
1 reply · 5 reposts · 12 likes
I also have another preprint out with @SchoeneggerPhil et al. showing similar results on Claude Sonnet 3.5 in interactive quizzes with highly incentivised humans, in both truthful and deceptive persuasion. More on this at:
New preprint out with an amazing 40-person team! We find that Claude 3.5 Sonnet outperforms incentivised human persuaders in a >1000-participant live quiz-chat in deceptive and truthful directions!
0 replies · 1 repost · 2 likes
Proud to be a small part of this fantastic team -- check out our pre-print on #LLM #persuasion at https://t.co/zAmqQkbumY.
0 replies · 1 repost · 4 likes