Philipp Schoenegger

@SchoeneggerPhil

Followers: 2K · Following: 17K · Media: 583 · Statuses: 3K

@Microsoft AI Futures

London, England
Joined April 2014
@StefanFSchubert
Stefan Schubert
2 days
In a @metaculus forecasting tournament, ten human pros and 96 bots participated. Each human beat all the bots. The big drop-off in the chart below is between the last human and the first bot.
7
23
208
@mustafasuleyman
Mustafa Suleyman
7 days
All of today's @Copilot announcements boil down to one core idea: we're betting on humanist AI. An AI that always puts humans first.
- Copilot Groups
- AI browser
- our new character Mico
- memory updates
- Copilot for health
+ more in this morning's event https://t.co/GNtBAC8Nh6
128
178
1K
@mustafasuleyman
Mustafa Suleyman
16 days
Meet our third @MicrosoftAI model: MAI-Image-1. #9 on LMArena, striking an impressive balance of generation speed and quality. Excited to keep refining + climbing the leaderboard from here! We're just getting started. https://t.co/33BiNfIjPg
36
82
510
@Research_FRI
Forecasting Research Institute
22 days
Is AI on track to match top human forecasters at predicting the future? Today, FRI is releasing an update to ForecastBench—our benchmark that tracks how accurate LLMs are at forecasting real-world events. A trend extrapolation of our results suggests LLMs will reach
8
29
118
@SchoeneggerPhil
Philipp Schoenegger
5 months
Some personal news! Next month I will be joining Microsoft AI, working on the economic effects of advanced AI. After an amazing time at LSE, I'm really excited to contribute to this important area of research at Microsoft during such a pivotal moment for AI!
5
1
72
@_simonsmith
Simon Smith
5 months
Prompt engineering has negligible and sometimes negative effects on models' ability to forecast. I feel like this reflects decreasing benefit of prompt engineering as models get more sophisticated, but it could simply mean that we haven't yet discovered a good forecasting prompt.
@SchoeneggerPhil
Philipp Schoenegger
5 months
New preprint with @CamrobJones @PTetlock and Mellers! We test how much prompt engineering can impact LLM forecasting capabilities of o1, o1-mini, 4o, Sonnet, Haiku & Llama, finding that simple-to-moderate prompt engineering has little or no effect, with some prompts backfiring!
0
1
5
@camrobjones
Cameron Jones
5 months
Coupled with other results like @SchoeneggerPhil's other recent paper on the effectiveness of RL for improving forecasting it seems like general prompting might not be the most promising place to push atm.
arxiv.org
Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR...
1
1
4
@camrobjones
Cameron Jones
5 months
Really enjoyed working on this with @SchoeneggerPhil @PTetlock and Barbara Mellers. We tried ~50 prompt techniques (including AI classics and more theory-motivated ones) on 100 forecasting questions across 6 LLMs. No prompt showed robust improvements! https://t.co/NTnkgrWhPO
arxiv.org
Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt...
@SchoeneggerPhil
Philipp Schoenegger
5 months
New preprint with @CamrobJones @PTetlock and Mellers! We test how much prompt engineering can impact LLM forecasting capabilities of o1, o1-mini, 4o, Sonnet, Haiku & Llama, finding that simple-to-moderate prompt engineering has little or no effect, with some prompts backfiring!
1
8
16
@SchoeneggerPhil
Philipp Schoenegger
5 months
Overall, our preregistered analyses suggest that low-to-medium effort prompt engineering is unlikely to be an especially effective way of improving forecasting performance of current LLMs. Full paper with all prompts in the appendix:
arxiv.org
Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt...
0
0
4
@SchoeneggerPhil
Philipp Schoenegger
5 months
A few specific findings: 1) Frequency-based reasoning, base-rate first, and CoT prompts do improve accuracy (slightly). 2) Simple Bayesian reasoning prompts surprisingly backfire. 3) More elaborate prompts have similarly small effects, though some prompts still backfire.
1
1
4
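The prompt variants named in the thread (frequency-based reasoning, base-rate first, chain-of-thought, Bayesian reasoning) can be sketched as prefixes attached to a forecasting question. A minimal illustration follows; the prefix wordings and function names here are hypothetical stand-ins, not the actual prompts from the paper's appendix.

```python
# Hypothetical sketch of prompt-prefix variants like those described in the
# thread. The wording of each prefix is illustrative only, not taken from
# the paper's appendix.

PROMPT_VARIANTS = {
    "baseline": "",
    "base_rate_first": (
        "Before answering, state the historical base rate for events "
        "of this kind, then adjust for case-specific evidence.\n"
    ),
    "chain_of_thought": "Think step by step before giving your probability.\n",
    "frequency": (
        "Imagine 100 similar situations. In how many of them does the "
        "event occur? Report that count as a probability.\n"
    ),
}

def build_forecast_prompt(question: str, variant: str) -> str:
    """Prepend a prompt-engineering prefix to a forecasting question."""
    prefix = PROMPT_VARIANTS[variant]
    return (
        f"{prefix}Question: {question}\n"
        "Answer with a single probability between 0 and 1."
    )

# Usage: build one prompt per variant and compare forecast accuracy.
prompt = build_forecast_prompt(
    "Will the incumbent win the next election?", "base_rate_first"
)
```

An experiment in this style would send each variant's prompt to every model and score the returned probabilities against resolved outcomes.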
@SchoeneggerPhil
Philipp Schoenegger
5 months
New preprint with @CamrobJones @PTetlock and Mellers! We test how much prompt engineering can impact LLM forecasting capabilities of o1, o1-mini, 4o, Sonnet, Haiku & Llama, finding that simple-to-moderate prompt engineering has little or no effect, with some prompts backfiring!
3
10
36
@JeffLadish
Jeffrey Ladish
5 months
This study is important because it showed that AI outperforms incentivized humans when persuading people of both true and false claims. Persuasion has always been an extremely valuable skill in business, politics, and most other competitive domains. https://t.co/0jLClKXrdr
arxiv.org
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz...
2
4
23
@bohannon_bot
John Bohannon
5 months
💰💰💰Prediction markets are going to get weird. Now we have a smallish open source LLM (14B) that can be trained to predict messy real-world outcomes better than GPT o1. ~thread~
arxiv.org
Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR...
1
1
2
@MichelIvan92347
Agent B
5 months
Interesting work here on a 14B LLM for PM forecasting 👇 Check the adapted RL in particular. Nice results on calibration error. This opens the door for production tools in the domain, imo. Bravo to the team! 👏 Second interesting work in a few months on this underrated topic.
@SchoeneggerPhil
Philipp Schoenegger
5 months
New preprint! We trained a 14B LLM on 110k yes/no events with outcome-only RL (ReMax + GRPO tweaks), matching frontier model o1 on accuracy and halving its calibration error, yielding a hypothetical $127 vs $92 profit (+$35).
0
1
3
@SchoeneggerPhil
Philipp Schoenegger
5 months
Co-authored with: @BTurtel, @DanTheManDev, Kris Skotheim, and @lukebeehewitt. Paper: https://t.co/tOCNT1Wo9f
0
0
1
@SchoeneggerPhil
Philipp Schoenegger
5 months
New preprint! We trained a 14B LLM on 110k yes/no events with outcome-only RL (ReMax + GRPO tweaks), matching frontier model o1 on accuracy and halving its calibration error, yielding a hypothetical $127 vs $92 profit (+$35).
6
6
23
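The thread describes outcome-only RL on yes/no events, judged on accuracy and calibration error. A minimal sketch of the two scoring ingredients is below: a verifiable Brier-style reward computed from the outcome alone, and a simple expected-calibration-error estimate. This is an illustration under stated assumptions, not the paper's implementation; the ReMax/GRPO training machinery is omitted entirely.

```python
# Minimal sketch (not the paper's code): a verifiable, outcome-only reward
# for yes/no forecasts, plus a simple expected-calibration-error estimate.

def brier_reward(p: float, outcome: int) -> float:
    """Negated Brier score: higher is better, with a maximum of 0.0
    for a fully confident, correct forecast."""
    return -((p - outcome) ** 2)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin forecasts by predicted probability, then average the gap between
    each bin's mean prediction and its empirical outcome frequency,
    weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        freq = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_p - freq)
    return ece

# A perfectly confident, correct forecast earns the maximum reward of 0.0.
assert brier_reward(1.0, 1) == 0.0
```

Because the reward depends only on the resolved outcome, it is automatically verifiable at scale, which is what makes RLVR-style training on 110k historical events feasible.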
@BTurtel
Ben Turtel
5 months
🚨 New preprint from @lightningrodai, in collaboration with @SchoeneggerPhil & @lukebeehewitt 🚨 We trained a compact reasoning model that's state-of-the-art at predicting the future. We massively outperform frontier models at prediction market betting, despite being a fraction
1
5
12
@fraslv
Francesco Salvi
5 months
I also have another preprint out with @SchoeneggerPhil et al. showing similar results on Claude Sonnet 3.5 in interactive quizzes with highly incentivised humans, both in truthful and deceptive persuasion. More on this at:
@SchoeneggerPhil
Philipp Schoenegger
6 months
New preprint out with an amazing 40-person team! We find that Claude 3.5 Sonnet outperforms incentivised human persuaders in a >1000-participant live quiz-chat in deceptive and truthful directions!
0
1
2
@mgreinecke
Madeline G. Reinecke is on bsky
5 months
Proud to be a small part of this fantastic team -- check out our pre-print on #LLM #persuasion at https://t.co/zAmqQkbumY.
arxiv.org
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz...
@SchoeneggerPhil
Philipp Schoenegger
6 months
New preprint out with an amazing 40-person team! We find that Claude 3.5 Sonnet outperforms incentivised human persuaders in a >1000-participant live quiz-chat in deceptive and truthful directions!
0
1
4