Philipp Schoenegger (@SchoeneggerPhil)
Followers 2K · Following 17K · Media 583 · Statuses 3K
@Microsoft AI Futures · London, England · Joined April 2014
In a @metaculus forecasting tournament, ten human pros and 96 bots participated. Each human beat all the bots: the big drop-off in the chart below is between the last human and the first bot.
7 replies · 23 reposts · 208 likes
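For context on how forecasters get ranked in tournaments like this: entrants submit probabilities, and resolved questions are scored with a proper scoring rule. A minimal sketch using the Brier score (a standard metric; Metaculus's actual tournament scoring is more involved) on hypothetical forecasts:

```python
def brier(prob: float, outcome: int) -> float:
    """Squared error between the forecast probability and the 0/1 outcome."""
    return (prob - outcome) ** 2

# Hypothetical forecasts: {forecaster: [(probability, resolved outcome), ...]}
forecasts = {
    "human_pro": [(0.9, 1), (0.2, 0), (0.7, 1)],
    "bot": [(0.6, 1), (0.5, 0), (0.5, 1)],
}

def mean_brier(preds):
    return sum(brier(p, o) for p, o in preds) / len(preds)

# Rank ascending: lower mean Brier score = better forecaster.
for name in sorted(forecasts, key=lambda n: mean_brier(forecasts[n])):
    print(f"{name}: mean Brier = {mean_brier(forecasts[name]):.3f}")
```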
All of today's @Copilot announcements boil down to one core idea: we're betting on humanist AI. An AI that always puts humans first.
- Copilot Groups
- AI browser
- our new character Mico
- memory updates
- Copilot for health
+ more in this morning's event https://t.co/GNtBAC8Nh6
128 replies · 178 reposts · 1K likes
Meet our third @MicrosoftAI model: MAI-Image-1
#9 on LMArena, striking an impressive balance of generation speed and quality.
Excited to keep refining + climbing the leaderboard from here! We're just getting started. https://t.co/33BiNfIjPg
36 replies · 82 reposts · 510 likes
Is AI on track to match top human forecasters at predicting the future? Today, FRI is releasing an update to ForecastBench, our benchmark that tracks how accurate LLMs are at forecasting real-world events. A trend extrapolation of our results suggests LLMs will reach …
8 replies · 29 reposts · 118 likes
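The "trend extrapolation" here is a simple fit-and-project exercise. A minimal sketch with entirely hypothetical scores and a hypothetical human baseline, not FRI's actual numbers:

```python
import numpy as np

years = np.array([2023.5, 2024.0, 2024.5, 2025.0])  # hypothetical eval dates
scores = np.array([0.62, 0.68, 0.73, 0.77])         # hypothetical accuracy
human_baseline = 0.85                                # hypothetical expert level

# Fit a line to score-over-time, then solve for the crossover date.
slope, intercept = np.polyfit(years, scores, deg=1)
crossover = (human_baseline - intercept) / slope
print(f"Projected to reach the human baseline around {crossover:.1f}")
```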
Some personal news! Next month I will be joining Microsoft AI, working on the economic effects of advanced AI. After an amazing time at LSE, I'm really excited to contribute to this important area of research at Microsoft during such a pivotal moment for AI!
5 replies · 1 repost · 72 likes
Prompt engineering has negligible and sometimes negative effects on models' ability to forecast. I feel like this reflects the decreasing benefit of prompt engineering as models get more sophisticated, but it could simply mean that we haven't yet discovered a good forecasting prompt.
0 replies · 1 repost · 5 likes
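The underlying evaluation pattern is straightforward: run the same resolved questions under each prompt and compare mean Brier scores against a plain baseline. A hedged sketch; the prompt wordings and the `ask_model` client are hypothetical stand-ins, not the paper's materials:

```python
def brier(prob: float, outcome: int) -> float:
    return (prob - outcome) ** 2

# Hypothetical paraphrases of prompt-technique styles, not the paper's prompts.
PROMPTS = {
    "baseline": "What is the probability that: {q}? Answer with a number in [0, 1].",
    "base_rate_first": "First state the base rate, then adjust. {q}? Answer in [0, 1].",
    "cot": "Think step by step, then give a probability. {q}? Answer in [0, 1].",
}

def evaluate(questions, ask_model):
    """questions: [(text, resolved 0/1 outcome)]; ask_model: prompt -> probability.

    Returns the mean Brier score per prompt technique (lower is better).
    """
    return {
        name: sum(brier(ask_model(tpl.format(q=q)), y) for q, y in questions)
        / len(questions)
        for name, tpl in PROMPTS.items()
    }
```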
Coupled with other results, like @SchoeneggerPhil's recent paper on the effectiveness of RL for improving forecasting, it seems like general prompting might not be the most promising place to push atm.
arxiv.org
Reinforcement Learning with Verifiable Rewards (RLVR) has been an effective approach for improving Large Language Models' reasoning in domains such as coding and mathematics. Here, we apply RLVR...
1 reply · 1 repost · 4 likes
Really enjoyed working on this with @SchoeneggerPhil @PTetlock and Barbara Mellers. We tried ~50 prompt techniques (including AI classics and more theory-motivated ones) on 100 forecasting questions across 6 LLMs. No prompt showed robust improvements! https://t.co/NTnkgrWhPO
arxiv.org
Large language model performance can be improved in a large number of ways. Many such techniques, like fine-tuning or advanced tool usage, are time-intensive and expensive. Although prompt...
1 reply · 8 reposts · 16 likes
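With ~50 prompts tested on only 100 questions, apparent gains can easily be noise, so robustness checks matter. One common approach (illustrative only; the paper's preregistered analysis may differ) is a bootstrap confidence interval on per-question score differences:

```python
import random

def bootstrap_ci(diffs, n_boot=10_000, alpha=0.05):
    """95% CI for the mean of per-question score differences."""
    means = []
    for _ in range(n_boot):
        sample = random.choices(diffs, k=len(diffs))  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Hypothetical per-question differences (prompt Brier minus baseline Brier;
# negative means the prompt helped on that question):
diffs = [random.gauss(-0.005, 0.05) for _ in range(100)]
lo, hi = bootstrap_ci(diffs)
print(f"Mean difference 95% CI: [{lo:.3f}, {hi:.3f}]")
# Only if the interval excludes 0 would the prompt's effect look robust.
```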
Overall, our preregistered analyses suggest that low-to-medium effort prompt engineering is unlikely to be an especially effective way of improving the forecasting performance of current LLMs. Full paper, with all prompts in the appendix:
0 replies · 0 reposts · 4 likes
A few specific findings:
1) Frequency-based reasoning, base-rate-first, and CoT prompts do improve accuracy (slightly).
2) Simple Bayesian reasoning prompts surprisingly backfire.
3) More elaborate prompts have similarly small effects, though some still backfire.
1 reply · 1 repost · 4 likes
New preprint with @CamrobJones @PTetlock and Mellers! We test how much prompt engineering can impact LLM forecasting capabilities of o1, o1-mini, 4o, Sonnet, Haiku & Llama, finding that simple-to-moderate prompt engineering has little or no effect, with some prompts backfiring!
3 replies · 10 reposts · 36 likes
This study is important because it showed that AI outperforms incentivized humans when persuading people of both true and false claims.
Persuasion has always been an extremely valuable skill in business, politics, and most other competitive domains. https://t.co/0jLClKXrdr
arxiv.org
We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz...
2 replies · 4 reposts · 23 likes
💰💰💰 Prediction markets are going to get weird. Now we have a smallish open-source LLM (14B) that can be trained to predict messy real-world outcomes better than OpenAI's o1. ~thread~
1 reply · 1 repost · 2 likes
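The core of RLVR in this setting is that the only training signal is how the model's stated probability scores against the event's eventual 0/1 resolution. A minimal sketch of such an outcome-only reward (a negative Brier score here; the paper's actual ReMax/GRPO setup adds more machinery):

```python
def outcome_reward(predicted_prob: float, resolved_outcome: int) -> float:
    """Reward in [-1, 0]: closer to 0 means a more confident correct forecast."""
    return -((predicted_prob - resolved_outcome) ** 2)

# Suppose the event resolves YES (1):
print(outcome_reward(0.9, 1))  # -0.01  (good forecast, high reward)
print(outcome_reward(0.3, 1))  # -0.49  (poor forecast, low reward)
```

Because the reward is a proper scoring rule, the model maximizes it by reporting its honest probability rather than hedging at 0.5 or overclaiming certainty.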
Outcome-Based Reinforcement Learning to Predict the Future https://t.co/P62etMNyBe (https://t.co/kVbttuoSy6)
0 replies · 1 repost · 1 like
Interesting work here on a 14B LLM for prediction-market forecasting 👇 Check the adapted RL in particular. Nice results on calibration error. This opens the door for production tools in the domain, imo. Bravo to the team! 👏 Second interesting work in a few months on this underrated topic.
0 replies · 1 repost · 3 likes
Co-authored with @BTurtel, @DanTheManDev, Kris Skotheim, and @lukebeehewitt
Paper: https://t.co/tOCNT1Wo9f
0 replies · 0 reposts · 1 like
New preprint! We trained a 14B LLM on 110k yes/no events with outcome-only RL (ReMax + GRPO tweaks), matching frontier model o1 on accuracy and halving its calibration error, yielding a hypothetical $127 vs $92 profit (+$35).
6 replies · 6 reposts · 23 likes
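On "halving its calibration error": one common way to measure calibration is expected calibration error (ECE), the gap between stated probabilities and observed outcome frequencies, averaged over probability bins. Whether the paper uses exactly this metric is an assumption; a minimal sketch:

```python
def ece(probs, outcomes, n_bins=10):
    """Expected calibration error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)   # mean stated probability
        freq = sum(y for _, y in b) / len(b)    # observed frequency of YES
        err += len(b) / total * abs(avg_p - freq)
    return err

# Hypothetical forecasts and resolutions:
print(ece([0.9, 0.8, 0.1, 0.3], [1, 1, 0, 1]))
```

A well-calibrated forecaster's "70%" events resolve YES about 70% of the time, so ECE approaches 0.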
🚨 New preprint from @lightningrodai, in collaboration with @SchoeneggerPhil & @lukebeehewitt 🚨
We trained a compact reasoning model that's state-of-the-art at predicting the future. We massively outperform frontier models at prediction market betting, despite being a fraction …
1 reply · 5 reposts · 12 likes
I also have another preprint out with @SchoeneggerPhil et al. showing similar results on Claude Sonnet 3.5 in interactive quizzes with highly incentivised humans, in both truthful and deceptive persuasion. More on this at:
New preprint out with an amazing 40-person team! We find that Claude 3.5 Sonnet outperforms incentivised human persuaders in a >1000-participant live quiz-chat in deceptive and truthful directions!
0 replies · 1 repost · 2 likes
Proud to be a small part of this fantastic team -- check out our pre-print on #LLM #persuasion at https://t.co/zAmqQkbumY.
0 replies · 1 repost · 4 likes