Lorenz Kuhn @_lorenzkuhn X Profile

Lorenz Kuhn

@_lorenzkuhn

Followers

1K

Following

961

Media

44

Statuses

247

Reasoning Research @OpenAI | o1-preview through o3

Joined January 2014

Lorenz Kuhn

@_lorenzkuhn

3 years

How can we measure how uncertain LLMs are about their generations? In our spotlight at #iclr2023 with @yaringal & @sebfar, we introduce "semantic entropy", the entropy over meanings rather than sequences, and show that it reliably measures LM uncertainty 🧵

arxiv.org

We introduce a method to measure uncertainty in large language models. For tasks like question answering, it is essential to know when we can trust the natural language outputs of foundation...

5

29

184
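
The idea in the post above, entropy over meanings rather than over sequences, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `same_meaning` equivalence check is a hypothetical stand-in (the post does not say how meanings are compared), and the sample answers and probabilities are invented.

```python
import math


def semantic_entropy(samples, same_meaning):
    """Entropy over groups of answers that share a meaning.

    samples: list of (text, probability) pairs for sampled generations.
    same_meaning: hypothetical function (a, b) -> bool; an assumption here,
        not the method from the paper.
    """
    clusters = []  # each entry is [representative_text, total_probability]
    for text, prob in samples:
        for cluster in clusters:
            if same_meaning(cluster[0], text):
                cluster[1] += prob  # merge into an existing meaning
                break
        else:
            clusters.append([text, prob])  # first answer with this meaning

    total = sum(mass for _, mass in clusters)
    # Shannon entropy (in nats) of the normalized per-meaning probabilities.
    return -sum((m / total) * math.log(m / total) for _, m in clusters if m > 0)


# Invented example: two phrasings of one answer, plus a different answer.
samples = [("Paris", 0.5), ("It is Paris", 0.3), ("Rome", 0.2)]

# Toy equivalence check: answers match if both mention Paris or neither does.
by_meaning = semantic_entropy(samples, lambda a, b: ("Paris" in a) == ("Paris" in b))

# Treating every distinct string as its own meaning gives sequence-level entropy.
by_sequence = semantic_entropy(samples, lambda a, b: a == b)
```

In this toy case the two phrasings of "Paris" are merged, so `by_meaning` comes out lower than `by_sequence`: rewording the same answer no longer counts as uncertainty.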

Lorenz Kuhn

@_lorenzkuhn

18 days

It's been a pleasure working on this with @ahelkky @andresnds @clavera_i @MostafaRohani and many others!

0

17

Lorenz Kuhn

@_lorenzkuhn

18 days

Just two years ago, our smartest models could barely solve the easiest competitive programming problems. Last week, our latest reasoning models achieved a gold medal score at the International Olympiad in Informatics. Competitive programming is one of the cleanest examples of…

Sheryl Hsu

@SherylHsu02

19 days

1/n I’m thrilled to share that our @OpenAI reasoning system scored high enough to achieve gold 🥇🥇 in one of the world’s top programming competitions - the 2025 International Olympiad in Informatics (IOI) - placing first among AI participants! 👨‍💻👨‍💻

12

8

149

Lorenz Kuhn

@_lorenzkuhn

1 month

RT @MilesKWang: IMO gold is a win for scaling ~nearly~ superhuman oversight on a fuzzy, hard-to-verify RL domain.

0

3

0

Lorenz Kuhn

@_lorenzkuhn

1 month

It was thrilling to watch AI compete against some of the best human competitive programmers at AtCoder World Finals Heuristics yesterday. Check out @andresnds's thread on how the AI solutions improved throughout the 10h contest. Congrats to @FakePsyho on 1st place!

Andre Saraiva

@andresnds

1 month

1/N Yesterday in Tokyo we @OpenAI ran a 10‑hour live Humans vs AI exhibition at the AtCoder World Tour Finals Heuristic. We pointed an OpenAI reasoning model at the same brutal problem the finalists tackled—no human help, same rules, same clock. Buckle up. 👇

1

3

48

Lorenz Kuhn

@_lorenzkuhn

1 month

RT @ahelkky: Congratulations @FakePsyho on a nail-biting performance! Great showings as well from @bminaiev, @andresnds, and @_lorenzkuhn r…

0

4

0

Lorenz Kuhn

@_lorenzkuhn

7 months

Two important points from our new technical report:
1. Scaling continues to work and the bitter lesson still holds.
2. Recent AI models are strong at reasoning tasks and are rapidly becoming stronger: 4o was released less than a year ago, o1 less than six months ago.

Ahmed El-Kishky

@ahelkky

7 months

11/ Since competitive programming is just one facet of coding, o3 contributors also evaluated models on software engineering tasks. While there’s still a long way to go, it’s clear that learning to reason through RL improves SWE capabilities.

0

7

Lorenz Kuhn

@_lorenzkuhn

1 year

i generally feel super grateful that i get to work with such exceptionally skilled and kind people on reasoning research. the sprint for IOI in particular was special though. IOI 2024 gold @ 10k submissions; 49th percentile of competitors under real contest conditions.

Mark Chen

@markchen90

1 year

As a coach for the US IOI team, I’ve been motivated for a long time to create models which can perform at the level of the most elite competitors in the world. Check out our research blog post - with enough samples, we achieve gold medal performance on this year’s IOI and ~14/15.

0

8

Lorenz Kuhn

@_lorenzkuhn

1 year

RT @polynoamial: Today, I’m excited to share with you all the fruit of our effort at @OpenAI to create AI models capable of truly general r…

0

2K

0

Lorenz Kuhn

@_lorenzkuhn

1 year

very excited about these models helping people solve hard problems and proud of the work we did. give the new models a try!

OpenAI

@OpenAI

1 year

We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond. These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

1

0

16

Lorenz Kuhn

@_lorenzkuhn

1 year

RT @MillionInt: We trained a model and it is good in some things.

0

46

0

Lorenz Kuhn

@_lorenzkuhn

1 year

RT @LiamFedus: But the ELO can ultimately become bounded by the difficulty of the prompts (i.e. can’t achieve arbitrarily high win rates on…

0

92

0

Lorenz Kuhn

@_lorenzkuhn

2 years

rainy day in sf.

2

0

6

Lorenz Kuhn

@_lorenzkuhn

2 years

RT @ajeya_cotra: Excellent post by @JacobSteinhardt trying to forecast the abilities of models that could be trained in 2030: https://t.co/…

bounded-regret.ghost.io

GPT-4 surprised many people with its abilities at coding, creative brainstorming, letter-writing, and other skills. How can we be less surprised by developments in machine learning? In this post,...

0

8

0

Lorenz Kuhn

@_lorenzkuhn

2 years

RT @anndvision: new preprint. "ReLU to the Rescue: Improve your On-policy Actor-Critic with Positive Advantages". shockingly simple changes…

0

17

0

Lorenz Kuhn

@_lorenzkuhn

2 years

RT @seb_far: The Google DeepMind alignment team is looking for research scientists and research engineers to help us work towards safe AGI…

0

3

0

Lorenz Kuhn

@_lorenzkuhn

2 years

Mark Chen

@markchen90

2 years

less is generally more for alignment, but not for capabilities.

0

Lorenz Kuhn

@_lorenzkuhn

2 years

RT @DeepMind: With more powerful AI systems comes more responsibility to identify novel capabilities in models. 🔍 Our new research looks a…

0

169

0

Lorenz Kuhn

@_lorenzkuhn

2 years

Also, finetuning on this scale barely affects the model performance on these benchmarks, see e.g. Llama 7B vs Alpaca 7B.

1

0

1

Lorenz Kuhn

@_lorenzkuhn

2 years

Eval results:

1

0

1

Lorenz Kuhn

@_lorenzkuhn

2 years

The OS/small model + finetuning approach might be good enough for many applications? How well does academic benchmark perf correlate with human preferences over generations in different settings? The self-instruct human eval might not be sensitive enough to what we care about?

1

0

2