Gal Yona
@_galyo
Followers
499
Following
1K
Media
32
Statuses
300
Research scientist @googleai, previously CS PhD @weizmannscience
Joined October 2009
Switched dinner tonight to the tiny table that looks just like the one at my kid's kindergarten. instantly got the "kindergarten" version of the dining experience: he was super independent & way more chill and well behaved. I feel like I finally get persona prompting for LLMs
0
0
8
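Persona prompting, for reference, just pins a role down before the conversation starts, and the familiar framing shapes everything downstream, much like the familiar tiny table. A minimal sketch in the common chat-message format (the helper name and persona string are illustrative, not from any specific API):

```python
def with_persona(persona: str, user_msg: str) -> list[dict]:
    # A system message fixes the model's role up front, so every
    # subsequent turn is generated "in character".
    return [
        {"role": "system", "content": f"You are {persona}. Stay in character."},
        {"role": "user", "content": user_msg},
    ]

messages = with_persona(
    "a patient kindergarten teacher",
    "We're done with dinner. What happens next?",
)
```

The resulting `messages` list can be passed to any chat endpoint that accepts OpenAI-style role/content pairs.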
unpopular opinion (?): text outputs (like the one below 🤯) can excite, intrigue and move me in ways that no fancy Nano Banana generated image will ever be able to do. as a modality, text is just a gazillion times more interesting than images.
0
0
0
This is not a particularly good take and is indicative of a fundamental misunderstanding of what a top-tier technical college education is supposed to offer. Preparing to understand modern AI as a Harvard or Stanford undergrad is not about learning "prompt engineering", vibe
Harvard and Stanford students tell me their professors don't understand AI and the courses are outdated. If elite schools can't keep up, the credential arms race is over. Self-learning is the only way now.
48
163
2K
If you're working on factuality in LLMs, please check out our release of SimpleQA Verified
- a new & improved benchmark for reliably measuring progress in short-form factuality!
We challenged ourselves to build the cleanest, highest-signal factuality benchmark out there. Today, we're releasing the result: SimpleQA Verified
🔥 On this more reliable, 1,000-prompt eval, Gemini 2.5 Pro establishes a new SOTA, outperforming other frontier models. We're…
0
0
1
New Benchmark Launch: SimpleQA Verified! We've partnered with @GoogleDeepMind and @GoogleResearch to launch a curated 1,000-prompt benchmark designed to provide a more reliable and challenging evaluation of LLM short-form factuality. Check out the leaderboard here:
3
8
124
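SimpleQA-style evals ask short factual questions and score the answers; the released benchmark grades with an LLM autorater (correct / incorrect / not attempted), but the scoring idea can be conveyed with plain normalized exact match. A simplified sketch, not the actual grader:

```python
import re

def normalize(ans: str) -> str:
    # Lowercase, drop articles and punctuation, collapse whitespace --
    # the usual short-answer normalization.
    ans = re.sub(r"\b(a|an|the)\b", " ", ans.lower())
    ans = re.sub(r"[^\w\s]", "", ans)
    return " ".join(ans.split())

def exact_match_score(preds: list[str], golds: list[str]) -> float:
    # Fraction of predictions matching the gold answer after normalization.
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

# One correct, one wrong -> 0.5
score = exact_match_score(["The Eiffel Tower", "Paris"], ["Eiffel Tower", "London"])
```

An LLM autorater replaces the string match with a judgment call, which is exactly why benchmark verification (deduplication, answer re-checking) matters for signal quality.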
really wish more talks in CS/ML were like this (surely seminars, maybe confs?). It's quite obvious that over a uniform sample of accepted NeurIPS papers, Pr[results will be insightful] << Pr[hearing about the "behind the scenes" of the project will be insightful] (for me personally)
Imagine going to a seminar and listening to the speaker also talk about how the big idea happened. Join us Sept. 22 for the first talk in the "Night Science Seminar Series" where I'll talk about cellular plasticity and also discuss how the idea came about! https://t.co/NL50oO9gsD
0
0
5
🥳🥳 Happy to share that we have three papers accepted to EMNLP 2025 🇨🇳 (2 main, 1 findings)! What makes this special is that all three belong to a new research line I began last year: LLM-as-a-judge/LLM-as-an-annotator 🤖🧑‍⚖️
2
13
130
+100 for this (surprisingly short) take! "writing the paper" is not something that happens at the END of a research project; it's an integral part of it. personally, blindly offloading that part to an LLM would be the surest way to hurt the quality of my research.
0
1
10
new work by @pybeebee shows that LLMs still struggle to faithfully express their uncertainty in words, but cool to see that metacognition-inspired prompting can go a long way. looking forward to seeing more positive results on this fundamental problem!
🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs 🔥 How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"? Check out our new preprint to find out! Details in 🧵 (1/n):
0
1
2
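The mismatch this thread is about (models saying "definitely" when they are internally unsure) is usually quantified with calibration metrics over verbalized confidences. A minimal expected-calibration-error sketch, a generic metric rather than the paper's actual evaluation code:

```python
def expected_calibration_error(confs, correct, n_bins=10):
    # Bin answers by stated confidence, then average the gap between
    # mean confidence and empirical accuracy, weighted by bin size.
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    n = len(confs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that verbalizes 0.9 confidence on answers it gets right only half the time scores an ECE of 0.4 on those answers, which is the kind of unfaithfulness the prompting interventions try to shrink.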
Wild story: a ChatGPT user had it invent a brand-new law for her so she could win her case at the Hadera Magistrate's Court. The stunned judge wrote: "I've been a judge for 30 years and thought I'd seen everything. Apparently I was wrong."
101
182
2K
we write too much. more than we can read, and too many small incremental things. I think there should be some mechanism to restrict paper submissions and acceptances per person per year, to force people to prioritize their best work, and invest more in it.
🤯 NeurIPS 2025 might break records as the most submitted-to academic conference ever. One of our submission IDs is already ~23,000; the final count could hit 30,000. Absolute madness. #NeurIPS2025 #AI
28
29
610
@sama the single biggest thing you could do for safety/alignment is to put a massive emphasis in the RL feedback loop on basic HONESTY and never misleading, tricking, overstating, exaggerating, etc. It should be like touching a hot stove to the model. Just like how you raise kids
9
4
170
This was a great 30-minute conceptual read. It neatly ties together classic RL, LLMs of the past few years, and where agents are headed next. Honestly, I find the future of agents interacting with the world with less human mediation ("experiencing") both exciting and terrifying
@dsivakumar The short paper "Welcome to the Era of Experience" is literally just released, like this week. Ultimately it will become a chapter in the book 'Designing an Intelligence' edited by George Konidaris and published by MIT Press. https://t.co/Y6m4jLRjnh
0
0
3
[[ for kicks, I asked ChatGPT to rewrite my tweet in MAVERICK style. very useful in truly bringing home the message of how obnoxious this response style truly is!!! ]]
1
0
5
tbc, I'm not saying the benchmark is useless. If you're optimizing purely for likability, it's probably useful to know that the average user enjoys this kind of overly enthusiastic fluff. but it can't be taken seriously as a measure of utility for general-purpose LLMs.
1
0
1
with the battles being public, it's now glaringly obvious that completely unfactual responses can easily win, so long as they're delivered in an aggressively upbeat tone and cheerfully long-winded style.
1
0
2
my completely personal take: Llama-4 blatantly gaming the Chatbot Arena evals (beyond being a neat example of Goodhart's law in action!) is an important moment for the NLP community
This is the clearest evidence that no one should take these rankings seriously. In this example it's super yappy and factually inaccurate, and yet the user voted for Llama 4. The rest aren't any better.
1
1
8
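Why yappy-but-wrong answers can climb the leaderboard is mechanical: Arena rankings are fit from pairwise human votes (Chatbot Arena fits a Bradley-Terry model over them; the classic online Elo update below is the standard approximation), and the update only sees who won, never whether the winning answer was factual:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    # Expected score of the winner under the logistic (Elo) model.
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)  # points transferred by this single vote
    return r_winner + delta, r_loser - delta

# Two equally rated models: a single upbeat-but-wrong win moves k/2 = 16 points.
new_w, new_l = elo_update(1000.0, 1000.0)
```

Under Goodhart's law, any stylistic trick that reliably wins votes inflates the rating exactly as much as genuinely better answers would.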
.@percyliang & @tatsu_hashimoto start the 2nd offering of CS336 Language Modeling from Scratch at @stanfordnlp. The class philosophy is Understanding by Building. We need many people who understand the detailed design of modern LLMs, not just a few at "frontier" 🤭 AI companies.
9
33
242