Hamed Mahdavi (@HamedMahdavi93)
Ph.D. Student at Pennsylvania State University
Pennsylvania, USA · Joined April 2019
Followers: 662 · Following: 4K · Media: 27 · Statuses: 422
(1/n) Until recently, even strong LLMs struggled with USAMO/IMO problems. This year, specific model variants from Google and OpenAI were reported to solve 5/6 IMO problems. In our recent work, we asked a natural follow-up question: Can we grade proofs fairly with partial credit using
1 reply · 22 reposts · 96 likes
🚀 Now in PRA! Lindblad dynamics normally demands huge circuit depth. We run multiple shallow Lindblad simulations (Kraus form or dilated Hamiltonians) and extrapolate. This yields polylog depth, Gevrey smoothness, and rigorous bias–variance guarantees. https://t.co/IudeayTt4W
0 replies · 1 repost · 2 likes
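A minimal sketch of the extrapolation idea in the tweet above, not the paper's actual estimator: run several cheap, shallow simulations at different step sizes, then polynomial-extrapolate the observable to the zero-step-size limit. The step sizes, the bias model in the toy data, and the function name are illustrative assumptions.

```python
import numpy as np

def extrapolate_to_zero_step(step_sizes, estimates, order=2):
    """Fit a low-order polynomial in the step size and read off the
    value at step size 0, i.e. the deep-circuit limit estimated from
    several shallow runs (hypothetical helper, for illustration only)."""
    coeffs = np.polyfit(step_sizes, estimates, deg=order)
    return np.polyval(coeffs, 0.0)  # bias-reduced, extrapolated estimate

# Toy usage: pretend the shallow-simulation bias grows linearly + quadratically
# in the step size, plus a little shot noise.
rng = np.random.default_rng(0)
true_value = 0.30
steps = np.array([0.4, 0.3, 0.2, 0.1])
noisy = true_value + 0.5 * steps + 0.8 * steps**2 + rng.normal(0, 0.01, steps.size)
print(extrapolate_to_zero_step(steps, noisy))  # ≈ 0.30
```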
What are the best data curation and synthetic data works you’ve seen this year at NeurIPS? Share them with me.
0 replies · 0 reposts · 4 likes
We arrived just 20 minutes before they closed the gates for our connecting flight.
0 replies · 0 reposts · 1 like
We keep scaling model parameters by increasing width and stacking more layers, but what if the truly missing axes for continual learning are compressing and stacking the learning process itself? Excited to share the full version of Nested Learning, a new paradigm for continual learning
28 replies · 150 reposts · 965 likes
Had a blast talking about privacy and agentic AI at the @farairesearch alignment workshop! 1. Stop worrying about memorization as a privacy concern 2. Optimizing for math and coding tasks is NOT going to give us models that are better for *humans*! (See graph!) Slides ⬇️
2 replies · 11 reposts · 212 likes
Don't do Best-of-N, do Majority-of-Bests! Check out this nice work by @AminRakhsha, @AmirKhasahmadi, and @SoloGen.
We are presenting our paper on test-time compute at #NeurIPS2025 🤔Running Best-of-N 1000 times and picking the most frequent answer works better than a single BoN. We make it cheap✨ Don't generate new outputs for each run. Sample with replacement from the existing ones! 🧵
0 replies · 1 repost · 12 likes
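A rough sketch of the Majority-of-Bests recipe as described in the tweets above (not the authors' code): keep a fixed pool of generated answers with reward-model scores, bootstrap-resample size-n subsets with replacement, record each subset's Best-of-n winner, and return the most frequent winner. The function name, the reward interface, and the toy data are assumptions.

```python
import random
from collections import Counter

def majority_of_bests(answers, rewards, n=16, num_runs=1000, seed=0):
    """Majority-of-Bests, sketched: instead of generating fresh samples for
    every Best-of-N run, resample (with replacement) size-n subsets from the
    existing pool, take each subset's highest-reward answer, and return the
    most frequent of those winners."""
    rng = random.Random(seed)
    pool = list(zip(answers, rewards))
    winners = []
    for _ in range(num_runs):
        subset = rng.choices(pool, k=n)                      # bootstrap resample
        best_answer, _ = max(subset, key=lambda ar: ar[1])   # Best-of-n winner
        winners.append(best_answer)
    return Counter(winners).most_common(1)[0][0]             # majority of the bests

# Hypothetical usage: candidate answers and reward-model scores for one question.
answers = ["42", "41", "42", "43", "42", "40"]
rewards = [0.80, 0.90, 0.70, 0.85, 0.95, 0.20]
print(majority_of_bests(answers, rewards, n=3, num_runs=200))
```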
It’s snowing on the East Coast right now, but I’ll be in San Diego soon for NeurIPS! I work on reasoning, synthetic data, and agentic workflows for reasoning. I’m open to jobs, internships, and collaborations. Always happy to chat, whether in person or via DM😎
0 replies · 0 reposts · 6 likes
We are presenting our paper on test-time compute at #NeurIPS2025 🤔Running Best-of-N 1000 times and picking the most frequent answer works better than a single BoN. We make it cheap✨ Don't generate new outputs for each run. Sample with replacement from the existing ones! 🧵
1 reply · 10 reposts · 22 likes
Our new paper on LLM test-time computation! #NeurIPS2025 Majority-of-the-Bests (MoB) improves Best-of-N with negligible CPU cost. Check it out!
We are presenting our paper on test-time compute at #NeurIPS2025 🤔Running Best-of-N 1000 times and picking the most frequent answer works better than a single BoN. We make it cheap✨ Don't generate new outputs for each run. Sample with replacement from the existing ones! 🧵
0 replies · 3 reposts · 10 likes
🚨🚨New blog post led by CMU students: Want to know why LLM RL training plateaus on hard problems & scaling compute may not help? And how to fix this issue? Turns out it stems from a coupling of poor exploration & optimization. Classical ways to explore don't work, but ours
6 replies · 44 reposts · 248 likes
Meet DR Tulu: our open deep-research agent built for long-form, open-ended deep research tasks, trained with our new RLER method. DR Tulu rivals or even beats proprietary deep-research systems like Perplexity and OpenAI's on several benchmarks.
Today we’re releasing Deep Research Tulu (DR Tulu)—the first fully open, end-to-end recipe for long-form deep research, plus an 8B agent you can use right away. Train agents that plan, search, synthesize, & cite across sources, making expert research more accessible. 🧭📚
2 replies · 8 reposts · 38 likes
Last year, AlphaProof & AlphaGeometry reached a key landmark in AI by achieving silver medal level performance at the International Math Olympiad. Today, @Nature is publishing the methodology behind our amazing agent AlphaProof! @GoogleDeepMind Paper:
Nature: Olympiad-level formal mathematical reasoning with reinforcement learning (nature.com)
8 replies · 84 reposts · 439 likes
🚀 Announcing GroundCUA, a high-quality dataset for grounding computer-use agents. With over 3M expert annotations spanning 87 desktop apps, we use our new dataset to train state-of-the-art grounding models, namely GroundNext-3B and GroundNext-7B. 👇 Thread
5 replies · 31 reposts · 81 likes
Computer-use agents don’t touch the UI anymore; they do the high-level planning and call a "grounding" agent to click and type. @aarashfeizi et al. proposed a recipe for creating SOTA grounding agents, from data collection to RL pipeline design. Check it out.
🚀 Announcing GroundCUA, a high-quality dataset for grounding computer-use agents. With over 3M expert annotations spanning 87 desktop apps, we use our new dataset to train state-of-the-art grounding models, namely GroundNext-3B and GroundNext-7B. 👇 Thread
0 replies · 1 repost · 17 likes
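A toy sketch of the planner/grounder split described in the tweet above, not the GroundCUA or GroundNext API: the planner decides *what* to interact with in natural language, and a separate grounding model resolves each target to screen coordinates. Every class and function name here is hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple

@dataclass
class Click:
    x: int
    y: int
    description: str  # the UI element the planner asked for

class Grounder(Protocol):
    """Maps a natural-language target (e.g. 'the Save button') plus a screenshot
    to pixel coordinates. A GroundNext-style model would plug in here."""
    def locate(self, screenshot: bytes, target: str) -> Tuple[int, int]: ...

class DummyGrounder:
    """Stand-in for a real grounding model; always returns a fixed location."""
    def locate(self, screenshot: bytes, target: str) -> Tuple[int, int]:
        return (100, 200)

def execute_plan(plan: list, screenshot: bytes, grounder: Grounder) -> list:
    """The planner emits high-level UI targets; the grounder turns each into a click."""
    actions = []
    for target in plan:
        x, y = grounder.locate(screenshot, target)
        actions.append(Click(x, y, target))
    return actions

print(execute_plan(["the File menu", "the Save button"], b"", DummyGrounder()))
```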
LLM-generated reviews be like: "The paper presents a self-driving car. However, a key limitation is that it does not fly."
0 replies · 0 reposts · 8 likes
I'm really excited about our new paper!! 📣 'Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs' Contrary to the belief that RL fine-tuning degrades memorized knowledge, RL-enhanced models consistently outperform base/SFT on knowledge recall by 24pp! RL teaches
18 replies · 50 reposts · 421 likes
Btw, I got this idea from James Martens' work, and I suggest reading it to understand how working optimizers for deep neural networks are developed: https://t.co/Oo4vZLY7cq. James Martens was also a contributor to probably one of the first large NNs ever trained at
0 replies · 1 repost · 2 likes