Kyle Montgomery Profile
Kyle Montgomery

@kylepmont

Followers
39
Following
0
Media
7
Statuses
14

PhD student at UC Santa Cruz

Santa Cruz, CA
Joined January 2016
@kylepmont
Kyle Montgomery
10 days
Huge thanks to our amazing collaborators @sijun_tan, @YuqiChen74741, @SiyuanZhuang3, @tianjun_zhang, @ralucaadapopa, @ChenguangWang, and the @rllm_project team.
0
0
0
@kylepmont
Kyle Montgomery
10 days
⏱️ From the latency perspective, the comparison is even more stark. For example, verifying 32 solutions with a 1.5B discriminative verifier is ~1000x faster than generative verification (1.66s vs 1711.8s). Under inference budgets below 22.5 minutes, hybrid discriminative
1
0
0
@kylepmont
Kyle Montgomery
10 days
⚙️ Under practical compute budgets, hybrid discriminative verification beats generative verification. On AIME2025, at 5 × 10¹⁵ and 1 × 10¹⁶ FLOPs, hybrid techniques outperform generative verification by 6.1% and 2.5%, respectively. The reason is straightforward:
1
0
0
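One plausible reading of the "hybrid" verification mentioned above is a verifier-weighted majority vote: instead of trusting the single highest-scored sample, sum the discriminative verifier's scores over samples that share a final answer and return the answer with the most weighted mass. This is a sketch under that assumption, not necessarily the paper's exact method.

```python
# Hypothetical sketch of hybrid verification as verifier-weighted voting.
# The data shape (answer, score) pairs is an illustrative assumption.
from collections import defaultdict
from typing import List, Tuple

def weighted_vote(scored: List[Tuple[str, float]]) -> str:
    """scored: (final_answer, verifier_score) for each sampled solution.
    Returns the answer with the highest total verifier-weighted mass."""
    totals = defaultdict(float)
    for answer, score in scored:
        totals[answer] += score
    return max(totals, key=totals.get)
```

Compared with plain majority voting, this lets a few high-confidence samples outweigh many low-confidence ones while still pooling agreement across samples.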
@kylepmont
Kyle Montgomery
10 days
🧩 Test-time scaling improves reasoning by allocating more compute at inference, such as by sampling several candidate solutions and having a verifier select the best. While generative verifiers often outperform discriminative verifiers, they can require upwards of 1000x more
1
0
0
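The best-of-N pattern described in the tweet above (sample several candidates, let a verifier pick) can be sketched as follows. The `generate` and `score` callables are stand-ins for a sampling model and a discriminative verifier, not the paper's API.

```python
# Minimal best-of-N test-time scaling sketch with a discriminative verifier.
# `generate` and `score` are hypothetical stubs supplied by the caller.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate solution
    score: Callable[[str, str], float],  # verifier's estimate of correctness
    n: int = 32,
) -> str:
    """Sample n candidate solutions, return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda sol: score(prompt, sol))
```

A discriminative verifier makes `score` a single cheap forward pass per candidate, which is where the latency gap over generative verification comes from.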
@kylepmont
Kyle Montgomery
10 days
🚨 New preprint: Budget-aware Test-time Scaling via Discriminative Verification 👉 https://t.co/6Q26TwA0xw We show that discriminative verification is the best option for test-time scaling under 25.5 minutes, outperforming state-of-the-art generative verification in both
1
3
9
@kylepmont
Kyle Montgomery
13 days
Thrilled to have been a part of this release — looking forward to what’s coming next with rLLM!
@rllm_project
rLLM
13 days
🚀 Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes. Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple
0
2
3
@kylepmont
Kyle Montgomery
3 months
Swing by the 📍KnowFM workshop at #ACL2025 to learn more. Huge thanks 🙏 to David Park, @TuJianhong, @bemikelive, @belizgunel, @dawnsongtweets, and @ChenguangWang.
0
0
2
@kylepmont
Kyle Montgomery
3 months
Finally, we verify that our fits extrapolate well to out-of-distribution amounts of compute and context, showcasing the usefulness of our method for long-context scaling experiments.
1
0
1
@kylepmont
Kyle Montgomery
3 months
We achieve strong fits on tasks like arithmetic reasoning and common sense reasoning. Additionally, we observe that LLMs make better use of context on arithmetic reasoning tasks than on common sense reasoning tasks.
1
0
0
@kylepmont
Kyle Montgomery
3 months
We devise a novel pipeline to collect data with differing amounts of context, then fit our scaling equation on this data using an approach that combines global and local search.
1
0
0
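The "global and local search" fitting strategy mentioned above can be sketched as a coarse random global search that seeds a local refinement loop. The scaling-law form below (a · C^(−b) + c) is an illustrative assumption, not the paper's actual equation, and the search hyperparameters are arbitrary.

```python
# Hedged sketch of global + local search for fitting a toy scaling law.
# Functional form and search box are assumptions for illustration only.
import random

def sse(params, data):
    """Sum of squared errors of a * C**-b + c against (C, y) pairs."""
    a, b, c = params
    return sum((a * C ** -b + c - y) ** 2 for C, y in data)

def fit(data, iters_global=2000, iters_local=2000, seed=0):
    rng = random.Random(seed)
    # Global phase: random search over a broad parameter box.
    best = min(
        ((rng.uniform(0, 10), rng.uniform(0, 2), rng.uniform(0, 5))
         for _ in range(iters_global)),
        key=lambda p: sse(p, data),
    )
    # Local phase: shrinking Gaussian perturbations around the incumbent,
    # accepting a candidate only when it lowers the error.
    step = 0.5
    for _ in range(iters_local):
        cand = tuple(x + rng.gauss(0, step) for x in best)
        if cand[1] > 0 and sse(cand, data) < sse(best, data):
            best = cand
        step *= 0.999
    return best
```

The global phase avoids getting trapped by a bad initialization; the local phase polishes the incumbent, mirroring the combined strategy the tweet describes.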
@kylepmont
Kyle Montgomery
3 months
Excited to share our latest work at KnowFM at #ACL2025. "Predicting Task Performance with Context-aware Scaling Laws" models performance on downstream tasks as a function of training compute and context length – ✅ simple, ✅ interpretable, and ✅ effective.
1
4
6
@kylepmont
Kyle Montgomery
6 months
Excited to share our work at #ICLR2025! JudgeBench ⚖️ tests the reliability of LLM-based judges with a focus on objective correctness. JudgeBench converts tough 🧠 datasets in knowledge, reasoning, math & code into labeled response pairs, forcing objective grading over vibes.
0
1
4
@sijun_tan
Sijun Tan
1 year
Introducing JudgeBench – the ultimate benchmark designed to push LLM-based judges to their limits! 🚀 ❓Why do we need a new benchmark for LLM-based judges? As LLMs continue to evolve, their responses become more complex, demanding stronger judges to assess them accurately.
arxiv.org
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges...
0
10
17