Kyle Montgomery Profile
Kyle Montgomery

@kylepmont

Followers
39
Following
0
Media
7
Statuses
14

PhD student at UC Santa Cruz

Santa Cruz, CA
Joined January 2016
@kylepmont
Kyle Montgomery
10 days
Huge thanks to our amazing collaborators @sijun_tan, @YuqiChen74741, @SiyuanZhuang3, @tianjun_zhang, @ralucaadapopa, @ChenguangWang, and the @rllm_project team.
0
0
0
@kylepmont
Kyle Montgomery
10 days
⏱️ From the latency perspective, the comparison is even more stark. For example, verifying 32 solutions with a 1.5B discriminative verifier is ~1000x faster than generative verification (1.66s vs 1711.8s). Under inference budgets below 22.5 minutes, hybrid discriminative
1
0
0
@kylepmont
Kyle Montgomery
10 days
⚙️ Under practical compute budgets, hybrid discriminative verification beats generative verification. On AIME2025, at 5 × 10¹⁵ and 1 × 10¹⁶ FLOPs, hybrid techniques outperform generative verification by 6.1% and 2.5%, respectively. The reason is straightforward:
1
0
0
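One plausible reading of the "hybrid" verification mentioned above is a verifier-weighted majority vote: instead of trusting the single highest-scored sample, sum the discriminative verifier's scores over samples that share a final answer and return the answer with the most weighted mass. This is a sketch under that assumption, not necessarily the paper's exact method.

```python
# Hypothetical sketch of hybrid verification as verifier-weighted voting.
# The data shape (answer, score) pairs is an illustrative assumption.
from collections import defaultdict
from typing import List, Tuple

def weighted_vote(scored: List[Tuple[str, float]]) -> str:
    """scored: (final_answer, verifier_score) for each sampled solution.
    Returns the answer with the highest total verifier-weighted mass."""
    totals = defaultdict(float)
    for answer, score in scored:
        totals[answer] += score
    return max(totals, key=totals.get)
```

Compared with plain majority voting, this lets a few high-confidence samples outweigh many low-confidence ones while still pooling agreement across samples.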
@kylepmont
Kyle Montgomery
10 days
🧩 Test-time scaling improves reasoning by allocating more compute at inference, such as by sampling several candidate solutions and having a verifier select the best. While generative verifiers often outperform discriminative verifiers, they can require upwards of 1000x more
1
0
0
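The best-of-N pattern described in the tweet above (sample several candidates, let a verifier pick) can be sketched as follows. The `generate` and `score` callables are stand-ins for a sampling model and a discriminative verifier, not the paper's API.

```python
# Minimal best-of-N test-time scaling sketch with a discriminative verifier.
# `generate` and `score` are hypothetical stubs supplied by the caller.
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate solution
    score: Callable[[str, str], float],  # verifier's estimate of correctness
    n: int = 32,
) -> str:
    """Sample n candidate solutions, return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda sol: score(prompt, sol))
```

A discriminative verifier makes `score` a single cheap forward pass per candidate, which is where the latency gap over generative verification comes from.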
@kylepmont
Kyle Montgomery
10 days
🚨 New preprint: Budget-aware Test-time Scaling via Discriminative Verification 👉 https://t.co/6Q26TwA0xw We show that discriminative verification is the best option for test-time scaling under 25.5 minutes, outperforming state-of-the-art generative verification in both
1
3
9
@kylepmont
Kyle Montgomery
13 days
Thrilled to have been a part of this release — looking forward to what’s coming next with rLLM!
@rllm_project
rLLM
13 days
🚀 Introducing rLLM v0.2 - train arbitrary agentic programs with RL, with minimal code changes. Most RL training systems adopt the agent-environment abstraction. But what about complex workflows? Think solver-critique pairs collaborating, or planner agents orchestrating multiple
0
2
3
@kylepmont
Kyle Montgomery
3 months
Swing by the 📍KnowFM workshop at #ACL2025 to learn more. Huge thanks 🙏 to David Park, @TuJianhong, @bemikelive, @belizgunel, @dawnsongtweets, and @ChenguangWang.
0
0
2
@kylepmont
Kyle Montgomery
3 months
Finally, we verify that our fits extrapolate well to out-of-distribution amounts of compute and context, showcasing the usefulness of our method for long-context scaling experiments.
1
0
1
@kylepmont
Kyle Montgomery
3 months
We achieve strong fits on tasks like arithmetic reasoning and common sense reasoning. Additionally, we observe that LLMs make better use of context on arithmetic reasoning tasks than on common sense reasoning tasks.
1
0
0
@kylepmont
Kyle Montgomery
3 months
We devise a novel pipeline to collect data with differing amounts of context, then fit our scaling equation on this data using an approach that combines global and local search.
1
0
0
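The "global and local search" fitting strategy mentioned above can be sketched as a coarse random global search that seeds a local refinement loop. The scaling-law form below (a · C^(−b) + c) is an illustrative assumption, not the paper's actual equation, and the search hyperparameters are arbitrary.

```python
# Hedged sketch of global + local search for fitting a toy scaling law.
# Functional form and search box are assumptions for illustration only.
import random

def sse(params, data):
    """Sum of squared errors of a * C**-b + c against (C, y) pairs."""
    a, b, c = params
    return sum((a * C ** -b + c - y) ** 2 for C, y in data)

def fit(data, iters_global=2000, iters_local=2000, seed=0):
    rng = random.Random(seed)
    # Global phase: random search over a broad parameter box.
    best = min(
        ((rng.uniform(0, 10), rng.uniform(0, 2), rng.uniform(0, 5))
         for _ in range(iters_global)),
        key=lambda p: sse(p, data),
    )
    # Local phase: shrinking Gaussian perturbations around the incumbent,
    # accepting a candidate only when it lowers the error.
    step = 0.5
    for _ in range(iters_local):
        cand = tuple(x + rng.gauss(0, step) for x in best)
        if cand[1] > 0 and sse(cand, data) < sse(best, data):
            best = cand
        step *= 0.999
    return best
```

The global phase avoids getting trapped by a bad initialization; the local phase polishes the incumbent, mirroring the combined strategy the tweet describes.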
@kylepmont
Kyle Montgomery
3 months
Excited to share our latest work at KnowFM at #ACL2025. "Predicting Task Performance with Context-aware Scaling Laws" models performance on downstream tasks as a function of training compute and context length – ✅ simple, ✅ interpretable, and ✅ effective.
1
4
6
@kylepmont
Kyle Montgomery
6 months
Excited to share our work at #ICLR2025! JudgeBench ⚖️ tests the reliability of LLM-based judges with a focus on objective correctness. JudgeBench converts tough 🧠 datasets in knowledge, reasoning, math & code into labeled response pairs, forcing objective grading over vibes.
0
1
4
@sijun_tan
Sijun Tan
1 year
Introducing JudgeBench – the ultimate benchmark designed to push LLM-based judges to their limits! 🚀 ❓Why do we need a new benchmark for LLM-based judges? As LLMs continue to evolve, their responses become more complex, demanding stronger judges to assess them accurately.
arxiv.org
LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges...
0
10
17