James Zhao Profile

James Zhao (@xu_Zhao0)
PhD candidate @ NUS
Singapore · Joined May 2023
Followers: 8 · Following: 46 · Media: 5 · Statuses: 10
James Zhao @xu_Zhao0 · 3 months
🚨 Models like GPT-5-mini and Gemini 2.5 Flash hallucinate more with longer thinking! 🤔 Longer thinking encourages the model to attempt more questions. It also induces confirmation bias, resulting in overconfident hallucinations. Check out https://t.co/nsVArMooaZ #LLM
Hui Chen @chchenhui · 5 months
🤖 How well can AI agents conduct open-ended machine learning research? 🚀 Excited to share our latest #AI4Research benchmark, MLR-Bench, for evaluating AI agents on exactly that! 📈 https://t.co/qTtVY4YPlh 1/
James Zhao @xu_Zhao0 · 3 months
To summarize, our findings highlight the limitations of current test-time scaling approaches for knowledge-intensive tasks. While enabling “thinking” can be helpful, allocating more test-time computation is not yet a reliable way to improve factual robustness in LLMs. (6/6)
James Zhao @xu_Zhao0 · 3 months
Finally, compared to a non-thinking baseline, we find that enabling the model to “think” still offers benefits. It improves accuracy, especially on tasks requiring multi-hop reasoning (e.g., FRAMES). It also reduces hallucinations for most models, though not for Gemini 2.5 Flash. (5/N)
James Zhao @xu_Zhao0 · 3 months
🧐 By examining the reasoning traces of gpt-oss-20b, we find that longer thinking can induce confirmation bias, where the model fabricates information to support its initial belief. For Gemini 2.5 Flash, incomplete reasoning often results in abstention. (4/N)
James Zhao @xu_Zhao0 · 3 months
Specifically, fewer hallucinations often result from the model choosing to abstain after thinking more. More hallucinations come from the model attempting previously unanswered questions with longer reasoning. (3/N)
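The abstain-versus-attempt mechanism in (3/N) lends itself to a simple tally. Below is a toy Python sketch, with hypothetical outcome labels rather than the paper's data, of how one might count hallucinations removed by abstention against hallucinations added by newly attempted questions.

```python
# Toy sketch (hypothetical labels, not the paper's data) of the bookkeeping
# behind (3/N): label each question's outcome at a short and a long thinking
# budget, then count the transitions between outcome categories.
from collections import Counter

# Outcome per question: "correct", "hallucinated", or "abstained".
short_budget = {"q1": "hallucinated", "q2": "abstained", "q3": "correct", "q4": "abstained"}
long_budget = {"q1": "abstained", "q2": "hallucinated", "q3": "correct", "q4": "hallucinated"}

transitions = Counter((short_budget[q], long_budget[q]) for q in short_budget)

# Hallucinations removed: the model now abstains where it used to hallucinate.
removed = transitions[("hallucinated", "abstained")]
# Hallucinations added: the model now attempts, and gets wrong, questions it
# previously left unanswered.
added = transitions[("abstained", "hallucinated")]

print(f"removed via abstention: {removed}")  # 1
print(f"added via new attempts: {added}")    # 2
print(f"net change: {added - removed}")      # +1 hallucination
```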
James Zhao @xu_Zhao0 · 3 months
Test-time scaling sometimes leads to fewer hallucinations (Grok-3 mini, DeepSeek-R1-Distill-Qwen-14B), and sometimes more hallucinations (GPT-5 mini, o3-mini, gpt-oss-20b, Gemini 2.5 Flash). Hallucination changes are largely driven by the model’s willingness to answer. (2/N)
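To see why willingness to answer dominates, here is a back-of-the-envelope Python sketch with made-up numbers: holding per-attempt accuracy fixed, raising the response rate alone raises the hallucination count.

```python
# Back-of-the-envelope illustration (made-up numbers): if per-attempt accuracy
# stays flat, answering more questions mechanically produces more hallucinations.
def hallucination_count(total: int, response_rate: float, accuracy: float) -> float:
    """Expected wrong answers = attempted questions x error rate on attempts."""
    attempted = total * response_rate
    return attempted * (1.0 - accuracy)

# Same 70% accuracy on attempted questions, higher willingness to answer:
print(hallucination_count(1000, 0.60, 0.70))  # shorter thinking -> ~180
print(hallucination_count(1000, 0.85, 0.70))  # longer thinking  -> ~255
```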
James Zhao @xu_Zhao0 · 3 months
❓ Is test-time scaling in reasoning models effective for knowledge-intensive tasks? 🧵 We evaluate 12 reasoning models under increased test-time computation. Results show that more thinking does not consistently improve accuracy or reduce hallucinations for most models. (1/N)
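For a sense of the experimental knob involved, here is a minimal sketch using the OpenAI Python SDK; the model name, the sample question, and the choice of `reasoning_effort` as the test-time compute dial are illustrative assumptions, not the thread's exact protocol.

```python
# Minimal sketch (not the authors' code): ask the same knowledge question at
# increasing test-time compute budgets and compare the answers.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str, effort: str) -> str:
    """Query a reasoning model at a given thinking budget."""
    resp = client.chat.completions.create(
        model="o3-mini",          # placeholder reasoning model
        reasoning_effort=effort,  # "low" | "medium" | "high"
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

question = "In which year was the National University of Singapore founded?"
for effort in ("low", "medium", "high"):
    print(f"[{effort}] {answer(question, effort)}")
```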