Alan Li
@alanli2020
Followers 117 · Following 145 · Media 5 · Statuses 63
PhD @YaleNLP | undergrad @UWcse @nlpnoah | interned at @kotoba_tech
Joined November 2021
Thank you @kotoba_tech and special thanks to @jungokasai and @noriyuki_kojima! Wonderful and rewarding experience in Tokyo for the summer, surrounded by such a passionate team of talented engineers. Always excited about Kotoba's next release and look forward to keeping in touch!
Kotoba's former intern, Alan Li (@alanli2020), is starting his CS PhD at @Yale. Best of luck on your PhD journey, and we'll stay in touch!
I will be attending #EMNLP2025 this week to present LiteASR, a compression method for speech encoders (a collaborative work with @kotoba_tech). Catch our poster at the first poster session on Wednesday morning. Happy to chat about efficiency, speech, or both!
🚀 Presenting LiteASR: a method that cuts the compute cost of speech encoders by 2x, leveraging low-rank approximation of activations. LiteASR is accepted to #EMNLP2025 (main) @emnlpmeeting
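The core idea behind activation-aware low-rank compression can be illustrated in a few lines of numpy. This is not the paper's exact algorithm, just a minimal sketch: use calibration activations to find the subspace they actually occupy, then replace one large matmul with two smaller ones through that subspace. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def low_rank_factorize(W, X, rank):
    """Approximate a linear layer y = X @ W with two smaller matmuls,
    using the principal subspace of the calibration activations X.

    W: (d_in, d_out) weight matrix
    X: (n, d_in) calibration activations
    Returns (A, B) with A: (d_in, rank), B: (rank, d_out),
    so that X @ A @ B ≈ X @ W.
    """
    # Top-`rank` right singular vectors span the activations' principal subspace.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:rank].T        # (d_in, rank) projection into the subspace
    A = P                  # project the input down to `rank` dims
    B = P.T @ W            # push the projection through the original weights
    return A, B

# Toy usage: activations with intrinsic rank 32 through a 256x256 layer.
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 32)) @ rng.standard_normal((32, 256))
W = rng.standard_normal((256, 256))
A, B = low_rank_factorize(W, X, rank=32)
err = np.linalg.norm(X @ A @ B - X @ W) / np.linalg.norm(X @ W)
```

The original layer costs n·d_in·d_out multiply-adds per batch; the factorized version costs n·d_in·r + n·r·d_out, which is roughly a 2x saving when r ≈ d_in/4 and d_in = d_out.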
As PhD students, we believe research automation systems should belong to everyone, not just Google, so we built freephdlabor. Customize your multi-agent system for end-to-end research that WORKS FOR YOUR DOMAIN within hours. full source code: https://t.co/NkiFnLwqVM
Love the thread, thank you Rohan!
New Harvard+Yale paper says strong reasoning helps, but accessing the right knowledge first is what really limits performance. So knowledge recall is the main bottleneck in scientific problem solving with LLMs. They build benchmark suites SCIREAS and SCIREAS-PRO to measure
8/9 This work is a collaboration between YaleNLP @yalenlp and Ai2 @allen_ai. Code/benchmark: 📈 https://t.co/uCVKwpXhvl Paper: 📄 https://t.co/XP8011DqsU Models: 🤗 huggingface.co
7/9 Takeaways:
- Knowledge access remains a bottleneck.
- Reasoning improves knowledge recall, even w/o knowledge injection.
- Best results: reasoners + external knowledge.
- Practitioners: run task-specific evals for cost-efficient large-scale application.
6/9 Finally, learning from our investigation, we release SciLit01, a strong 8B baseline model SFTed from Qwen3-Base using our Math+STEM data composition, which is competitive on scientific reasoning among concurrent reasoning-enhancement SFT efforts.
5/9 We SFT models in a controlled way on different sources of data. Using KRUX we find: (i) Retrieving task-relevant knowledge from parameters is a key bottleneck; (ii) Reasoning-fine-tuned models show complementary gains from explicit knowledge access; (iii) Long CoT
4/9 Next, we design a new framework to study the separate roles of knowledge vs. reasoning: KRUX (Knowledge & Reasoning Utilization eXams). It pulls atomic “knowledge ingredients” (KIs) from reasoning traces. Prepend those KIs to the original question and test another model. KRUX
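The prepend-and-test step described above is just prompt construction. A minimal sketch, assuming a simple list-of-facts format (the actual KRUX prompt template is not specified in the thread, so the wording below is a hypothetical illustration):

```python
def build_krux_prompt(question, knowledge_ingredients):
    """Prepend atomic knowledge ingredients (KIs) extracted from one model's
    reasoning trace to a question, so another model can be tested with the
    relevant facts already in context."""
    ki_block = "\n".join(f"- {ki}" for ki in knowledge_ingredients)
    return (
        "Relevant facts:\n"
        f"{ki_block}\n\n"
        f"Question: {question}\n"
        "Answer with step-by-step reasoning."
    )

prompt = build_krux_prompt(
    "What is the boiling point of water at 2 atm?",
    ["Boiling point rises with pressure.",
     "At 2 atm, water boils near 120 °C."],
)
```

Comparing a model's accuracy with and without the KI block isolates knowledge access from reasoning ability.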
3/9 Evaluating frontier models on SciReas, we observe patterns that otherwise remain obscure when looking only at individual benchmarks. Different LLMs may have expertise in different tasks; even the same LLM can show significant performance gaps under different reasoning
2/9 We introduce SciReas and SciReas-Pro, efficient, comprehensive, and reasoning-focused benchmarks for evaluating scientific problem-solving.
1/9 🚀 New paper: Demystifying Scientific Problem-Solving in LLMs. How does reasoning enhancement affect knowledge recall, and do LLMs benefit from external knowledge complementary to reasoning? TL;DR: 📊 SciReas: holistic and efficient evaluation suite for scientific reasoning
Update: It’s happening at 2 PM! Exciting journey, come and join us!
Today at 4 PM, we’re presenting our tutorial: “Evaluating LLM-based Agents: Foundations, Best Practices, & Open Challenges” If you’re in Montreal for @IJCAIconf, come join us to dive into the future of #AgentEvaluation! 🇨🇦🤖 w. @RoyBarHaim @LilachEdel and @alanli2020
Excited for the release of SciArena with @allen_ai! LLMs are now an integral part of research workflows, and SciArena helps measure progress on scientific literature tasks. Also check out the preprint for a lot more results and analyses. Led by: @YilunZhao_NLP, @kaiyan_z 📄 paper:
Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵
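Chatbot-Arena-style platforms typically turn crowdsourced pairwise votes into a leaderboard with an Elo or Bradley-Terry model. A minimal Elo sketch, assuming the standard update rule (not SciArena's exact aggregation method):

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update from a single pairwise preference vote.
    winner is 'a' or 'b'; k controls the step size."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner == "a" else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Toy leaderboard: model_x wins two votes, loses one.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [("model_x", "model_y", "a"),
         ("model_x", "model_y", "a"),
         ("model_x", "model_y", "b")]
for a, b, w in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], w)
```

Because the two updates are symmetric, total rating mass is conserved; a model that wins the majority of its comparisons drifts above its opponents.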
Excited to see more investigation into LLM creativity. We have some pioneering work on this topic as well: Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models. https://t.co/QNyQp1Zs80.
🚨 New study on LLM's reasoning boundary! Can LLMs really think out of the box? We introduce OMEGA—a benchmark probing how they generalize: 🔹 RL boosts accuracy on slightly harder problems with familiar strategies, 🔹 but struggles with creative leaps & strategy composition. 👇
🔥 Excited to share MetaFaith: Understanding and Improving Faithful Natural Language Uncertainty Expression in LLMs🔥 How can we make LLMs talk about uncertainty in a way that truly reflects what they internally "know"? Check out our new preprint to find out! Details in 🧵(1/n):
Besides natural language and formal language, truth tables are also a great medium for logical reasoning, with a synergistic effect. Check out this cool idea from @LichangChen2!
Learn to Reason via Mixture-of-Thought Interesting paper to improve LLM reasoning utilizing multiple reasoning modalities: - code - natural language - symbolic (truth-table) representations Cool idea and nice results. My notes below:
⚠️ Attention: The site is currently down. Our engineering team is investigating. We will update as soon as possible. You can track progress here: https://t.co/y7aRh5SBN4 Sorry for any inconvenience.
Happy to share we received best paper at NENLP workshop at Yale 🥳🥳! tldr: Current alignment methods give excessive discretion to annotators in defining what good behavior means. This means we don't know what we are aligning to ‼️ We formalize discretion in alignment and