Huanzhi Mao (@HuanzhiMao)
Project Lead of BFCL (Berkeley Function Calling Leaderboard), CS PhD @UCBerkeley
Joined August 2022
Followers: 110 · Following: 115 · Media: 4 · Statuses: 73
@SFResearch
Salesforce AI Research
2 months
🌟 Happy National Intern Day! Today we celebrate the brilliant minds and diverse perspectives that our interns bring to @SFResearch. Our interns contribute to groundbreaking AI research from day one, bringing fresh ideas that drive innovation and solve complex problems for …
Replies: 0 · Reposts: 11 · Likes: 28
@kuchaev
Oleksii Kuchaiev
3 months
Very excited to announce Llama-Nemotron-Super-V1.5! Super-V1.5 is now better than Ultra-V1. This is currently the best model that can be deployed on a single H100. Reasoning On/Off and drop in replacement for V1. Open-weight, code and data on HF https://t.co/bePZEQJllC
Replies: 8 · Reposts: 43 · Likes: 190
@Alibaba_Qwen
Qwen
3 months
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves …
Replies: 337 · Reposts: 2K · Likes: 9K
@Alibaba_Qwen
Qwen
3 months
Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507! After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing …
Replies: 212 · Reposts: 578 · Likes: 4K
@HuanzhiMao
Huanzhi Mao
3 months
OpenAI #BrowseComp blog's demo question cites EMNLP’21 “Frequency Effects on Syntactic Rule Learning in Transformers,” saying its 4th author did undergrad at UPenn. But Ellie Pavlick’s own CV lists JHU. If the sample question label is even wrong, how reliable is the benchmark?🤔
Replies: 0 · Reposts: 0 · Likes: 0
@shishirpatil_
Shishir Patil
3 months
🔥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic. As function-calling (also called tool-calling) forms the bedrock of Agentic systems, the BFCL V4 Agentic benchmark focuses on tool-calling in real-world agentic settings — including: 🔍 Web search with multi-hop …
Replies: 1 · Reposts: 10 · Likes: 19
@shishirpatil_
Shishir Patil
3 months
Since then, there’s been no looking back. Function-calling evaluation turned out to be a far richer area, with many more open research questions than anyone anticipated. 🧠 BFCL v1 introduced AST (Abstract Syntax Tree) based evaluation — still the gold standard for zero-shot …
Replies: 2 · Reposts: 2 · Likes: 7
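BFCL's actual AST-based matcher is more involved, but the core idea can be sketched in a few lines of Python (the function names below are invented for illustration): parse the model's emitted call with the `ast` module and compare the call structurally, so whitespace and keyword-argument order don't affect the verdict the way raw string matching would.

```python
import ast

def parse_call(call_str: str):
    """Parse a call string like `get_weather(city="Berkeley")` into
    (function name, positional args, keyword args).

    Assumes a simple function name (not a dotted attribute) and
    literal arguments, for brevity.
    """
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = node.func.id
    args = tuple(ast.literal_eval(a) for a in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs

def calls_match(model_output: str, expected: str) -> bool:
    """Structural (AST-level) comparison: insensitive to spacing and
    keyword-argument order, unlike exact string matching."""
    return parse_call(model_output) == parse_call(expected)

# Same call, different kwarg order and spacing -> still a match.
print(calls_match('get_weather(city="Berkeley", unit="C")',
                  'get_weather(unit="C",  city="Berkeley")'))  # True
```

This only illustrates the structural-matching principle; the real harness also handles type coercion, multiple acceptable answers, and more.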
@shishirpatil_
Shishir Patil
3 months
📢 Big update: Introducing BFCL V4 Agentic — and BFCL published at ICML 2025! 🏟️ Some BFCL lore... back in 2022, as researchers we couldn't find good open-source models that could handle zero-shot function calling — so we decided to train our own. Sounds simple, right? It was!
Replies: 6 · Reposts: 12 · Likes: 56
@NousResearch
Nous Research
6 months
We also release a collection of artifacts created using environments in Atropos, including one new dataset and five new models for tool calling, corporate fundamentals prediction, and new, experimental personalities with RLAIF. https://t.co/ZzGdB3lT9S On tool calling, our …
Replies: 2 · Reposts: 9 · Likes: 130
@Alibaba_Qwen
Qwen
6 months
Introducing Qwen3! We release Qwen3, our latest open-weight large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general …
Replies: 352 · Reposts: 2K · Likes: 8K
@SFResearch
Salesforce AI Research
6 months
Our xLAM (#LargeActionModels) family just got an upgrade! 1️⃣ Multi-turn, natural conversation support 2️⃣ Smarter multi-step reasoning 3️⃣ Models from 1B to 70B for ultimate flexibility 🤗 HuggingFace: https://t.co/SWsQXeURtv 👑 BFCL Leaderboard: https://t.co/n3P5o2nS6S Our …
Replies: 0 · Reposts: 20 · Likes: 53
@Alibaba_Qwen
Qwen
7 months
Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1. Blog: https://t.co/zCgACNdodj HF: https://t.co/pfjZygOiyQ ModelScope: https://t.co/hcfOD8wSLa Demo: https://t.co/DxWPzAg6g8 Qwen Chat: …
Replies: 482 · Reposts: 2K · Likes: 9K
@Alex_Cuadron
Alejandro Cuadron
8 months
New discovery! LLMs are just like humans! Overthinking GREATLY HURTS their performance. If we select the solution with the lower overthinking score, we improve model performance by almost 30% while reducing costs by 43% (o1_low). Is reasoning really the future of LLMs? 🧵
Replies: 42 · Reposts: 165 · Likes: 1K
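The selection rule described in the tweet reduces to picking the candidate solution whose overthinking score is lowest. A minimal sketch (the candidate data and the `overthinking_score` field are invented here; the paper defines its own metric):

```python
def pick_least_overthinking(candidates):
    """Among sampled candidate solutions, select the one with the
    lowest overthinking score (hypothetical metric field)."""
    return min(candidates, key=lambda c: c["overthinking_score"])

# Toy candidates: three sampled patches with made-up scores.
candidates = [
    {"solution": "patch_a.diff", "overthinking_score": 0.82},
    {"solution": "patch_b.diff", "overthinking_score": 0.31},
    {"solution": "patch_c.diff", "overthinking_score": 0.57},
]
print(pick_least_overthinking(candidates)["solution"])  # patch_b.diff
```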
@shishirpatil_
Shishir Patil
1 year
🙇We are delighted to see BFCL become the gold standard for evaluating LLMs’ ability to invoke functions, and we are grateful for the community’s continued feedback. We would like to thank all enterprises and open-source contributors who contributed to this - please keep the …
Replies: 1 · Reposts: 2 · Likes: 8
@shishirpatil_
Shishir Patil
1 year
If a model doesn't interact with the backend state, it is destined to fail. A model tries to create a directory that already exists because it fails to check the current state. This highlights how models still struggle with *planning, error recovery, and implicit decision-making*
Replies: 2 · Reposts: 2 · Likes: 12
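The failure mode in the tweet above (a model issuing `mkdir` for a directory that already exists because it never checked) is easy to illustrate. A toy sketch of the correct behavior, with an invented helper name: probe the backend state before acting instead of assuming the action will succeed.

```python
import os
import tempfile

def make_dir_safely(path: str) -> str:
    """Check the backend state before acting: only create the directory
    if it does not already exist, instead of blindly issuing mkdir and
    crashing on FileExistsError."""
    if os.path.isdir(path):  # the state check the failing models skip
        return f"skipped: {path} already exists"
    os.mkdir(path)
    return f"created: {path}"

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "reports")
    print(make_dir_safely(target))  # created: .../reports
    print(make_dir_safely(target))  # skipped: .../reports already exists
```

The second call succeeds gracefully precisely because it interacted with the backend state first.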
@shishirpatil_
Shishir Patil
1 year
LLM Agents have characters. Our observations from evaluating BFCL V3 provide insights that some LLMs tend to explore the internal state space first, then perform actions. Others tend to be brave and directly perform actions according to user instructions. There isn’t an optimal …
Replies: 1 · Reposts: 2 · Likes: 9
@shishirpatil_
Shishir Patil
1 year
Short-context models should be out of the picture. When asked to sort through hundreds of files or navigate long booking lists, can an AI focus on what’s important? BFCL V3 throws models into these long-context situations, ensuring they can filter out noise and retrieve what …
Replies: 1 · Reposts: 2 · Likes: 7
@shishirpatil_
Shishir Patil
1 year
BFCL V3’s dataset is carefully designed. From the Base Multi-Turn category (not to be confused with multi-step) to Long-Context and Follow-up challenges, each category is crafted to cover both basic and highly complex interactions. This ensures a well-rounded evaluation of LLMs, …
Replies: 1 · Reposts: 2 · Likes: 8
@shishirpatil_
Shishir Patil
1 year
LLMs need to make an effort to probe the state. In BFCL V3, we evaluate models with an internal state *invisible* to the LLMs. Did the stock purchase actually go through? Did the file system update correctly? LLMs need to collect information from the internal state through APIs …
Replies: 1 · Reposts: 2 · Likes: 11
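A toy illustration of the pattern this thread describes (every class and method name here is invented, not BFCL's actual API): the environment's state is hidden from the agent, so the only way to learn whether a stock purchase went through is to call a read API rather than assume success.

```python
class TradingEnv:
    """Toy environment whose internal state is invisible to the model:
    the agent can only learn it through API calls."""

    def __init__(self):
        self._orders = {}  # hidden state: order_id -> status

    def place_order(self, ticker: str) -> str:
        """Submit an order; in this toy, only NVDA orders fill."""
        order_id = f"ord-{len(self._orders) + 1}"
        self._orders[order_id] = "filled" if ticker == "NVDA" else "rejected"
        return order_id

    def get_order_status(self, order_id: str) -> str:
        """The read API an agent must call to probe the hidden state."""
        return self._orders[order_id]

env = TradingEnv()
oid = env.place_order("NVDA")
# Probe the state instead of assuming the purchase succeeded:
if env.get_order_status(oid) == "filled":
    print(f"{oid} confirmed filled")
else:
    print(f"{oid} did not fill; recover or report back")
```

An agent that skips the `get_order_status` probe has no way to know whether to retry, recover, or report failure.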
@shishirpatil_
Shishir Patil
1 year
📣 Announcing BFCL V3 - evaluating how LLMs handle multi-turn and multi-step function calling! 🚀 For agentic systems, function calling is critical, but a model needs to do more than single-turn tasks. Can it manage multi-turn workflows, handle sequential functions, and adapt to …
Replies: 9 · Reposts: 41 · Likes: 204