Huanzhi Mao (@HuanzhiMao)
Project Lead of BFCL (Berkeley Function Calling Leaderboard), CS PhD @UCBerkeley
Joined August 2022
Followers: 110 · Following: 115 · Media: 4 · Statuses: 73
@SFResearch
Salesforce AI Research
2 months
🌟 Happy National Intern Day! Today we celebrate the brilliant minds and diverse perspectives that our interns bring to @SFResearch. Our interns contribute to groundbreaking AI research from day one, bringing fresh ideas that drive innovation and solve complex problems for …
Replies: 0 · Reposts: 11 · Likes: 28
@kuchaev
Oleksii Kuchaiev
3 months
Very excited to announce Llama-Nemotron-Super-V1.5! Super-V1.5 is now better than Ultra-V1. This is currently the best model that can be deployed on a single H100. Reasoning On/Off and drop in replacement for V1. Open-weight, code and data on HF https://t.co/bePZEQJllC
Replies: 8 · Reposts: 43 · Likes: 190
@Alibaba_Qwen
Qwen
3 months
>>> Qwen3-Coder is here! ✅ We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves …
Replies: 337 · Reposts: 2K · Likes: 9K
@Alibaba_Qwen
Qwen
3 months
Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507! After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing …
Replies: 212 · Reposts: 578 · Likes: 4K
@HuanzhiMao
Huanzhi Mao
3 months
OpenAI #BrowseComp blog's demo question cites EMNLP’21 “Frequency Effects on Syntactic Rule Learning in Transformers,” saying its 4th author did undergrad at UPenn. But Ellie Pavlick’s own CV lists JHU. If the sample question label is even wrong, how reliable is the benchmark?🤔
Replies: 0 · Reposts: 0 · Likes: 0
@shishirpatil_
Shishir Patil
3 months
🔥 At ICML 2025, we’re delighted to introduce BFCL V4 Agentic. As function-calling (also called tool-calling) forms the bedrock of Agentic systems, the BFCL V4 Agentic benchmark focuses on tool-calling in real-world agentic settings — including: 🔍 Web search with multi-hop …
Replies: 1 · Reposts: 10 · Likes: 19
@shishirpatil_
Shishir Patil
3 months
Since then, there’s been no looking back. Function-calling evaluation turned out to be a far richer area, with many more open research questions than anyone anticipated. 🧠 BFCL v1 introduced AST (Abstract Syntax Tree) based evaluation — still the gold standard for zero-shot …
Replies: 2 · Reposts: 2 · Likes: 7
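BFCL's actual AST-based matcher is more involved, but the core idea can be sketched in a few lines of Python (the function names below are invented for illustration): parse the model's emitted call with the `ast` module and compare the call structurally, so whitespace and keyword-argument order don't affect the verdict the way raw string matching would.

```python
import ast

def parse_call(call_str: str):
    """Parse a call string like `get_weather(city="Berkeley")` into
    (function name, positional args, keyword args).

    Assumes a simple function name (not a dotted attribute) and
    literal arguments, for brevity.
    """
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = node.func.id
    args = tuple(ast.literal_eval(a) for a in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args, kwargs

def calls_match(model_output: str, expected: str) -> bool:
    """Structural (AST-level) comparison: insensitive to spacing and
    keyword-argument order, unlike exact string matching."""
    return parse_call(model_output) == parse_call(expected)

# Same call, different kwarg order and spacing -> still a match.
print(calls_match('get_weather(city="Berkeley", unit="C")',
                  'get_weather(unit="C",  city="Berkeley")'))  # True
```

This only illustrates the structural-matching principle; the real harness also handles type coercion, multiple acceptable answers, and more.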
@shishirpatil_
Shishir Patil
3 months
📢 Big update: Introducing BFCL V4 Agentic — and BFCL published at ICML 2025! 🏟️ Some BFCL lore... back in 2022, as researchers we couldn't find good open-source models that could handle zero-shot function calling — so we decided to train our own. Sounds simple, right? It was!
Replies: 6 · Reposts: 12 · Likes: 56
@NousResearch
Nous Research
6 months
We also release a collection of artifacts created using environments in Atropos, including one new dataset and five new models for tool calling, corporate fundamentals prediction, and new, experimental personalities with RLAIF. https://t.co/ZzGdB3lT9S On tool calling, our …
Replies: 2 · Reposts: 9 · Likes: 130
@Alibaba_Qwen
Qwen
6 months
Introducing Qwen3! We release Qwen3, our latest open-weight large language models, including 2 MoE models and 6 dense models, ranging from 0.6B to 235B. Our flagship model, Qwen3-235B-A22B, achieves competitive results in benchmark evaluations of coding, math, general …
Replies: 352 · Reposts: 2K · Likes: 8K
@SFResearch
Salesforce AI Research
6 months
Our xLAM (#LargeActionModels) family just got an upgrade! 1️⃣ Multi-turn, natural conversation support 2️⃣ Smarter multi-step reasoning 3️⃣ Models from 1B to 70B for ultimate flexibility 🤗 HuggingFace: https://t.co/SWsQXeURtv 👑 BFCL Leaderboard: https://t.co/n3P5o2nS6S Our …
Replies: 0 · Reposts: 20 · Likes: 53
@Alibaba_Qwen
Qwen
7 months
Today, we release QwQ-32B, our new reasoning model with only 32 billion parameters that rivals cutting-edge reasoning models, e.g., DeepSeek-R1. Blog: https://t.co/zCgACNdodj HF: https://t.co/pfjZygOiyQ ModelScope: https://t.co/hcfOD8wSLa Demo: https://t.co/DxWPzAg6g8 Qwen Chat: …
Replies: 482 · Reposts: 2K · Likes: 9K
@Alex_Cuadron
Alejandro Cuadron
8 months
New discovery! LLMs are just like humans! Overthinking GREATLY HURTS their performance. If we select the solution with the lower overthinking score, we improve model performance by almost 30% while reducing costs by 43% (o1_low). Is reasoning really the future of LLMs? 🧵
Replies: 42 · Reposts: 165 · Likes: 1K
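The selection rule described in the tweet reduces to picking the candidate solution whose overthinking score is lowest. A minimal sketch (the candidate data and the `overthinking_score` field are invented here; the paper defines its own metric):

```python
def pick_least_overthinking(candidates):
    """Among sampled candidate solutions, select the one with the
    lowest overthinking score (hypothetical metric field)."""
    return min(candidates, key=lambda c: c["overthinking_score"])

# Toy candidates: three sampled patches with made-up scores.
candidates = [
    {"solution": "patch_a.diff", "overthinking_score": 0.82},
    {"solution": "patch_b.diff", "overthinking_score": 0.31},
    {"solution": "patch_c.diff", "overthinking_score": 0.57},
]
print(pick_least_overthinking(candidates)["solution"])  # patch_b.diff
```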
@shishirpatil_
Shishir Patil
1 year
🙇We are delighted to see BFCL become the gold standard for evaluating LLMs’ ability to invoke functions, and we are grateful for the community’s continued feedback. We would like to thank all enterprises and open-source contributors who contributed to this - please keep the …
Replies: 1 · Reposts: 2 · Likes: 8
@shishirpatil_
Shishir Patil
1 year
If a model doesn't interact with the backend state, it is destined to fail. A model tries to create a directory that already exists because it fails to check the current state. This highlights how models still struggle with *planning, error recovery, and implicit decision-making*
Replies: 2 · Reposts: 2 · Likes: 12
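The failure mode in the tweet above (a model issuing `mkdir` for a directory that already exists because it never checked) is easy to illustrate. A toy sketch of the correct behavior, with an invented helper name: probe the backend state before acting instead of assuming the action will succeed.

```python
import os
import tempfile

def make_dir_safely(path: str) -> str:
    """Check the backend state before acting: only create the directory
    if it does not already exist, instead of blindly issuing mkdir and
    crashing on FileExistsError."""
    if os.path.isdir(path):  # the state check the failing models skip
        return f"skipped: {path} already exists"
    os.mkdir(path)
    return f"created: {path}"

with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "reports")
    print(make_dir_safely(target))  # created: .../reports
    print(make_dir_safely(target))  # skipped: .../reports already exists
```

The second call succeeds gracefully precisely because it interacted with the backend state first.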
@shishirpatil_
Shishir Patil
1 year
LLM Agents have characters. Our observations from evaluating BFCL V3 provide insights that some LLMs tend to explore the internal state space first, then perform actions. Others tend to be brave and directly perform actions according to user instructions. There isn’t an optimal …
Replies: 1 · Reposts: 2 · Likes: 9
@shishirpatil_
Shishir Patil
1 year
Short-context models should be out of the picture. When asked to sort through hundreds of files or navigate long booking lists, can an AI focus on what’s important? BFCL V3 throws models into these long-context situations, ensuring they can filter out noise and retrieve what …
Replies: 1 · Reposts: 2 · Likes: 7
@shishirpatil_
Shishir Patil
1 year
BFCL V3’s dataset is carefully designed. From the Base Multi-Turn category (not to be confused with multi-step) to Long-Context and Follow-up challenges, each category is crafted to cover both basic and highly complex interactions. This ensures a well-rounded evaluation of LLMs, …
Replies: 1 · Reposts: 2 · Likes: 8
@shishirpatil_
Shishir Patil
1 year
LLMs need to make an effort to probe the state. In BFCL V3, we evaluate models with an internal state *invisible* to the LLMs. Did the stock purchase actually go through? Did the file system update correctly? LLMs need to collect information from the internal state through APIs …
Replies: 1 · Reposts: 2 · Likes: 11
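A toy illustration of the pattern this thread describes (every class and method name here is invented, not BFCL's actual API): the environment's state is hidden from the agent, so the only way to learn whether a stock purchase went through is to call a read API rather than assume success.

```python
class TradingEnv:
    """Toy environment whose internal state is invisible to the model:
    the agent can only learn it through API calls."""

    def __init__(self):
        self._orders = {}  # hidden state: order_id -> status

    def place_order(self, ticker: str) -> str:
        """Submit an order; in this toy, only NVDA orders fill."""
        order_id = f"ord-{len(self._orders) + 1}"
        self._orders[order_id] = "filled" if ticker == "NVDA" else "rejected"
        return order_id

    def get_order_status(self, order_id: str) -> str:
        """The read API an agent must call to probe the hidden state."""
        return self._orders[order_id]

env = TradingEnv()
oid = env.place_order("NVDA")
# Probe the state instead of assuming the purchase succeeded:
if env.get_order_status(oid) == "filled":
    print(f"{oid} confirmed filled")
else:
    print(f"{oid} did not fill; recover or report back")
```

An agent that skips the `get_order_status` probe has no way to know whether to retry, recover, or report failure.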
@shishirpatil_
Shishir Patil
1 year
📣 Announcing BFCL V3 - evaluating how LLMs handle multi-turn and multi-step function calling! 🚀 For agentic systems, function calling is critical, but a model needs to do more than single-turn tasks. Can it manage multi-turn workflows, handle sequential functions, and adapt to …
Replies: 9 · Reposts: 41 · Likes: 204