jasonwu0731 Profile Banner
Chien-Sheng (Jason) Wu Profile
Chien-Sheng (Jason) Wu

@jasonwu0731

Followers
2K
Following
1K
Media
11
Statuses
326

Director at @SFResearch. Working on #NLProc. Opinions are my own.

Palo Alto, CA
Joined May 2012
Don't wanna be here? Send us removal request.
@jasonwu0731
Chien-Sheng (Jason) Wu
8 days
RT @SFResearch: The new synthetic data training grounds advancing #EGI 🚨. Hot off the press: How @Salesforce AI Research is mirroring the s….
0
2
0
@jasonwu0731
Chien-Sheng (Jason) Wu
1 month
šŸ”Deep Search ≠ Deep Research. It’s not about browsing, insight mining, coding, or report writing—it’s about retrieving signal from messy, scattered data: GDocs, Slack, Meeting, GitHub, OrgCharts, etc. Agents must reason across it all, know what to search and where to search!.
@SFResearch
Salesforce AI Research
1 month
🧪HERB - a benchmark that puts RAG systems to the test with real enterprise challenges!. šŸ“Š Even our best agentic RAG systems only hit 30% accuracy when dealing with scattered info across Slack, GitHub, docs & meetings. šŸ” Key finding: Retrieval is the main bottleneck, not
Tweet media one
0
2
18
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @PranavVenkit: Thank you @FAccTConference! Presenting my work here was extremely fun. Such a great place for people who conduct interdis….
0
1
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @omarsar0: @karpathy Great share as usual! Just read this related piece where a study showed issues with LLM-based agents not recognizin….
Tweet card summary image
arxiv.org
While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing...
0
5
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @CaimingXiong: Introducing new SOTA GUI grounding model -- šŸ”„Grounding-R1šŸ”„ for Computer-Use Agent. Key insights:.1. "Thinking" is not re….
0
29
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @ChombaBupe: Another paper drop, this time from Salesforce:. "These results underscore a significant gap between current LLM.capabilitie….
0
102
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @Marktechpost: Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents. Researchers fr….
0
9
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @silviocinguetta: Synthesized data for #EnterpriseAI evaluation is an ethical imperative. CRMArena-Pro lets us rigorously test agents in….
0
5
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @TeksEdge: Now that we are fully immersed in the year of Agentic AI, nobody has created a benchmark to evaluate how well these agents wo….
0
4
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @CaimingXiong: AI agents are rapidly integrating into various industries, however their full potential remains underutilized due to perf….
0
19
0
@jasonwu0731
Chien-Sheng (Jason) Wu
2 months
RT @SFResearch: ⚔ NEW COMPUTER-USE AI RESEARCH ⚔. Introducing:. 1ļøāƒ£ Our paper, Scaling Computer-Use Grounding via User Interface Decomposi….
0
15
0
@jasonwu0731
Chien-Sheng (Jason) Wu
3 months
RT @steeve__huang: 🚨 The Business AI Plot Thickens 🚨. CRMArena set the stage for business AI evaluation in realistic environments. Now we'r….
0
10
0
@jasonwu0731
Chien-Sheng (Jason) Wu
3 months
Enterprise Synthetic Data + LLM Agent Benchmarking + Actionable Insights = Towards Better Enterprise AI Agents. Check out the new CRMArena-Pro:.– B2B/B2C scenarios.– Service/Sales/CPQ tasks.– Single/Multi-turn testing.– Confidentiality awareness eval. Let’s go Enterprise Agents!.
@SFResearch
Salesforce AI Research
3 months
🚨 Introducing CRMArena-Pro: The first multi-turn, enterprise-grade benchmark for LLM agents. āœļøBlog: šŸ–‡ļøPaper: šŸ¤—Dataset: šŸ–„ļøCode: Most AI benchmarks test isolated, single-turn tasks.
Tweet media one
2
1
20
@jasonwu0731
Chien-Sheng (Jason) Wu
3 months
RT @SFResearch: 🚨 Introducing CRMArena-Pro: The first multi-turn, enterprise-grade benchmark for LLM agents. āœļøBlog: .
0
26
0
@jasonwu0731
Chien-Sheng (Jason) Wu
3 months
When fine-tuning your RAG pipeline or AI agent responses, don’t overlook how your system handles unanswerable queries. Use our evaluation pipeline to avoid overpromising — transparency builds trust, especially when you don’t know the answer. #ACL2025.
@SFResearch
Salesforce AI Research
3 months
🧠 RAG systems excel at answering questions—but what happens when there's no answer?. We introduce UAEval4RAG, a framework to evaluate how well RAG models handle unanswerable queries. šŸ“„ Paper: šŸ”— Code: By categorizing
Tweet media one
0
3
14
@jasonwu0731
Chien-Sheng (Jason) Wu
3 months
Top 2 takeaways from our work:. 1. VLM visual features do contain info for visual arithmetic—but without fine-tuning a strong decoder, it remains locked. 2. Training VLMs on just 8 invariant properties can enhance chart and visual math tasks, matching SFT with 60% less data.
@steeve__huang
Kung-Hsiang Steeve Huang
3 months
Excited to share that CogAlign is accepted at #ACL2025 Findings! We investigated the "Jagged Intelligence" of VLMs – their surprising difficulty with basic visual arithmetics (e.g., counting objects, measuring angles) compared to their strong performance on harder visual tasks.
0
2
9
@jasonwu0731
Chien-Sheng (Jason) Wu
4 months
RT @steeve__huang: Excited to present our CRMArena paper at #NAACL2025 as an oral presentation! šŸŽ‰. ā° Tomorrow (April 30) 16:00-17:30.šŸ“ Ball….
0
3
0
@jasonwu0731
Chien-Sheng (Jason) Wu
4 months
Check our work at #NAACL2025!. ✨ CRMArena: Enterprise synthetic data and agent eval @steeve__huang . ✨ Evaluate RAG with sub-question coverage @KaigeXie . ✨ Cultural and Social Awareness of LLM Agents @HaoyiQiu . ✨ ReIFE: Meta-eval of instruction-following. @YixinLiu17
Tweet media one
0
5
27
@jasonwu0731
Chien-Sheng (Jason) Wu
4 months
RT @infwinston: It’s been a two-year journey since we launched the very first Arena leaderboard!. What began as a weekend side project has….
0
4
0
@jasonwu0731
Chien-Sheng (Jason) Wu
4 months
RT @windx0303: @TuhinChakr is on stage! #CHI2025
Tweet media one
0
4
0