
Huan Sun (OSU)
@hhsun1
Followers: 5K
Following: 2K
Media: 49
Statuses: 720
Associate Professor (with Tenure) in CSE, endowed CoE Innovation Scholar, CoP Co-Director @OSUbigdata, The Ohio State University (NLP and Data Mining)
The Ohio State University
Joined March 2012
More details about Mind2Web 2: . Explore examples here:
🔎 Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis ⚠️. Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge.
- 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor.
RT @_zifan_wang: 🧵 (6/9) Our findings suggest that traditional capability benchmarks may not adequately assess search-based LLM agents. We r….
arxiv.org
Agentic search such as Deep Research systems, where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how...
Glad that the @scale_AI team specifically studies this "Search-Time Data Contamination" (STC) problem of existing agentic search benchmarks, where an agent might just retrieve a source that directly leaks the answer - no actual reasoning or complex web navigation is needed. In.
🧵 (1/9) New @scale_AI research paper: "Search-Time Data Contamination" (STC), which occurs in evaluating search-based LLM agents when the retrieval step contains clues about a question’s answer by virtue of being derived from the evaluation set itself.
RT @xiangyue96: 20 months after our multimodal reasoning benchmark MMMU release, both frontier and open models ar….
RT @CaimingXiong: 🚀 Computer-using agents represent a powerful new paradigm for human-computer interaction. Over the past year, we’ve explo….
RT @AnthropicAI: New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously c….
The System Two Safety idea is very interesting. In environments that may have maliciously injected instructions, deliberative reasoning seems particularly important: Are the instructions in the current context out of place or suspicious, e.g., see Indirect Prompt Injection
I'm a co-author on this new paper on Chain of Thought Monitoring. It's very related to the System Two Safety idea that I've been talking about for a few years.
RT @AbrahamOwos: Yaaaay! our paper got a best social impact award at ACL!!!🎉🎉. I couldn't attend the conference sadly.
RT @nouhadziri: Come join us tomorrow at the 1st LLM agents workshop in ACL (REALM); amazing talks and oral presentations are ahead, with a….
RT @boyuan__zheng: Remember “Son of Anton” from the Silicon Valley show (@SiliconHBO)? The experimental AI that “efficiently” orders 4,000 l….
Check out our WebGuard led by @boyuan__zheng: the first large-scale dataset for training and evaluating guardrails to detect consequential web agent actions (actions with significant or irreversible consequences):
📊 4,939 human-labeled actions.
📷 193 websites across 22 domains.
Remember “Son of Anton” from the Silicon Valley show (@SiliconHBO)? The experimental AI that “efficiently” orders 4,000 lbs of meat while looking for a cheap burger and “fixes” a bug by deleting all the code? It’s starting to look a lot like reality. Even 18 months ago, my own
RT @scale_AI: As AI agents start taking real actions online, how do we prevent unintended harm? We teamed up with @OhioState and @UCBerkel….
RT @AnaisHowland18: WebGuard is a big step for web-agent safety: ~5k human-tagged actions across 193 sites. Frontier LLMs hit <60% on high-….
RT @niloofar_mire: I’m gonna be recruiting students thru both @LTIatCMU (NLP) and @CMU_EPP (Engineering and Public Policy) for fall 2026!….