
Xing Han Lu
@xhluca
Followers
2K
Following
6K
Media
214
Statuses
2K
Vibe agents @Mila_Quebec @McGill_NLP
The Wired
Joined December 2017
RT @MassCaccia: Our paper “How to Train Your LLM Web Agent: A Statistical Diagnosis” got an oral at next week’s ICML Workshop on Computer…
0
26
0
AgentRewardBench will be presented at @COLM_conf 2025 in Montreal! See you soon and ping me if you want to meet up!
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories. We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and
2
7
31
RT @yoavartzi: @COLM_conf decisions are out, and so are we. The strength of submissions this year amazed us! Many many hard decisions…
0
8
0
RT @kyutai_labs: Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users wi…
0
172
0
RT @BlackboxNLP: Excited to announce two invited speakers at #BlackboxNLP 2025! Join us to hear from two leading voices in interpretabil…
0
10
0
RT @vernadankers: I miss Edinburgh and its wonderful people already!! Thanks to @tallinzen and @PontiEdoardo for inspiring discussions duri…
0
8
0
RT @xhluca: @webagentlab Would appreciate if the authors could avoid copying the title of our paper, which was released more than 2 months a…
0
1
0
RT @benno_krojer: Started a new podcast with @tvergarabrowne! Behind the Research of AI: We look behind the scenes, beyond the polished…
0
13
0
RT @cesare_spinoso: A blizzard is raging in Montreal when your friend says “Wow, the weather is amazing!” Humans easily interpret irony, wh…
0
11
0
RT @benno_krojer: The video is online now! 3min speed science talk on "From a soup of raw pixels to abstract meaning".
0
6
0
RT @ReviewAcl: Dear ACL community, We are seeking emergency reviewers for the May cycle. Please indicate your availability (ASAP) if you ca…
0
16
0
Very important benchmark about the safety of computer use agents. Validates our findings in SafeArena that agents can complete harmful tasks, now with reasoning models and on OS tasks. We need safer digital agents asap before more productization.
🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm: 1. deliberate user misuse, 2. prompt injections, 3. model misbehavior.
0
7
25
RT @maksym_andr: 🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety b…
0
27
0
RT @hanseok_oh: Life update: I am joining as visiting researcher at @Mila_Quebec 🇨🇦. I returned to academia to deepen my understanding of h…
0
4
0