
Zora Wang
@ZhiruoW
Followers: 1K · Following: 880 · Media: 34 · Statuses: 224
PhD student @LTIatCMU + visiting @StanfordNLP | prev @Amazon Alexa AI, @Microsoft Research Asia | fun 👩🏻💻 🐈 💃 🪴 🎶
Joined August 2021
I will give a tutorial on "LLM for Tabular Data" at @SIGIRConf from 9-12:30 today, together with @HaoyuDong9, my previous mentor at Microsoft Research Asia! Join our tutorial in person, or check out our slides/recording online if you're interested!
4
32
186
Heading to #EMNLP2023 next week ✈️ If you're interested in code generation 🧑💻 don't hesitate to check out our two papers!
- ODEX, a challenging benchmark with open-domain coding queries:
- API-assisted code generation for table QA:
2
22
94
Couldn't agree more that agents should "continually adapt" from "streamed experiences"! This is exactly what we've envisioned in building online adaptive agents with self-induced evolving memory & skills in AWM ( and ASI (! Yet still some…
David Silver really hits it out of the park in this podcast. The paper "Welcome to the Era of Experience" is here:
1
66
68
Excited to share that our survey has been accepted at the very first @COLM_conf! Check out our paper if you want to learn more about tool-augmented language models 👩🔧
Tools can empower LMs to solve many tasks. But what are tools anyway? Our survey studies tools for LLM agents w/
– A formal definition of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of the cost-gain trade-off
1
5
68
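For intuition: the survey's formal definition treats a tool as a function external to the LM that the agent invokes through its action interface. Below is a minimal dispatch sketch in Python; the names and structure are my own illustration, not the survey's formalism or any released code.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str       # shown to the LM so it knows when to call the tool
    fn: Callable[..., str]

def run_tool(tools: dict[str, Tool], name: str, **kwargs) -> str:
    """Dispatch an LM-emitted tool call to the matching function."""
    if name not in tools:
        return f"Error: unknown tool '{name}'"
    return tools[name].fn(**kwargs)

tools = {
    "calculator": Tool(
        name="calculator",
        description="Evaluate a basic arithmetic expression.",
        fn=lambda expr: str(eval(expr, {"__builtins__": {}})),
    ),
}
print(run_tool(tools, "calculator", expr="17 * 24"))  # -> '408'
```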
Due to popular demand, we released our pre-trained models for filtering context to improve RAG accuracy & efficiency:
🌐 [open] for general Wikipedia queries
🧐 [multihop] for complex multi-hop tasks
💬 [fact] & [dialog] for fact verification and dialog tasks
Everyone is using RAG, but most of the retrieved context is noisy! 🚨 Introducing FilCo: "Learning to Filter Context for Retrieval-Augmented Generation". TL;DR: get rid of the irrelevant content using FilCo, and you'll get better outputs. Preprint:
0
10
65
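For intuition, context filtering before generation can be sketched in a few lines. The lexical-overlap scorer below is a crude stand-in for FilCo's trained filter models, and all names are illustrative, not the released API.

```python
def overlap_score(query: str, passage: str) -> float:
    """Fraction of query words that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def filter_context(query: str, passages: list[str], threshold: float = 0.3) -> list[str]:
    """Keep only passages whose relevance score clears the threshold."""
    return [p for p in passages if overlap_score(query, p) >= threshold]

passages = [
    "Marie Curie won Nobel Prizes in physics and chemistry.",
    "The weather in Paris is mild in spring.",
]
kept = filter_context("which prizes did marie curie win", passages)
print(kept)  # only the relevant passage survives filtering
```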
Excited to share that our CowPilot 🐮 is accepted to the #NAACL 2025 Demo Track! Definitely check out our user study if you're interested in trying out CowPilot:
[1/6] 🤔 Ever wondered if you could collaborate with an agent on web tasks? We present CowPilot 🐮, a framework for human-agent collaboration in web navigation that allows humans to intervene dynamically. 📄 🌐
0
72
38
I'll be at @COLM_conf from Oct 7-9th and will present our work at the Monday morning poster session. Come check out our poster if you're interested! Also feel free to DM me if you want to talk about code generation, tools, and agents 🙌
Tools can empower LMs to solve many tasks. But what are tools anyway? Our survey studies tools for LLM agents w/
– A formal definition of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of the cost-gain trade-off
0
5
55
Excited to share that AWM has been accepted at #ICML2025 🥳 Check out our online memory-adaptive agent if you haven't! 🔗
How can we create AI agents that continually improve, learning from past successes? Presenting 🌟Agent Workflow Memory🌟, which allows agents to induce, learn, and use task workflows from experiences on the fly 🪽 Adding AWM to a strong agent improves accuracy by 51.1% on…
1
25
50
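A conceptual sketch of the induce-then-reuse loop behind workflow memory, with the caveat that the class and method names are my own simplification, not the actual AWM implementation:

```python
class WorkflowMemory:
    def __init__(self):
        self.workflows: list[dict] = []  # each entry: {"task": ..., "steps": [...]}

    def induce(self, task: str, trajectory: list[str]) -> None:
        """After a successful episode, distill the trajectory into a reusable workflow."""
        self.workflows.append({"task": task, "steps": trajectory})

    def retrieve(self, task: str, k: int = 2) -> list[dict]:
        """Rank stored workflows by naive word overlap with the new task."""
        query = set(task.lower().split())
        def score(w: dict) -> int:
            return len(query & set(w["task"].lower().split()))
        return sorted(self.workflows, key=score, reverse=True)[:k]

memory = WorkflowMemory()
memory.induce("book a flight", ["open airline site", "search dates", "select fare", "pay"])
for w in memory.retrieve("book a train ticket"):
    print(w["task"], "->", w["steps"])
```

The key design point is that memory grows online: every solved task can contribute a workflow that makes related future tasks easier.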
Our arXiv preprint is out now! 🔗: If you know other awesome papers on tool use in LLMs, please let us know and feel free to open a PR! 👩💻:
Tools can empower LMs to solve many tasks. But what are tools anyway? Our survey studies tools for LLM agents w/
– A formal definition of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of the cost-gain trade-off
1
10
54
Totally agree that agents should improve throughout streamed evaluation! Happy to see Agent Workflow Memory ( mentioned as an example 😎 Also read our recent work ASI (Agent Skill Induction), which induces programmatic skills on the fly and further boosts success &…
I finally wrote another blogpost: AI just keeps getting better over time, but NOW is a special moment that I call "the halftime". Before it, training > eval. After it, eval > training. The reason: RL finally works. Lmk ur feedback so I'll polish it.
1
47
40
Short notice, but I'll give another talk at the multi-agent workshop ( on Mar 4th, 14:35-15:10! Catch me in the talk or panel session if you're still around 🤗
📣📣 Attending #AAAI25 next week? I will give two talks, about "Agent Workflow Memory" and "The Agent Company", and sit on the panel afterward 🎤 Join me at the talk and panel sessions at the WebAgent workshop on Mar 3rd!
0
72
27
Cannot attend #ICLR2025 in person (will be at NAACL and Stanford soon!), but do check out 👇
▪️ Apr 27: "Exploring the Pre-conditions for Memory-Learning Agents", led by @viishruth and Vishwa Shah, at the SSI-FM workshop
▪️ Apr 28: our @DL4Code workshop with a fantastic lineup of works &…
Just 6 days until #DL4C! 🗓️ Daniel Fried (CMU / Meta AI) @dan_fried @AIatMeta will be sharing insights on how inducing functions from code makes LLM agents smarter and more efficient. Don't miss it! See you Sunday! #ICLR2025 #iclr
0
58
24
I will be at #NAACL2025 with my cat 🐈, presenting three works:
- Apr 30: CodeRAG-Bench, co-led with @AkariAsai:
- May 2: CowPilot, led by @FariaHuqOaishi:
- May 2: Fail-TaLMs, led by Eduardo Treviño and Hugo Contant, that benchmark…
1
50
35
Excited to share that our TroVE is now accepted at #ICML2024! Building TroVE was such a fun experience in exploring LM autonomy 🥳 I will also present this work at our CMU Agent Workshop ( today; stop by if you're interested!
Do you find LM-written programs too complex to understand? Do bugs often pop up in these solutions? Check out *TroVE*, a training-free method to create accurate, concise, and verifiable solutions by inducing tools. 🔗:
1
6
38
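To illustrate the tool-induction idea at a high level (this is my own toy sketch, not the TroVE implementation): keep a toolbox of candidate helper functions, track how often they are reused across tasks, and prune the ones that don't earn their keep.

```python
from collections import Counter

class Toolbox:
    def __init__(self):
        self.tools: dict[str, str] = {}  # tool name -> function source
        self.usage = Counter()           # tool name -> times reused

    def add(self, name: str, source: str) -> None:
        self.tools.setdefault(name, source)

    def record_use(self, name: str) -> None:
        self.usage[name] += 1

    def prune(self, min_uses: int = 2) -> None:
        """Drop tools that were not reused often enough to keep."""
        self.tools = {n: s for n, s in self.tools.items() if self.usage[n] >= min_uses}

toolbox = Toolbox()
toolbox.add("mean", "def mean(xs): return sum(xs) / len(xs)")
toolbox.add("shout", "def shout(s): return s.upper()")
toolbox.record_use("mean")
toolbox.record_use("mean")
toolbox.prune()
print(list(toolbox.tools))  # ['mean'] survives; 'shout' was never reused
```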
A fantastic trip to Urbana-Champaign 🚜, where I gave a talk at the UIUC iSE Speaker Series about "Solving Real-World Tasks via Program Generation". Many thanks to my hosts @LingmingZhang @JiaweiLiu_ and Yinlin Deng!! Beautiful campus, great Chinese food, and a lot of corn fields 🌽 ;)
3
1
34
Current methods improve program efficiency at the cost of sacrificing correctness 😟 Check out our new benchmark, ECCO, targeting correctness-preserving program optimization, along with a full set of explorations of various methods 🌐
Can current code LMs generate sufficiently efficient programs? 🤔 More importantly, can these LMs improve code efficiency without sacrificing correctness? Check out ECCO, our code-gen benchmark for correctness-preserving program optimization! 🧵 1/n
0
5
30
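The evaluation idea can be sketched simply: an "optimized" program only counts if it still passes every test, and speedup is measured on top of that. This is an illustrative harness under those assumptions, not ECCO's actual evaluation code.

```python
import time

def evaluate(candidate, reference, test_inputs):
    """Return (correct, speedup) for a candidate program vs. a reference."""
    correct = all(candidate(x) == reference(x) for x in test_inputs)
    if not correct:
        return False, 0.0  # efficiency gains don't count without correctness

    def timed(fn):
        start = time.perf_counter()
        for x in test_inputs:
            fn(x)
        return time.perf_counter() - start

    return True, timed(reference) / timed(candidate)

slow = lambda n: sum(i for i in range(n))  # O(n) reference solution
fast = lambda n: n * (n - 1) // 2          # O(1) "optimized" candidate
ok, speedup = evaluate(fast, slow, [10_000, 50_000])
print(ok, f"{speedup:.1f}x")               # correct, with a large speedup
```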
Wonder how good agents are at real-world tasks ⁉️ In TheAgentCompany, we create difficult tasks featuring varied skills (DS, research, finance, etc.), support evaluation both end-to-end and on intermediate checkpoints, and benchmark agents built on top-performing open/closed LLMs.
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science? In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.
1
5
25
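A toy illustration of checkpoint-based scoring (my own simplification, not TheAgentCompany's actual scorer): award partial credit in proportion to the intermediate checkpoints an agent passes, even when the task isn't completed end-to-end.

```python
def score_task(checkpoints: list[tuple[str, bool]]) -> float:
    """Each checkpoint is (description, passed); credit is proportional."""
    passed = sum(1 for _, ok in checkpoints if ok)
    return passed / len(checkpoints)

checkpoints = [
    ("cloned the repository", True),
    ("ran the test suite", True),
    ("opened a pull request", False),
]
print(f"{score_task(checkpoints):.2f}")  # 0.67 partial credit, not all-or-nothing
```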
Excited to co-organize the DL4C workshop at @iclr_conf'25. Check out our call for papers and submit your interesting codegen paper! 😉
🚀 Excited to share that the 3rd Deep Learning for Code workshop is back at @iclr_conf'25! This year we'll focus on emergent challenges in the field, e.g., agents, post-training, developer productivity, open science, and benchmarking for code. Submit by Feb 3 ⬇️
0
10
23
Excited to share that ECCO has been accepted to #EMNLP2024 🥳 thanks to all the great work of Siddhant and @viishruth. Come check out our benchmark if you haven't!
Current methods improve program efficiency at the cost of sacrificing correctness 😟 Check out our new benchmark, ECCO, targeting correctness-preserving program optimization, along with a full set of explorations of various methods 🌐
2
4
20
Sad to miss #EMNLP2024, but do check out our paper "ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?", presented by @viishruth and Siddhant. Tuesday 11-12:30 at Poster Session 02 ‼️
I'm attending #EMNLP2024 in Miami from 11-16th Nov to present ECCO on Tuesday 🏖️ Looking forward to meeting folks and chatting more about code generation and LLM agents!
0
2
16
Super fun project w/ @maojiayuan, @dan_fried, and @gneubig ❤️🔥 Check out our code here: Don't hesitate to reach out if you have any questions, thoughts, or feedback! 🤗
1
0
10
Unfortunately I'm not able to attend ICML due to visa issues 🥲 But do check out our poster at Hall C 4-9 #615, Thu 1:30-3 pm ‼️ I'm also happy to chat online about TroVE, or any topics related to code gen, tool use, and agents.
Do you find LM-written programs too complex to understand? Do bugs often pop up in these solutions? Check out *TroVE*, a training-free method to create accurate, concise, and verifiable solutions by inducing tools. 🔗:
5
2
10
Join us at the Agent Workshop! It will be a lot of fun: talks, tutorials, a hackathon, posters, … 🥳
On May 2-3, we're going to have a big event in Pittsburgh about LLM agents. We have invited talks from great speakers inside and outside CMU, student research presentations and posters, and tutorials and discussions! Come join us on the CMU campus, and register at
0
0
9
Glad to see more work targeting open-domain library usage, just like our ODEX benchmark (.
Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks!. BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks.
0
1
9
Our paper is unfortunately on hold at arXiv; we'll share the link once it's ready! Meanwhile, you can find our paper at: and our awesome tool repo: Thanks to my collaborators @ChengZhoujun @_Hao_Zhu and advisors @dan_fried @gneubig.
2
2
7
Very pleasant experience co-leading the project w/ @AkariAsai; huge thanks to @XinyanVYu @frankxu2004 @YiqingXieNLP for the contributions, and @gneubig @dan_fried for the great advice 🫶 Want to add your data to CodeRAG-Bench? Create a PR! Want to try more models? Submit to our leaderboard!
0
1
8
When augmented with annotated or retrieved documents from canonical sources, pass@1 improves for most codegen models and on most tasks. The exception: on open-domain tasks, GPT and DeepSeekCoder gain little from library docs, possibly due to their prior familiarity with these docs 🧐
1
0
7
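For intuition, doc-augmented code generation can be sketched as: retrieve relevant documentation, prepend it to the coding query, and let the model generate. The retriever and prompt format below are illustrative stand-ins, not the CodeRAG-Bench pipeline.

```python
def retrieve_docs(query: str, doc_store: dict[str, str], k: int = 2) -> list[str]:
    """Rank documentation entries by naive word overlap with the query."""
    q = set(query.lower().split())
    return sorted(doc_store.values(),
                  key=lambda doc: len(q & set(doc.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Prepend retrieved docs to the coding query for the code LM."""
    context = "\n\n".join(docs)
    return f"# Relevant documentation:\n{context}\n\n# Task: {query}\n"

doc_store = {
    "json.loads": "json.loads(s): parse a JSON string into a Python object.",
    "json.dumps": "json.dumps(obj): serialize a Python object to a JSON string.",
}
prompt = build_prompt("Parse a JSON string into a dict",
                      retrieve_docs("parse JSON string", doc_store))
print(prompt)  # this prompt would be sent to the code LM
```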
Joint work with @apurvasgandhi @gneubig @dan_fried. Check out the following resources for more details:
🔗 Preprint:
💻 Codebase:
0
0
6
Enjoyable work with my collaborators @shuyanzhxyc, @dan_fried, @gneubig, and three students: Yihan, Shuyi, @theryanliu. Stay tuned for more info about our presentation. Happy to chat more at the conference too!
0
0
3
Fun work w/ @gneubig and @dan_fried! Check out our code repository at: Let us know what other interesting tasks you want to try TroVE on!
0
2
5
Curious about how to generate images & videos with frozen LLMs? Come check out our spotlight poster on 12/12, 17:15-19:15 #NeurIPS.
🙋 How to do multimodal generation like Gemini with a text-only LLM, without tuning?
✅ SPAE Tokenizer is all you need!
🔥 NeurIPS'23 Spotlight ⬇️
📑 SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
🕒 12/12 Tue 17:15-19:15 CST
🌍 Spotlight poster #118
0
0
4
Exciting news! 🥳.
I am thrilled to announce that I will be joining @DukeU . @dukecompsci as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
0
0
4
@ShunyuYao12 On the other hand, I think "using tools" is a subset of "taking actions". Agents can take actions through actuators, where actuators can be part of their body (move_arm) or external tools (use_rod). For both points you can find more detailed elaboration in Section 2 of our survey!
0
0
3
@EliasEskin @ArchikiPrasad @mohitban47 @uncnlp Interesting work! ReGAL seems to share a lot of findings with our work TroVE: (reusable tools, increased task accuracy, easier to use, etc.). Happy to see that this program-abstraction idea can improve tasks in this work too!
0
1
3
Filtering noisy contexts, or training more robust models?
How Faithful are RAG Models? This new paper aims to quantify the tug-of-war between RAG and LLMs' internal prior. It focuses on GPT-4 and other LLMs on question answering for the analysis. It finds that providing correct retrieved information fixes most of the model…
0
0
3
@yugu_nlp Great question! An "optimal" order would potentially be arranging the examples from easiest to hardest. We stick to the original order because WebArena intentionally ordered their examples (in a way that previous tasks won't affect later ones). Still, that would be an…
0
0
2
@ShunyuYao12 I think "language agents" are a subset of "LMs"; whether you can call an LM an LM-based agent depends on the task and its activities.
0
0
2
@storyweaver48 @SIGIRConf @HaoyuDong9 Thank you!! The video should soon be available on the ACM YouTube channel:
1
0
1
@somakaditya @ChengZhoujun @_Hao_Zhu @dan_fried @gneubig Thanks for sharing your work! We also have a collection of awesome TALM papers in this repo: Feel free to create a PR and add your work!
0
0
1
Check out more details here 👇
Agent Workflow Memory:
The Agent Company:
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science? In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.
0
0
1