Zora Wang

@ZhiruoW

Followers: 1K · Following: 880 · Media: 34 · Statuses: 224

PhD student @LTIatCMU + visiting @StanfordNLP | prev @Amazon Alexa AI, @Microsoft Research Asia | fun 👩🏻‍💻 🐈 💃 🪴 🎶

Joined August 2021
@ZhiruoW
Zora Wang
2 months
Meet ASI: Agent Skill Induction.
A framework for online programmatic skill learning — no offline data, no training.
🧠 Build reusable skills during testing.
📈 +23.5% success, +15.3% efficiency.
🌐 Scales to long-horizon tasks, transfers across websites.
Let's dive in! 🧵
Tweet media one
2
83
159
@ZhiruoW
Zora Wang
2 years
Everyone is using RAG, but most of the retrieved context is noisy! 🚨
Introducing FilCo: “Learning to Filter Context for Retrieval-Augmented Generation”.
TL;DR: Get rid of the irrelevant content using FilCo, and you'll get better outputs.
Preprint:
6
56
282
@ZhiruoW
Zora Wang
9 months
How can we create AI agents that continually improve, learning from past successes?
Presenting 🌟Agent Workflow Memory🌟, which allows agents to induce, learn, and use task workflows from experiences on the fly🪽
Adding AWM to a strong agent improves accuracy by 51.1% on
Tweet media one
7
55
245
@ZhiruoW
Zora Wang
1 year
Tools can empower LMs to solve many tasks. But what are tools anyway?
Our survey studies tools for LLM agents w/
– A formal def. of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of cost-gain trade-off
Tweet media one
2
62
213
@ZhiruoW
Zora Wang
11 months
I will give a tutorial about "LLM for Tabular Data" at @SIGIRConf from 9-12:30 today, together with @HaoyuDong9, my previous mentor at Microsoft Research Asia!
Join our tutorial in person, or check our slides/recording online if you're interested!
Tweet media one
4
32
186
@ZhiruoW
Zora Wang
11 months
Introducing 🔥CodeRAG-Bench🔥 a benchmark for retrieval-augmented code generation! 🔗
- Supports 8 codegen tasks and 5 retrieval sources
- Canonical document annotation for all coding problems
- Robust evaluation of retrieval and end-to-end execution
6
34
146
@ZhiruoW
Zora Wang
1 year
Heading to #EMNLP2023 next week ✈️
If you’re interested in Code Generation🧑‍💻 don't hesitate to check out our two papers!
- ODEX, a challenging benchmark with open-domain coding queries:
- API-assisted code generation for tableQA:
Tweet media one
2
22
94
@ZhiruoW
Zora Wang
1 month
Couldn't agree more on agents that "continually adapt" from "streamed experiences"!
This is exactly what we've envisioned in building online adaptive agents with self-induced evolving memory & skills in AWM ( and ASI (!
Yet still some.
@RichardSSutton
Richard Sutton
2 months
David Silver really hits it out of the park in this podcast. The paper "Welcome to the Era of Experience" is here:
1
66
68
@ZhiruoW
Zora Wang
1 year
Do you find LM-written programs too complex to understand?
Do bugs often pop up in these solutions?
Check out *TroVE*, a training-free method to create accurate, concise, and verifiable solutions by inducing tools.
🔗:
1
21
79
@ZhiruoW
Zora Wang
11 months
Excited to share that our survey has been accepted at the very first @COLM_conf!
Check out our paper if you want to learn more about tool-augmented language models👩‍🔧
@ZhiruoW
Zora Wang
1 year
Tools can empower LMs to solve many tasks. But what are tools anyway?
Our survey studies tools for LLM agents w/
– A formal def. of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of cost-gain trade-off
Tweet media one
1
5
68
@ZhiruoW
Zora Wang
1 year
Due to popular demand, we released our pre-trained models for filtering context to improve RAG accuracy & efficiency:
🌐 [open] for general Wikipedia queries
🧐 [multihop] for complex multi-hop tasks
💬 [fact] & [dialog] for fact verification and dialog tasks
Tweet media one
@ZhiruoW
Zora Wang
2 years
Everyone is using RAG, but most of the retrieved context is noisy! 🚨
Introducing FilCo: “Learning to Filter Context for Retrieval-Augmented Generation”.
TL;DR: Get rid of the irrelevant content using FilCo, and you'll get better outputs.
Preprint:
0
10
65
@ZhiruoW
Zora Wang
3 months
Excited to share that our CowPilot🐮 is accepted to #NAACL 2025 Demo Track!
Definitely check out our user study if you're interested in trying out CowPilot:
@FariaHuqOaishi
Faria Huq | 🦋: fariahuqoaishi
4 months
[1/6] 🤔 Ever wondered if you could collaborate with an agent on web tasks?
We present CowPilot 🐮, a framework for human-agent collaboration in web navigation that allows humans to intervene dynamically. 📄 🌐
0
72
38
@ZhiruoW
Zora Wang
8 months
I'll be at the @COLM_conf from Oct 7-9th, and present our work at the Monday morning poster session.
Come check out our poster if you're interested!
Also feel free to DM me if you want to talk about code generation, tools, and agents 🙌
@ZhiruoW
Zora Wang
1 year
Tools can empower LMs to solve many tasks. But what are tools anyway?. Our survey studies tools for LLM agents w/.–A formal def. of tools.–Methods/scenarios to use&make tools.–Issues in testbeds and eval metrics.–Empirical analysis of cost-gain trade-off
Tweet media one
0
5
55
@ZhiruoW
Zora Wang
30 days
Excited to share that AWM has been accepted at #ICML2025 🥳
Check out our online memory-adaptive agent if you haven't! 🔗
@ZhiruoW
Zora Wang
9 months
How can we create AI agents that continually improve, learning from past successes?
Presenting 🌟Agent Workflow Memory🌟, which allows agents to induce, learn, and use task workflows from experiences on the fly🪽
Adding AWM to a strong agent improves accuracy by 51.1% on
Tweet media one
1
25
50
@ZhiruoW
Zora Wang
3 months
📣📣 Attending #AAAI25 next week?
I will give two talks about "Agent Workflow Memory" and "The Agent Company", and sit at the panel afterward 🎤
Join me at the talk and panel sessions at the WebAgent workshop on Mar 3rd!
1
60
39
@ZhiruoW
Zora Wang
1 year
Our arXiv preprint is released now! 🔗:
If you know other awesome papers on tool use in LLMs, please let us know and feel free to open a PR! 👩‍💻:
@ZhiruoW
Zora Wang
1 year
Tools can empower LMs to solve many tasks. But what are tools anyway?
Our survey studies tools for LLM agents w/
– A formal def. of tools
– Methods/scenarios to use & make tools
– Issues in testbeds and eval metrics
– Empirical analysis of cost-gain trade-off
Tweet media one
1
10
54
@ZhiruoW
Zora Wang
2 months
Totally agree that agents should improve throughout streamed eval! Happy to see Agent Workflow Memory ( mentioned as an example😎
Also read our recent work ASI (Agent Skill Induction) that induces programmatic skills on the fly and further boosts success &.
@ShunyuYao12
Shunyu Yao
2 months
I finally wrote another blogpost: AI just keeps getting better over time, but NOW is a special moment that i call “the halftime”. Before it, training > eval. After it, eval > training. The reason: RL finally works. Lmk ur feedback so I’ll polish it.
1
47
40
@ZhiruoW
Zora Wang
3 months
Short notice but I'll give another talk at the multi-agent workshop ( on Mar 4th, 14:35-15:10!
Catch me in the talk or panel session if you're still around 🤗
@ZhiruoW
Zora Wang
3 months
📣📣 Attending #AAAI25 next week?
I will give two talks about "Agent Workflow Memory" and "The Agent Company", and sit at the panel afterward 🎤
Join me at the talk and panel sessions at the WebAgent workshop on Mar 3rd!
0
72
27
@ZhiruoW
Zora Wang
1 month
Cannot attend #ICLR2025 in person (will be at NAACL and Stanford soon!), but do check out 👇
▪️ Apr 27: "Exploring the Pre-conditions for Memory-Learning Agents" led by @viishruth and Vishwa Shah, at the SSI-FM workshop
▪️ Apr 28: our @DL4Code workshop with a fantastic line of works &.
@DL4Code
Deep Learning For Code @ ICLR'25
1 month
Just 6 days until #DL4C! 🗓️ Daniel Fried (CMU / Meta AI) @dan_fried @AIatMeta will be sharing insights on how inducing functions from code makes LLM agents smarter and more efficient. Don't miss it! See you Sunday! #ICLR2025 #iclr.
0
58
24
@ZhiruoW
Zora Wang
1 month
I will be at #NAACL2025 with my cat 🐈, presenting three works:
- Apr 30: CodeRAG-Bench co-led with @AkariAsai:
- May 2: CowPilot led by @FariaHuqOaishi:
- May 2: Fail-TaLMs led by Eduardo Treviño and Hugo Contant, that benchmark.
1
50
35
@ZhiruoW
Zora Wang
1 year
Excited to share that our TroVE is now accepted at #ICML2024! Building TroVE was such a fun experience in exploring LM autonomy 🥳
I will also present this work at our CMU Agent Workshop ( today, stop by if you're interested!
@ZhiruoW
Zora Wang
1 year
Do you find LM-written programs too complex to understand?
Do bugs often pop up in these solutions?
Check out *TroVE*, a training-free method to create accurate, concise, and verifiable solutions by inducing tools.
🔗:
1
6
38
@ZhiruoW
Zora Wang
7 months
A fantastic trip to Urbana-Champaign🚜 where I gave a talk at the UIUC iSE Speaker Series about "Solving Real-World Tasks via Program Generation". Many thanks to my hosts @LingmingZhang @JiaweiLiu_ and Yinlin Deng!!
Beautiful campus, great Chinese food, and a lot of corn fields🌽 ;)
3
1
34
@ZhiruoW
Zora Wang
10 months
Current methods improve program efficiency at the cost of sacrificing correctness 😟
Check out our new benchmark, ECCO, targeting correctness-preserving program optimization!
As well as a full set of explorations on various methods 🌐
@viishruth
Vishruth Veerendranath
10 months
Can current code LMs generate sufficiently efficient programs? 🤔
More importantly, can these LMs improve code efficiency without sacrificing correctness?
Check out ECCO, our code-gen benchmark for correctness-preserving program optimizations!
🧵 1/n
Tweet media one
0
5
30
@ZhiruoW
Zora Wang
5 months
Wonder how good agents are at real-world tasks⁉️
In TheAgentCompany, we create difficult tasks featuring varied skills (DS, research, finance, etc.), support evaluations end-to-end and on intermediate checkpoints, and further benchmark agents with top-performing open/closed LLMs.
@gneubig
Graham Neubig
5 months
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?
In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.
Tweet media one
1
5
25
@ZhiruoW
Zora Wang
5 months
Excited to co-organize the DL4C workshop at @iclr_conf'25. Check out our call for papers and submit your interesting codegen paper! 😉
@DL4Code
Deep Learning For Code @ ICLR'25
5 months
🚀Excited to share the 3rd Deep Learning for Code workshop is back at @iclr_conf'25! This year we’ll focus on emergent challenges in the field, e.g., agents, post-training, developer productivity, open science, and benchmarking for code. Submit by Feb 3 ⬇️
0
10
23
@ZhiruoW
Zora Wang
8 months
Excited to share that ECCO has been accepted to #EMNLP2024 🥳 thanks to all the great work of Siddhant and @viishruth.
Come check out our benchmark if you haven’t!
@ZhiruoW
Zora Wang
10 months
Current methods improve program efficiency at the cost of sacrificing correctness 😟
Check out our new benchmark, ECCO, targeting correctness-preserving program optimization!
As well as a full set of explorations on various methods 🌐
2
4
20
@ZhiruoW
Zora Wang
7 months
Sad to miss #EMNLP2024 but do check out our paper "ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?" presented by @viishruth and Siddhant.
Tuesday 11-12:30 at Poster Session 02‼️
@viishruth
Vishruth Veerendranath
7 months
I’m attending #EMNLP2024 in Miami from 11-16th Nov to present ECCO on Tuesday 🏖️. Looking forward to meeting folks and chatting more about code generation and LLM agents!
0
2
16
@ZhiruoW
Zora Wang
2 years
FilCo is super easy to use! Come check out our GitHub repository if you’re interested:
Lastly, great thanks to my collaborators and mentors: Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and @gneubig.
0
2
13
@ZhiruoW
Zora Wang
9 months
Oops, almost forgot the paper link:
0
1
11
@ZhiruoW
Zora Wang
11 months
For document retrieval, we evaluate sparse BM25, dense open-checkpoint retrievers, and proprietary OpenAI/Voyage AI APIs. We find:
😖 Across coding tasks, most models struggle outside basic tasks
💸 Larger models are more accurate, yet at the cost of 5x space and 100x latency
Tweet media one
1
1
11
@ZhiruoW
Zora Wang
9 months
Super fun project w/ @maojiayuan, @dan_fried, and @gneubig❤️‍🔥
Check our code here:
Don’t hesitate to reach out if you have any questions, thoughts, or feedback! 🤗
1
0
10
@ZhiruoW
Zora Wang
10 months
Unfortunately I'm not able to attend ICML due to visa issues🥲
But do check our poster at Hall C 4-9 #615, Thu 1:30-3 pm ‼️
I'm also happy to chat online about TroVE, or any topics related to code gen, tool use, and agents.
@ZhiruoW
Zora Wang
1 year
Do you find LM-written programs too complex to understand?
Do bugs often pop up in these solutions?
Check out *TroVE*, a training-free method to create accurate, concise, and verifiable solutions by inducing tools.
🔗:
5
2
10
@ZhiruoW
Zora Wang
1 year
Join us in the Agent Workshop! Will have a lot of fun: talks, tutorials, hackathon, posters, and more 🥳
@frankxu2004
Frank Xu
1 year
On May 2-3, we're going to have a big event in Pittsburgh about LLM Agents. We have invited talks from great speakers inside and outside CMU, student research presentations and posters, tutorials and discussions! Come join us at CMU campus, and register at
0
0
9
@ZhiruoW
Zora Wang
1 year
Glad to see more works targeting open-domain library usage, just as our ODEX benchmark (.
@BigCodeProject
BigCode
1 year
Introducing 🌸BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks!
BigCodeBench goes beyond simple evals like HumanEval and MBPP and tests LLMs on more realistic and challenging coding tasks.
0
1
9
@ZhiruoW
Zora Wang
11 months
CodeRAG-Bench has
📜 Documents from 5 sources: solutions, tutorials, library docs, StackOverflow posts, GitHub repos
🖥️ 8 coding tasks w/ gold doc annotation✍️
Basic: HumanEval, MBPP, LiveCodeBench
Open domain: DS-1000, ODEX
Repo-level: RepoEval, SWE-bench
Retrieval: CodeSearchNet
Tweet media one
1
0
9
@ZhiruoW
Zora Wang
11 months
While RAG is effective in many text-centric tasks, we still lack an understanding of RAG on coding tasks:
1) Which retrievers are good for code retrieval,
2) What documents to retrieve from,
3) How much does RAG help on various codegen tasks?
We built CodeRAG-Bench to study them!
1
0
9
@ZhiruoW
Zora Wang
1 year
Our paper is unluckily on hold by arXiv; we’ll share the link once it’s ready!
Meanwhile, you can find our paper at: and our awesome tool repo:
Thanks to my collaborators @ChengZhoujun @_Hao_Zhu and advisors @dan_fried @gneubig.
2
2
7
@ZhiruoW
Zora Wang
1 year
LMs are powerful for text-gen tasks. But they
- face difficulty with queries that require complex reasoning
- are fundamentally unable to solve tasks beyond their training data
Tools came in to help
- facilitate solving complex queries
- extend LMs’ ability to perform more tasks
Tweet media one
1
1
7
@ZhiruoW
Zora Wang
11 months
Very pleasant experience co-leading the project w/ @AkariAsai, huge thanks to @XinyanVYu @frankxu2004 @YiqingXieNLP for the contributions, and @gneubig @dan_fried for the great advice 🫶
Add your data into CodeRAG-Bench? Create a PR!
Try more models? Submit to our leaderboard!
0
1
8
@ZhiruoW
Zora Wang
10 months
Very nice elaboration on the 3 criteria of good benchmarks: natural, evaluatable, and challenging 📈
@OfirPress
Ofir Press
10 months
New blogpost on How to Build Good Language Modeling Benchmarks:.
1
0
7
@ZhiruoW
Zora Wang
11 months
When augmented with annotated or retrieved documents from canonical sources, pass@1 improves for most codegen models and on most tasks. The exception is open-domain tasks, where GPT and DeepSeekCoder gain little from library docs, possibly due to their prior familiarity with these docs🧐
Tweet media one
1
0
7
@ZhiruoW
Zora Wang
2 years
Most commonly for RAG: retrieve top-K passages → concat all K passages → feed into the model. But, we studied 6 tasks and found 13-60% of the passages irrelevant. Even for the relevant ones, only <10% of the content is actually useful!
This noisy content can hurt model generation🤯
1
0
6
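For readers who want the shape of this in code: below is a minimal, hypothetical sketch contrasting the concat-everything baseline with a filter-then-generate step in the spirit of FilCo. It is not the FilCo implementation; `retrieve`, `is_useful`, and `generate` are placeholder callables standing in for a retriever, a trained context filter, and an LM call.

```python
# Hypothetical sketch only; these helpers are placeholders, not the FilCo API.

def rag_concat_all(query, retrieve, generate, k=5):
    """Common baseline: concatenate all top-K retrieved passages, noise included."""
    passages = retrieve(query, k)
    return generate(query, context="\n\n".join(passages))

def rag_filter_then_generate(query, retrieve, is_useful, generate, k=5):
    """Filter-then-generate: keep only passages judged useful for the query."""
    passages = retrieve(query, k)
    kept = [p for p in passages if is_useful(query, p)]  # drop irrelevant / distracting content
    return generate(query, context="\n\n".join(kept))    # shorter, cleaner input for the LM
```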
@ZhiruoW
Zora Wang
2 months
Joint work with @apurvasgandhi @gneubig @dan_fried. Check out the following resources for more details:
🔗 Preprint:
💻 Codebase:
0
0
6
@ZhiruoW
Zora Wang
11 months
We also explore retrieving documents from open sources, and find some especially helpful sources:
Basic coding: online tutorials and StackOverflow posts help
Open-domain: program solutions and GitHub files help
Repo-level: all sources are less critical than local-repo files
Tweet media one
1
0
6
@ZhiruoW
Zora Wang
2 years
Compared to the baseline methods, FilCo consistently performs better across all tasks, by up to 19%. Not only this: FilCo can effectively filter *both* positive and negative passages, greatly reduce the input size (by 44-64%), and increase context precision for all tasks.
Tweet media one
1
0
5
@ZhiruoW
Zora Wang
7 months
Feel free to check out my slides here:
0
0
6
@ZhiruoW
Zora Wang
1 year
We can use code LMs, but current tool-making pipelines are pretty complex 😕
They need training data, multi-turn refinement, self-verification, or create many non-reusable tools.
All you need for TroVE is execution agreement 🤝
Tweet media one
1
1
5
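As a rough illustration of what "execution agreement" can mean in practice (this is a generic agreement-over-executions sketch, not the TroVE pipeline, and `run_program` is a hypothetical executor): sample several candidate programs, run them, and keep the answer most of them agree on.

```python
from collections import Counter

def select_by_execution_agreement(candidate_programs, run_program):
    """Keep the result that the largest number of sampled programs produce when executed."""
    results = []
    for prog in candidate_programs:
        try:
            results.append(repr(run_program(prog)))  # repr() makes results comparable/hashable
        except Exception:
            continue                                 # crashing candidates simply don't vote
    if not results:
        return None
    winner, _votes = Counter(results).most_common(1)[0]
    return winner
```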
@ZhiruoW
Zora Wang
2 months
Online adaptive agents with ever-growing textual workflow memory (check out our AWM!) already outperform static agent baselines. But we go further — by inducing programmatic skills, agents:
- Expand their action space
- Achieve higher success & efficiency
- Offer verifiability
Tweet media one
1
0
6
@ZhiruoW
Zora Wang
1 year
Want to query tables easily? We use Python programs for QA on arbitrarily structured tables, by
- using Pandas MultiIndex as a unified representation, for relational, hierarchical, and even multiple tables 🗃️
- creating Operation & QA API tools to expand model functionalities ⛏️
Tweet media one
1
1
6
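As a toy illustration of the first bullet (the table, numbers, and question are made up, and this is plain pandas rather than the paper's API tools): a two-level-header table becomes an ordinary DataFrame with a MultiIndex on its columns, so a generated program can query it with standard operations.

```python
import pandas as pd

# Hierarchical (two-level header) table represented with a column MultiIndex.
columns = pd.MultiIndex.from_tuples([
    ("population", "2010"), ("population", "2020"), ("area", "km2"),
])
df = pd.DataFrame(
    [[1_200_000, 1_350_000, 520.3],
     [  740_000,   810_000, 310.8]],
    index=["Springfield", "Shelbyville"],
    columns=columns,
)

# Q: "Which city grew the most between 2010 and 2020?"
growth = df[("population", "2020")] - df[("population", "2010")]
print(growth.idxmax())  # -> Springfield
```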
@ZhiruoW
Zora Wang
1 year
Want a more realistic dataset? Our ODEX benchmark provides open-domain queries that
- cover 79 Python libraries ✅ instead of: built-in grammar in HumanEval ❌ or limited domains ❌
- support 4 NL inputs: English, Spanish, Japanese, Russian 🌏
GitHub:
1
0
6
@ZhiruoW
Zora Wang
1 year
In which scenarios are tools helpful?
📖 knowledge access: beyond parametric knowledge
💼 computation activities: for complex reasoning
🌐 interact w/ the world: for real-time, real-world data
🎞️ non-textual modalities: break the modality boundary
🧠 skilled LMs: using NNs as tools
Tweet media one
1
2
5
@ZhiruoW
Zora Wang
1 year
@chrisgorgo Our ODEX dataset too!
0
1
5
@ZhiruoW
Zora Wang
10 months
could be a useful tool for my literature review🧐.
@gaotianyu1350
Tianyu Gao
10 months
Google cannot find the paper you want? Introducing LitSearch, a retrieval benchmark with realistic and challenging scientific literature search questions. Paper: Data/code:
Tweet media one
0
0
4
@ZhiruoW
Zora Wang
1 year
Enjoyable work with my collaborators: @shuyanzhxyc, @dan_fried, @gneubig, and three students: Yihan, Shuyi, @theryanliu. Stay tuned for more info about our presentation. Happy to chat more at the conference too!
0
0
3
@ZhiruoW
Zora Wang
1 year
How do we evaluate tools?
Testbeds
- existing datasets need reasoning over text, structured data, and images
- API benchmarks, yet with naturalness and executability issues 😧
Metrics: task completion, tool selection. Are these enough? No! Check out our concrete suggestions 👇
Tweet media one
1
0
5
@ZhiruoW
Zora Wang
1 year
Fun work w/ @gneubig and @dan_fried!
Check out our code repository at:
Let us know what other interesting tasks you want to try TroVE on!
0
2
5
@ZhiruoW
Zora Wang
2 months
⏱️ ASI’s efficiency advantage becomes even clearer on long-horizon tasks. Take the change-address example:
😵‍💫 Static agents struggle to follow instructions
🐢 Memory-adaptive agents succeed — but require tens of steps
⚡ ASI solves it with just 3 skill calls — clean and fast
Tweet media one
1
0
5
@ZhiruoW
Zora Wang
2 months
Even when suddenly transferred to a new website, ASI can
🔁 Reuse transferable skills to solve tasks out-of-the-box
🔧 Learn to adapt incompatible skills to the new web environment
Tweet media one
1
0
5
@ZhiruoW
Zora Wang
1 year
We dive deep into the efficiency aspect and empirically analyze the trade-off b/w
- compute costs of tool integration
- performance gain brought by tools
Intriguingly, we reveal that:
- tools may not be helpful to all tasks
- the efficiency of tooling approaches varies a lot!
Tweet media one
2
1
5
@ZhiruoW
Zora Wang
2 years
To remove this noise, we propose FilCo, a method that filters out irrelevant/distracting content and facilitates high-quality generation.
Tweet media one
1
0
5
@ZhiruoW
Zora Wang
1 year
With both CodeLLaMa2-7B and GPT-4, TroVE generates solutions with
- higher accuracy 📈 and
- reduced complexity 📉
- using 79-98% smaller toolboxes
Tweet media one
1
1
5
@ZhiruoW
Zora Wang
2 months
In our streamed online evaluation, the agent processes each NL query as follows:
1️⃣ Attempts the task by generating an action trajectory
2️⃣ If the trajectory is deemed correct by the neural evaluator,
3️⃣ Induces a programmatic skill, then verifies it through test-time execution
Tweet media one
1
0
5
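In code, that loop looks roughly like the sketch below. All callables (`act`, `judge`, `induce_skill`, `verify`) are hypothetical placeholders for the policy, neural evaluator, skill inducer, and execution-based check; this is a paraphrase of the three steps, not the released ASI interface.

```python
def streamed_skill_learning(tasks, act, judge, induce_skill, verify, skills=None):
    """Process natural-language queries one at a time, growing a skill library online."""
    skills = [] if skills is None else skills
    for query in tasks:                        # tasks arrive as a stream, no offline data
        trajectory = act(query, skills)        # 1) attempt the task with current skills
        if judge(query, trajectory):           # 2) neural evaluator deems the trajectory correct
            skill = induce_skill(trajectory)   # 3) abstract a programmatic skill ...
            if verify(skill, query):           #    ... and verify it via test-time execution
                skills.append(skill)           # reusable on later tasks
    return skills
```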
@ZhiruoW
Zora Wang
1 year
Curious about how to generate images & videos with frozen LLMs? Come check our spotlight poster on 12/12, 17:15-19:15 #NeurIPS.
@LijunYu0
Lijun Yu
1 year
🙋 How to do multimodal generation like Gemini with a text-only LLM without tuning?
✅ SPAE Tokenizer is all you need!
🔥 NeurIPS’23 Spotlight⬇️
📑 SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
🕒 12/12 Tue 17:15-19:15 CST
🌍 Spotlight poster #118
Tweet media one
Tweet media two
Tweet media three
0
0
4
@ZhiruoW
Zora Wang
1 year
TroVE also produces solutions that are easier for humans to verify 🧐
We evaluate humans on their accuracy and speed of verifying model-generated solutions. With TroVE, the verifications are:
- 13% more accurate, and
- 31% faster 💨
Tweet media one
1
1
4
@ZhiruoW
Zora Wang
1 year
Interestingly, TroVE can induce diverse functions for datasets of different domains, revealing their unique characteristics!
Tweet media one
1
0
4
@ZhiruoW
Zora Wang
1 year
Exciting news! 🥳.
@shuyanzhxyc
Shuyan Zhou
1 year
I am thrilled to announce that I will be joining @DukeU @dukecompsci as an Assistant Professor in summer 2025. Super excited for the next chapter! Stay tuned for the launch of my lab 🧠🤖
Tweet media one
0
0
4
@ZhiruoW
Zora Wang
2 months
We benchmark ASI on WebArena against:
(i) Static agents using a Claude-3.5 backbone
(ii) Memory-adaptive agents with AWM
📈 ASI boosts success rates by +23.5% and +11.3%
🔋 And reduces trajectory length by −10.7% and −15.3%
Fewer steps. More wins. 😎
Tweet media one
1
0
4
@ZhiruoW
Zora Wang
1 year
To solve programmatic tasks:
- The common approach (middle): writing a tedious, complex, error-prone program
- A better way (right): using high-level functions, i.e., tools
How do we create these tools?
Tweet media one
1
1
4
@ZhiruoW
Zora Wang
9 months
AWM works both offline and online:
With extra data, AWM abstracts common workflows from training examples.
Even with zero extra data, AWM induces workflows from past experiences, adds new workflows to agent memory, and applies them to solve future tasks📈
Tweet media one
1
0
4
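A compact way to picture the two modes (the helper names are hypothetical, not the AWM codebase): offline, workflows are induced once from training trajectories; online, the same induction runs on the agent's own successful experiences as they accumulate.

```python
def induce_workflows(trajectories, abstract):
    """Turn successful trajectories into short, reusable workflow descriptions.
    `abstract` is a placeholder summarizer, e.g. an LM call."""
    return [abstract(t) for t in trajectories if t.get("success")]

def solve_with_workflow_memory(tasks, act, judge, abstract, memory=None):
    """Online mode: memory grows from the agent's own experience.
    Offline mode would simply pre-seed `memory` with induce_workflows(train_set, abstract)."""
    memory = [] if memory is None else memory
    for query in tasks:
        trajectory = act(query, memory)           # induced workflows are provided as context
        if judge(query, trajectory):              # keep only successful experiences
            memory.append(abstract(trajectory))   # induce and store a new workflow
    return memory
```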
@ZhiruoW
Zora Wang
1 year
What is the basic tool use paradigm?
Text generation → complete tool calling expression → trigger tool-execution mode → server returns results to LM → text generation → …
Tweet media one
1
1
4
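That paradigm is essentially a generate-call-observe loop. Below is a bare-bones sketch with a made-up `CALL[name](args)` syntax and placeholder `llm`/`tools` callables; real systems differ in how calls are expressed (JSON, function-calling APIs, code), but the control flow is the same.

```python
import re

CALL_PATTERN = re.compile(r"CALL\[(\w+)\]\((.*?)\)")  # made-up tool-call syntax

def generate_with_tools(llm, tools, prompt, max_turns=5):
    """Alternate between text generation and tool execution until no call is emitted."""
    transcript = prompt
    for _ in range(max_turns):
        generation = llm(transcript)             # text generation ...
        transcript += generation
        match = CALL_PATTERN.search(generation)  # ... until a tool-calling expression appears
        if match is None:
            break                                # plain text with no call: we're done
        name, args = match.groups()
        result = tools[name](args)               # the "server" executes the tool
        transcript += f"\nRESULT: {result}\n"    # results go back to the LM, loop continues
    return transcript
```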
@ZhiruoW
Zora Wang
1 year
@ShunyuYao12 On the other hand, I think "using tools" is a subset of "taking actions". Agents can take actions through actuators, where actuators can be part of their body (move_arm) or external tools (use_rod). For both points you can find more detailed elaborations in Section 2 of our survey!
0
0
3
@ZhiruoW
Zora Wang
1 year
@EliasEskin @ArchikiPrasad @mohitban47 @uncnlp Interesting work! ReGAL seems to share a lot of findings with our work TroVE: (reusable tools, increased task accuracy, easier to use, etc.)
Happy to see that this program abstraction idea can improve tasks in this work too!
0
1
3
@ZhiruoW
Zora Wang
9 months
On WebArena, AWM is the SOTA open method -- 35.5% 🎉 featuring:
- Continuous and increasingly complex workflow induction and utilization
- Rapid learning process with only tens of examples
- Workflows that effectively generalize across task templates
Tweet media one
1
0
3
@ZhiruoW
Zora Wang
1 year
Filtering noisy contexts, or training more robust models?
@omarsar0
elvis
1 year
How Faithful are RAG Models?
This new paper aims to quantify the tug-of-war between RAG and LLMs' internal prior. It focuses on GPT-4 and other LLMs on question answering for the analysis. It finds that providing correct retrieved information fixes most of the model
Tweet media one
0
0
3
@ZhiruoW
Zora Wang
1 year
Advanced topics for LM tooling
- beyond single & multi-turn tool use, how about more complex scenarios: nested, parallel, and even iterative tool calling?
- beyond existing human-crafted tools, how about asking LMs to make tools and reuse them?
Tweet media one
1
1
3
@ZhiruoW
Zora Wang
1 year
What is an LM-used tool?
An interface to a computer program that runs externally to the LM.
What functions do tools have?
Perception👀 action🦾 and computation🔣
Relationship b/w agents and tools?
Agents can use all tools; but LMs using computation tools only are not agents 🤖
1
0
3
@ZhiruoW
Zora Wang
9 months
@yugu_nlp Great question! An "optimal" order would potentially be arranging the examples from the easiest to the hardest. We stick to the original order because WebArena intentionally ordered their examples (in a way that previous tasks won't affect later ones). Still, that would be an.
0
0
2
@ZhiruoW
Zora Wang
11 months
@xueqing_w @SIGIRConf @HaoyuDong9 Thanks for your interest😃 The slides should be available now!
0
0
2
@ZhiruoW
Zora Wang
1 year
@ShunyuYao12 I think "language agents" are a subset of "LMs"; whether you can call an LM an LM-based agent depends on the task and its activities.
0
0
2
@ZhiruoW
Zora Wang
11 months
@hongjin_su Thanks! We have already released the code and data 👇
code:
data:
0
0
2
@ZhiruoW
Zora Wang
9 months
On Mind2Web, AWM also scores the best in cross-task, website, and domain scenarios, among text-based agents. Particularly on test splits that have domain gaps with training examples, AWM online exhibits greater generalization ability as train-test distribution gaps widen
Tweet media one
1
0
2
@ZhiruoW
Zora Wang
10 months
@storyweaver48 @SIGIRConf @HaoyuDong9 Thank you!! The video should soon be available on the ACM YouTube channel:
1
0
1
@ZhiruoW
Zora Wang
1 year
1
0
1
@ZhiruoW
Zora Wang
8 months
@daye_nam @UCIrvine Congrats Daye!!
1
0
1
@ZhiruoW
Zora Wang
1 year
@somakaditya @ChengZhoujun @_Hao_Zhu @dan_fried @gneubig Thanks for sharing your work! We also have a collection of awesome TALM papers in this repo: Feel free to create a PR and add your work!
0
0
1
@ZhiruoW
Zora Wang
17 days
0
0
1
@ZhiruoW
Zora Wang
3 months
Check out more details here👇
Agent Workflow Memory:
The Agent Company:
@gneubig
Graham Neubig
5 months
How far are we from having competent AI co-workers that can perform tasks as varied as software development, project management, administration, and data science?
In our new paper, we introduce TheAgentCompany, a benchmark for AI agents on consequential real-world tasks.
Tweet media one
0
0
1
@ZhiruoW
Zora Wang
1 year
interesting analysis on code verifiers 🧐.
@arankomatsuzaki
Aran Komatsuzaki
1 year
Google presents Many-Shot In-Context Learning.
- Proposes many-shot ICL, i.e., adding up to thousands of examples in context with Gemini 1.5, which boosts the perf significantly
- Using synthetic CoT is very effective in this setting.
Tweet media one
0
0
1