Zhiyu Yang
@SauceTesla
53 Followers · 7K Following · 23 Media · 2K Statuses
PhD @ UT Dallas, Prev. RA @ Singapore Management University, Prev. Research Intern @ THUNLP
Joined April 2018
🔥Thrilled to share: our paper “Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors” was selected as an Oral at EMNLP 2025. 🥳 Paper:
arxiv.org
ArcMemo yields +7.5% relative on ARC-AGI vs o4-mini (same backbone). It extends the LLM idea of “compressing knowledge for generalization” into a lightweight, continually learnable abstract memory—model-agnostic and text-based. Preprint: Lifelong LM Learning via Abstract Memory
CWM shows that reasoning can benefit from step-by-step simulation of code execution. 🔹 Our latest evaluation results show that CWM achieves 47% accuracy on LogicIFEval, ranking #1 among all tested public models! 📄 LogicIF Paper: https://t.co/SdczQgG3XA This result suggests
New from Meta FAIR: Code World Model (CWM), a 32B-parameter research model designed to explore how world models can transform code generation and reasoning about code. We believe in advancing research in world modeling and are sharing CWM under a research license to help empower
Hopefully we will also feature this work at LAW@NeurIPS 2025, where, by the time I made the submission last month, the story had already become "Demystify the Potential of Large Language Models as World Models of Code."
I want to quietly mention that we basically came up with the same idea of code execution as world models six months ago, and it turned out to be an EMNLP'25 paper (top 0.5% meta score). Check out: https://t.co/8lKZk9TNSA . Glad to see Meta pushing it a lot further.
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
Why does AI sometimes fail to generalize, and what might help? In a new paper, we highlight the latent learning gap — which unifies findings from language model weaknesses to agent navigation — and suggest that episodic memory complements parametric learning to bridge it. Thread:
Another fantastic @AIatMeta paper. Large language models keep redoing the same work inside long chains of thought, so this paper teaches the model to compress those recurring steps into small named behaviors that it can recall later or even learn into its weights. Shows that
LeRobot SO101 setup just got 50% cheaper! You can now teleoperate your follower arm right from your phone. 🤯 But that's not all. Our new pipeline feature lets you record and train AI models in end-effector space, or with any other features. The possibilities are endless!
Are expensive labeled data and rejection sampling truly necessary for developing self-improving reasoning models? Introducing Unsupervised Prefix Fine-Tuning (UPFT) -- an efficient method that trains models on only the first 8-32 tokens of single self-generated solutions,
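The mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the data-preparation idea as stated in the tweet: rather than fine-tuning on full (and possibly wrong) self-generated solutions, keep only each solution's opening tokens. The function name, the whitespace stand-in for a tokenizer, and the example data are all my assumptions, not the paper's actual implementation.

```python
# Minimal sketch of prefix-only fine-tuning data prep (UPFT-style):
# pair each problem with only the first k tokens of its self-generated
# solution, so training focuses on the opening reasoning steps.

def make_prefix_examples(problems, solutions, k=16):
    """Build (prompt, target) pairs where the target is a k-token prefix."""
    examples = []
    for problem, solution in zip(problems, solutions):
        tokens = solution.split()          # whitespace split stands in for a real tokenizer
        prefix = " ".join(tokens[:k])      # keep only the first k tokens
        examples.append({"prompt": problem, "target": prefix})
    return examples

if __name__ == "__main__":
    problems = ["What is 17 * 24?"]
    solutions = [
        "First decompose 17 * 24 as 17 * 20 + 17 * 4 which is 340 + 68 so the answer is 408"
    ]
    for ex in make_prefix_examples(problems, solutions, k=8):
        print(ex["target"])
```

The appeal, if I read the tweet right, is that no labels or rejection sampling are needed: the prefix is kept regardless of whether the full solution was correct.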
Automated software testing is growing in importance in the era of AI-assisted coding. Agentic coding systems accelerate development but are also unreliable. Agentic testing — where you ask AI to write tests and check your code against them — is helping. Automatically testing
🚀 Introducing RAST: Reasoning Activation via Small Model Transfer! ✨ RAST adjusts key "reasoning tokens" at decoding time using insights from smaller RL-tuned models — no full RL tuning for large models! ⚡ Efficient & Performant,🧠Scalable & Easy,📉 Up to 50% less GPU memory!
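The decoding-time adjustment described above resembles proxy-style logit steering, so here is a toy sketch in that spirit: add the logit shift that RL tuning induced in a small model onto the large model's next-token logits. The delta-combination rule, the `alpha` knob, and every name here are my assumptions for illustration, not RAST's published algorithm.

```python
# Hypothetical sketch of decoding-time reasoning activation:
# steer a large model's next-token logits by the shift between a small
# RL-tuned model and its small base counterpart.

import math

def steer_logits(large_logits, small_rl_logits, small_base_logits, alpha=1.0):
    """Add the small model's RL-induced logit shift onto the large model."""
    return [
        l + alpha * (rl - base)
        for l, rl, base in zip(large_logits, small_rl_logits, small_base_logits)
    ]

def softmax(logits):
    m = max(logits)                     # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

if __name__ == "__main__":
    large = [2.0, 1.0, 0.5]             # large model's next-token logits
    small_rl = [1.0, 2.5, 0.5]          # small RL-tuned model prefers token 1
    small_base = [1.0, 1.0, 0.5]        # small base model is indifferent
    probs = softmax(steer_logits(large, small_rl, small_base))
    print(probs.index(max(probs)))      # steering shifts the argmax to token 1
```

Because only the small model needs RL tuning, the large model is touched only at decode time, which is consistent with the tweet's "no full RL tuning for large models" claim.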
Self-Improving Embodied FMs • 2-stage recipe: SFT + online RL w/ self-predicted rewards • Boosts sample efficiency: 10% robot time → 45%→75% success (vs. 8× data → only 60%) • Unlocks autonomous skill acquisition beyond imitation data
🚨Reasoning LLMs are e̵f̵f̵e̵c̵t̵i̵v̵e̵ ̵y̵e̵t̵ inefficient! Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and
Latent Learning: why LLMs miss future-useful info • AI often fails at latent learning (e.g. reversal curse) • Cognitive science points to episodic memory as a fix • Oracle retrieval shows better generalization • Episodic memory + parametric learning = more human-like AI
Robust tool calling is the key to general agentic intelligence. Easier said than done. This is a fantastic paper on improving and scaling function calling capabilities in AI agents. (bookmark it) Here are my notes:
Very cool work from Meta Superintelligence Lab. They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments. Great resource to stress-test agents in environments closer to real apps. Read on for more:
📣 Check out our #EMNLP2025 paper on a new benchmark for LLMs as Data Science Code Debuggers 🚀
I’m honored and excited that our paper was selected as an Oral at EMNLP 2025, though sadly I cannot attend in person due to administrative restrictions. My supervisor, Professor Yang Deng, will present in Suzhou. See you all online!
Resources: 📄 Paper: https://t.co/EoK7FTxuDO 💻 Code & data: https://t.co/8GsEvsP1Ix This work is supervised by Prof. Yang Deng and in collaboration with my previous mentors @ShuoWang_NLP & Yukun Yan! #EMNLP2025 #LLM #Debugging #DataScience
github.com
Open source data, annotation and evaluation framework for DSDBench paper, accepted at EMNLP 2025 Oral. - KevinCL16/DSDBench
❗️This reveals a key insight: Agentic systems can “pass” benchmarks by over-rewriting, but DSDBench isolates whether models actually understand faulty execution flows, where trustworthiness really matters.
We went further: Agentic evaluations. ▶️ A Claude-based Cursor agent achieved ~49% pass@1 on single-bug fixes, higher than standalone Claude’s 34% localization. ▶️ But in unconstrained mode, agents often brute-forced success by rewriting entire code, masking true reasoning gaps.