Zhiyu Yang

@SauceTesla

Followers: 53 · Following: 7K · Media: 23 · Statuses: 2K

PhD @ UT Dallas, Prev. RA @ Singapore Management University, Prev. Research Intern @ THUNLP

Joined April 2018
@SauceTesla
Zhiyu Yang
2 months
🔥Thrilled to share: our paper “Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors” was selected as an Oral at EMNLP 2025. 🥳 Paper:
arxiv.org
LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs'...
1
0
19
@matt_seb_ho
Matt Ho
1 month
ArcMemo yields +7.5% relative on ARC-AGI vs o4-mini (same backbone). It extends the LLM idea of “compressing knowledge for generalization” into a lightweight, continually learnable abstract memory—model-agnostic and text-based. Preprint: Lifelong LM Learning via Abstract Memory
4
30
129
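A minimal sketch of the abstract-memory idea described in the tweet above, assuming a plain text store with keyword-overlap retrieval; the class and method names here are illustrative, not the ArcMemo authors' API.

```python
# Toy text-based abstract memory: store short natural-language "concepts"
# distilled from past solutions and prepend the most relevant ones to the prompt.
from dataclasses import dataclass, field


@dataclass
class AbstractMemory:
    concepts: list[str] = field(default_factory=list)

    def add(self, concept: str) -> None:
        # Deduplicate trivially; a real system would merge and abstract entries.
        if concept not in self.concepts:
            self.concepts.append(concept)

    def retrieve(self, task_description: str, k: int = 3) -> list[str]:
        # Toy relevance score: keyword overlap with the task description.
        words = set(task_description.lower().split())
        scored = sorted(
            self.concepts,
            key=lambda c: len(words & set(c.lower().split())),
            reverse=True,
        )
        return scored[:k]


memory = AbstractMemory()
memory.add("If the output grid is larger, look for a tiling or mirroring rule.")
memory.add("Count connected components before guessing a color-mapping rule.")

task = "The output grid is 2x the input grid in both dimensions."
prompt = "\n".join(memory.retrieve(task)) + "\n\nTask:\n" + task
print(prompt)  # retrieved concepts are prepended to the solver's prompt
```

Because the memory is plain text, it can sit in front of any backbone model, which is the "model-agnostic" property the tweet highlights.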
@_Guuuuuuuu_
Mian Zhang
2 months
CWM shows that reasoning can benefit from step-by-step simulation of code execution. 🔹 Our latest evaluation results show that CWM achieves 47% accuracy on LogicIFEval, ranking #1 among all tested public models! 📄 LogicIF Paper: https://t.co/SdczQgG3XA This result suggests
@AIatMeta
AI at Meta
2 months
New from Meta FAIR: Code World Model (CWM), a 32B-parameter research model designed to explore how world models can transform code generation and reasoning about code. We believe in advancing research in world modeling and are sharing CWM under a research license to help empower
0
2
6
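One way to picture "step-by-step simulation of code execution": collect a ground-truth execution trace and compare it against what a model predicts. The harness below is an illustrative sketch using Python's sys.settrace, not the CWM or LogicIF pipeline.

```python
# Record (line number, local variables) at every executed line of a function,
# producing the kind of step-by-step trace a code world model is asked to predict.
import sys


def trace_execution(fn, *args):
    steps = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps


def buggy_mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / (len(xs) - 1)  # off-by-one bug in the denominator


result, steps = trace_execution(buggy_mean, [1, 2, 3])
for lineno, local_vars in steps:
    print(lineno, local_vars)
print("final:", result)  # 3.0 instead of 2.0 -- the trace shows where it diverges
```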
@Lyubh22
Bohan Lyu
2 months
Hopefully we will also feature this work at LAW@NeurIPS 2025, where the story had already become “Demystify the Potential of Large Language Models as World Models of Code” by the time I made the submission last month.
@KnightNemo_
Siqiao Huang
2 months
I want to quietly mention that we basically came up with the same idea of code execution as world models six months ago, and it turned out to be an EMNLP'25 paper (top 0.5% meta score). Check out: https://t.co/8lKZk9TNSA. Glad to see Meta pushing it a lot further.
1
2
6
@syhw
Gabriel Synnaeve
2 months
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
60
313
2K
@AndrewLampinen
Andrew Lampinen
2 months
Why does AI sometimes fail to generalize, and what might help? In a new paper, we highlight the latent learning gap — which unifies findings from language model weaknesses to agent navigation — and suggest that episodic memory complements parametric learning to bridge it. Thread:
19
101
553
@rohanpaul_ai
Rohan Paul
2 months
Another fantastic @AIatMeta paper. Large language models keep redoing the same work inside long chains of thought, so this paper teaches the model to compress those recurring steps into small named behaviors that it can recall later or even learn into its weights. Shows that
16
87
532
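A rough sketch of the "recurring steps become named behaviors" idea the tweet describes: mine steps that repeat across chains of thought and expose them as a short handbook the model can recall instead of re-deriving. The names and the naive step-splitting are illustrative, not the paper's pipeline.

```python
# Promote reasoning steps that recur across chains of thought to named behaviors.
from collections import Counter


def mine_behaviors(chains_of_thought, min_count=2):
    step_counts = Counter(
        step.strip().lower()
        for cot in chains_of_thought
        for step in cot.split(".")
        if step.strip()
    )
    frequent = [step for step, n in step_counts.items() if n >= min_count]
    return {f"behavior_{i}": step for i, step in enumerate(frequent)}


cots = [
    "Convert the units to meters. Apply the Pythagorean theorem. Round the result",
    "Convert the units to meters. Set up a ratio. Round the result",
]
handbook = mine_behaviors(cots)
prompt_prefix = "\n".join(f"{name}: {step}" for name, step in handbook.items())
print(prompt_prefix)  # recalled by name in future chains instead of re-derived
```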
@LeRobotHF
LeRobot
2 months
LeRobot SO101 setup just got 50% cheaper! You can now teleoperate your follower arm right from your phone. 🤯 But that's not all. Our new pipeline feature lets you record and train AI models in end-effector space, or with any other features. The possibilities are endless!
11
66
448
@tuzhaopeng
Zhaopeng Tu
9 months
Are expensive labeled data and rejection sampling truly necessary for developing self-improving reasoning models? Introducing Unsupervised Prefix Fine-Tuning (UPFT) -- an efficient method that trains models on only the first 8-32 tokens of single self-generated solutions,
4
29
160
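A hedged sketch of the data-construction step the UPFT tweet describes: keep only the first 8-32 tokens of a single self-generated solution per prompt. Whitespace splitting stands in for the model's real tokenizer, and the function name is illustrative.

```python
# Build prefix-only fine-tuning targets from self-generated solutions.
import random


def build_prefix_dataset(samples, min_len=8, max_len=32, seed=0):
    """samples: list of (prompt, self_generated_solution) pairs."""
    rng = random.Random(seed)
    dataset = []
    for prompt, solution in samples:
        tokens = solution.split()          # placeholder tokenizer
        k = rng.randint(min_len, max_len)  # prefix length drawn from [8, 32]
        dataset.append({"prompt": prompt, "target": " ".join(tokens[:k])})
    return dataset


samples = [(
    "What is 17 * 24?",
    "First, 17 * 24 = 17 * 20 + 17 * 4 . That is 340 + 68 , so the answer is 408 .",
)]
print(build_prefix_dataset(samples)[0]["target"])
```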
@AndrewYNg
Andrew Ng
2 months
Automated software testing is growing in importance in the era of AI-assisted coding. Agentic coding systems accelerate development but are also unreliable. Agentic testing — where you ask AI to write tests and check your code against them — is helping. Automatically testing
74
205
1K
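A minimal sketch of the agentic-testing loop Ng describes: have a model write tests for a module, then run them against your code. `ask_llm` is a placeholder for whatever model client you use; the only real dependency is pytest on the path.

```python
# Generate tests with an LLM, write them to disk, and run them with pytest.
import pathlib
import subprocess


def ask_llm(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return generated pytest code."""
    raise NotImplementedError


def agentic_test(module_path: str) -> int:
    source = pathlib.Path(module_path).read_text()
    test_code = ask_llm(
        "Write pytest unit tests (edge cases included) for this module:\n\n" + source
    )
    test_file = pathlib.Path("test_generated.py")
    test_file.write_text(test_code)
    # Nonzero return code flags failing tests for the coding agent to fix.
    return subprocess.run(["pytest", "-q", str(test_file)]).returncode


# Usage: agentic_test("my_module.py")
```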
@Siru_Ouyang
Siru Ouyang
6 months
🚀 Introducing RAST: Reasoning Activation via Small Model Transfer!
✨ RAST adjusts key "reasoning tokens" at decoding time using insights from smaller RL-tuned models — no full RL tuning for large models!
⚡ Efficient & Performant, 🧠 Scalable & Easy, 📉 Up to 50% less GPU memory!
3
21
117
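One plausible reading of the decoding-time idea above is logit arithmetic: apply the shift between a small RL-tuned model and its small base counterpart to the large base model's logits, instead of RL-tuning the large model. This is my hedged sketch, not necessarily RAST's exact formulation; random arrays stand in for real model outputs.

```python
# Decoding-time logit adjustment: steer the large base model with the delta
# that RL tuning induced in a small model.
import numpy as np


def adjusted_logits(large_base, small_rl, small_base, alpha=1.0):
    """All inputs: logits over the same vocabulary for the current decode step."""
    return large_base + alpha * (small_rl - small_base)


vocab = 5
rng = np.random.default_rng(0)
logits = adjusted_logits(rng.normal(size=vocab),
                         rng.normal(size=vocab),
                         rng.normal(size=vocab))
next_token = int(np.argmax(logits))  # greedy pick from the adjusted distribution
print(next_token)
```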
@arankomatsuzaki
Aran Komatsuzaki
2 months
Self-Improving Embodied FMs
• 2-stage recipe: SFT + online RL w/ self-predicted rewards
• Boosts sample efficiency: 10% robot time → 45%→75% success (vs. 8× data → only 60%)
• Unlocks autonomous skill acquisition beyond imitation data
4
33
247
@Aniket_d98
Aniket Didolkar
2 months
🚨Reasoning LLMs are e̵f̵f̵e̵c̵t̵i̵v̵e̵ ̵y̵e̵t̵ inefficient! Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and
4
35
210
@arankomatsuzaki
Aran Komatsuzaki
2 months
Latent Learning: why LLMs miss future-useful info
• AI often fails at latent learning (e.g. reversal curse)
• Cognitive science points to episodic memory as a fix
• Oracle retrieval shows better generalization
• Episodic memory + parametric learning = more human-like AI
8
32
237
@omarsar0
elvis
2 months
Robust tool calling is the key to general agentic intelligence. Easier said than done. This is a fantastic paper on improving and scaling function calling capabilities in AI agents. (bookmark it) Here are my notes:
9
84
432
@omarsar0
elvis
2 months
Very cool work from Meta Superintelligence Lab. They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments. Great resource to stress-test agents in environments closer to real apps. Read on for more:
39
184
1K
@ydeng_dandy
Yang Deng
2 months
📣 Check out our #EMNLP2025 paper on a new benchmark for LLMs as Data Science Code Debuggers 🚀
@SauceTesla
Zhiyu Yang
2 months
🔥Thrilled to share: our paper “Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors” was selected as an Oral at EMNLP 2025. 🥳 Paper:
0
1
18
@SauceTesla
Zhiyu Yang
2 months
I’m honored and excited that our paper was selected as an Oral at EMNLP 2025, though sadly I cannot attend in person due to administrative restrictions. My supervisor, Professor Yang Deng, will present in Suzhou. See you all online!
0
0
0
@SauceTesla
Zhiyu Yang
2 months
Resources:
📄 Paper: https://t.co/EoK7FTxuDO
💻 Code & data: https://t.co/8GsEvsP1Ix
This work is supervised by Prof. Yang Deng and in collaboration with my previous mentors @ShuoWang_NLP & Yukun Yan! #EMNLP2025 #LLM #Debugging #DataScience
github.com
Open source data, annotation and evaluation framework for DSDBench paper, accepted at EMNLP 2025 Oral. - KevinCL16/DSDBench
1
0
1
@SauceTesla
Zhiyu Yang
2 months
❗️This reveals a key insight: Agentic systems can “pass” benchmarks by over-rewriting, but DSDBench isolates whether models actually understand faulty execution flows, where trustworthiness really matters.
1
0
0
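An illustrative way to catch the over-rewriting failure mode described above (not DSDBench's actual metric): measure how much of the original file a proposed fix changes, so wholesale rewrites can be flagged even when they pass the tests.

```python
# Flag "fixes" that rewrite most of the file instead of making a targeted edit.
import difflib


def rewrite_fraction(original: str, fixed: str) -> float:
    """Rough fraction of the original source that a proposed fix changes."""
    return 1.0 - difflib.SequenceMatcher(None, original, fixed).ratio()


original = "def mean(xs):\n    return sum(xs) / (len(xs) - 1)\n"
targeted = "def mean(xs):\n    return sum(xs) / len(xs)\n"
wholesale = "import numpy as np\n\ndef mean(xs):\n    return float(np.mean(xs))\n"

# A surgical fix changes a small fraction; a wholesale rewrite changes far more,
# even though both can make the tests pass.
print(f"targeted fix:      {rewrite_fraction(original, targeted):.0%} changed")
print(f"wholesale rewrite: {rewrite_fraction(original, wholesale):.0%} changed")
```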
@SauceTesla
Zhiyu Yang
2 months
We went further: Agentic evaluations. ▶️ A Claude-based Cursor agent achieved ~49% pass@1 on single-bug fixes, higher than standalone Claude’s 34% localization. ▶️ But in unconstrained mode, agents often brute-forced success by rewriting entire code, masking true reasoning gaps.
1
0
0
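For reference, pass@1 figures like the ones above are typically computed with the standard unbiased pass@k estimator from Chen et al. (2021); DSDBench's exact harness may differ.

```python
# Unbiased pass@k estimator: n samples per problem, c of which pass.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# e.g. 10 attempts per bug, 5 of them produce a correct fix
print(round(pass_at_k(n=10, c=5, k=1), 2))  # 0.5
```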