Zhiyu Yang
@SauceTesla
53 Followers · 7K Following · 23 Media · 2K Statuses
PhD @ UT Dallas, Prev. RA @ Singapore Management University, Prev. Research Intern @ THUNLP
Joined April 2018
🔥Thrilled to share: our paper “Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors” was selected as an Oral at EMNLP 2025. 🥳 Paper:
arxiv.org
ArcMemo yields +7.5% relative on ARC-AGI vs o4-mini (same backbone). It extends the LLM idea of “compressing knowledge for generalization” into a lightweight, continually learnable abstract memory—model-agnostic and text-based. Preprint: Lifelong LM Learning via Abstract Memory
CWM shows that reasoning can benefit from step-by-step simulation of code execution. 🔹 Our latest evaluation results show that CWM achieves 47% accuracy on LogicIFEval, ranking #1 among all tested public models! 📄 LogicIF Paper: https://t.co/SdczQgG3XA This result suggests
New from Meta FAIR: Code World Model (CWM), a 32B-parameter research model designed to explore how world models can transform code generation and reasoning about code. We believe in advancing research in world modeling and are sharing CWM under a research license to help empower
Hopefully we will also feature this work at LAW@NeurIPS 2025, where, by the time I made the submission last month, the story had already become "Demystify the Potential of Large Language Models as World Models of Code."
I want to quietly mention that we basically came up with the same idea of code execution as world models six months ago, and it turned out to be an EMNLP'25 paper (top 0.5% meta score). Check out: https://t.co/8lKZk9TNSA . Glad to see Meta pushing it a lot further.
(🧵) Today, we release Meta Code World Model (CWM), a 32-billion-parameter dense LLM that enables novel research on improving code generation through agentic reasoning and planning with world models. https://t.co/BJSUCh2vtg
Why does AI sometimes fail to generalize, and what might help? In a new paper, we highlight the latent learning gap — which unifies findings from language model weaknesses to agent navigation — and suggest that episodic memory complements parametric learning to bridge it. Thread:
Another fantastic @AIatMeta paper. Large language models keep redoing the same work inside long chains of thought, so this paper teaches the model to compress those recurring steps into small named behaviors that it can recall later or even learn into its weights. Shows that
LeRobot SO101 setup just got 50% cheaper! You can now teleoperate your follower arm right from your phone. 🤯 But that's not all. Our new pipeline feature lets you record and train AI models in end-effector space, or with any other features. The possibilities are endless!
Are expensive labeled data and rejection sampling truly necessary for developing self-improving reasoning models? Introducing Unsupervised Prefix Fine-Tuning (UPFT) -- an efficient method that trains models on only the first 8-32 tokens of single self-generated solutions,
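The mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the data-preparation idea as stated in the tweet: rather than fine-tuning on full (and possibly wrong) self-generated solutions, keep only each solution's opening tokens. The function name, the whitespace stand-in for a tokenizer, and the example data are all my assumptions, not the paper's actual implementation.

```python
# Minimal sketch of prefix-only fine-tuning data prep (UPFT-style):
# pair each problem with only the first k tokens of its self-generated
# solution, so training focuses on the opening reasoning steps.

def make_prefix_examples(problems, solutions, k=16):
    """Build (prompt, target) pairs where the target is a k-token prefix."""
    examples = []
    for problem, solution in zip(problems, solutions):
        tokens = solution.split()          # whitespace split stands in for a real tokenizer
        prefix = " ".join(tokens[:k])      # keep only the first k tokens
        examples.append({"prompt": problem, "target": prefix})
    return examples

if __name__ == "__main__":
    problems = ["What is 17 * 24?"]
    solutions = [
        "First decompose 17 * 24 as 17 * 20 + 17 * 4 which is 340 + 68 so the answer is 408"
    ]
    for ex in make_prefix_examples(problems, solutions, k=8):
        print(ex["target"])
```

The appeal, if I read the tweet right, is that no labels or rejection sampling are needed: the prefix is kept regardless of whether the full solution was correct.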
Automated software testing is growing in importance in the era of AI-assisted coding. Agentic coding systems accelerate development but are also unreliable. Agentic testing — where you ask AI to write tests and check your code against them — is helping. Automatically testing
🚀 Introducing RAST: Reasoning Activation via Small Model Transfer! ✨ RAST adjusts key "reasoning tokens" at decoding time using insights from smaller RL-tuned models — no full RL tuning for large models! ⚡ Efficient & Performant,🧠Scalable & Easy,📉 Up to 50% less GPU memory!
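The decoding-time adjustment described above resembles proxy-style logit steering, so here is a toy sketch in that spirit: add the logit shift that RL tuning induced in a small model onto the large model's next-token logits. The delta-combination rule, the `alpha` knob, and every name here are my assumptions for illustration, not RAST's published algorithm.

```python
# Hypothetical sketch of decoding-time reasoning activation:
# steer a large model's next-token logits by the shift between a small
# RL-tuned model and its small base counterpart.

import math

def steer_logits(large_logits, small_rl_logits, small_base_logits, alpha=1.0):
    """Add the small model's RL-induced logit shift onto the large model."""
    return [
        l + alpha * (rl - base)
        for l, rl, base in zip(large_logits, small_rl_logits, small_base_logits)
    ]

def softmax(logits):
    m = max(logits)                     # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

if __name__ == "__main__":
    large = [2.0, 1.0, 0.5]             # large model's next-token logits
    small_rl = [1.0, 2.5, 0.5]          # small RL-tuned model prefers token 1
    small_base = [1.0, 1.0, 0.5]        # small base model is indifferent
    probs = softmax(steer_logits(large, small_rl, small_base))
    print(probs.index(max(probs)))      # steering shifts the argmax to token 1
```

Because only the small model needs RL tuning, the large model is touched only at decode time, which is consistent with the tweet's "no full RL tuning for large models" claim.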
Self-Improving Embodied FMs • 2-stage recipe: SFT + online RL w/ self-predicted rewards • Boosts sample efficiency: 10% robot time → 45%→75% success (vs. 8× data → only 60%) • Unlocks autonomous skill acquisition beyond imitation data
🚨Reasoning LLMs are e̵f̵f̵e̵c̵t̵i̵v̵e̵ ̵y̵e̵t̵ inefficient! Large language models (LLMs) now solve multi-step problems by emitting extended chains of thought. During the process, they often re-derive the same intermediate steps across problems, inflating token usage and
Latent Learning: why LLMs miss future-useful info • AI often fails at latent learning (e.g. reversal curse) • Cognitive science points to episodic memory as a fix • Oracle retrieval shows better generalization • Episodic memory + parametric learning = more human-like AI
Robust tool calling is the key to general agentic intelligence. Easier said than done. This is a fantastic paper on improving and scaling function calling capabilities in AI agents. (bookmark it) Here are my notes:
Very cool work from Meta Superintelligence Lab. They are open-sourcing Meta Agents Research Environments (ARE), the platform they use to create and scale agent environments. Great resource to stress-test agents in environments closer to real apps. Read on for more:
📣 Check out our #EMNLP2025 paper on a new benchmark for LLMs as Data Science Code Debuggers 🚀
I’m honored and excited that our paper was selected as an Oral at EMNLP 2025, though sadly I cannot attend in person due to administrative restrictions. My supervisor, Professor Yang Deng, will present in Suzhou. See you all online!
Resources: 📄 Paper: https://t.co/EoK7FTxuDO 💻 Code & data: https://t.co/8GsEvsP1Ix This work is supervised by Prof. Yang Deng and in collaboration with my previous mentors @ShuoWang_NLP & Yukun Yan! #EMNLP2025 #LLM #Debugging #DataScience
github.com
Open source data, annotation and evaluation framework for DSDBench paper, accepted at EMNLP 2025 Oral. - KevinCL16/DSDBench
❗️This reveals a key insight: Agentic systems can “pass” benchmarks by over-rewriting, but DSDBench isolates whether models actually understand faulty execution flows, where trustworthiness really matters.
We went further: Agentic evaluations. ▶️ A Claude-based Cursor agent achieved ~49% pass@1 on single-bug fixes, higher than standalone Claude’s 34% localization. ▶️ But in unconstrained mode, agents often brute-forced success by rewriting entire code, masking true reasoning gaps.