Wei Liu
@WeiLiu99
Followers
604
Following
3K
Media
18
Statuses
620
#NLProc | Ph.D. Student @hkust @hkustnlp | Prev. @AlibabaGroup @ShanghaiTechUni
Joined February 2018
"What is the answer to 1 + 1?" Large Reasoning Models (LRMs) may generate 1500+ tokens just to answer this trivial question. Too much thinking. Can LRMs be both Faster AND Stronger? Yes. Introducing LASER: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping.
2
33
142
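A minimal sketch of the adaptive length-based reward shaping idea named in the LASER tweet above. The function names, the linear overflow penalty, and the moving-average budget update are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: reward correct answers, discourage overly long correct answers,
# and adapt the token budget to the model's recent behavior.
def shaped_reward(is_correct: bool, num_tokens: int, target_len: float) -> float:
    """Never reward an incorrect answer for being short; give correct answers
    a reward that decays linearly once they exceed the current budget."""
    if not is_correct:
        return 0.0
    overflow = max(0.0, num_tokens - target_len)
    return 1.0 - min(1.0, overflow / target_len)

def update_target_len(target_len: float, correct_lens: list[float],
                      momentum: float = 0.9) -> float:
    """Adapt the budget toward the average length of recent correct answers
    (a made-up adaptation rule for illustration)."""
    if not correct_lens:
        return target_len
    batch_mean = sum(correct_lens) / len(correct_lens)
    return momentum * target_len + (1 - momentum) * batch_mean
```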
Big thanks to @_akhaliq for sharing our work! Flash-DMD decouples the DMD objective, leading to extremely fast distillation convergence. In the second stage, we perform joint reinforcement learning while distilling, using the distillation loss as a natural regularization. Check it out!
Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning
0
4
17
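A rough sketch of the second-stage objective described in the Flash-DMD tweet above: a reward-driven loss regularized by the distillation loss. The weighting knob and loss terms here are placeholder assumptions, not the paper's exact formulation.

```python
import torch

def joint_rl_distill_loss(rl_loss: torch.Tensor,
                          distill_loss: torch.Tensor,
                          lambda_distill: float = 0.5) -> torch.Tensor:
    """Stage-2 style objective: optimize a reward-driven term while the
    distillation term acts as a natural regularizer keeping the few-step
    student close to the teacher. `lambda_distill` is a made-up knob."""
    return rl_loss + lambda_distill * distill_loss
```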
Excited to share our @NeurIPSConf Tutorial on How to Build Agents to Generate Kernels for Faster LLMs (and Other Models!) A collaboration across institutions: @AMD, @Stanford, @GoogleDeepMind, @Arm, @NVIDIAAI, @Meta, @Modular, @UCIrvine, @MLCommons. - If you're an AI
12
37
274
From simple code completion to autonomous software engineering agents - what changed in the past 5 years? We wrote the playbook: "From Code Foundation Models to Agents" - 300 pages covering exact recipes, scaling laws & RL techniques.
2
35
175
At NeoCognition, we aim for the essentials: 1. Deliver applied research that turns agents into true business value; 2. Explore fundamental questions, not just scaling. If you're an agent believer who wants to build differently, send your CV to hiring@neocognition.io. I won't be at
Life update: I moved to Silicon Valley to tackle agents' biggest challenges: plasticity and reliability. Today's agents are smart but brittle. They lack plasticity (continual learning and adaptation) and reliability (stable, predictable behavior with bounded failures). These two
0
8
48
Congrats on the fantastic DeepSeek-V3.2 update! Honored to see our Toolathlon benchmark (https://t.co/RwOa7RxyKf) being used and highlighted for tool-use evaluation. Still amazed by how fast the community is moving: open-source models can now score 35+ on this benchmark. I'm
toolathlon.xyz
Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale - Reasoning-first models built for agents! DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API. DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now. Tech
0
0
5
Very honored to see Tool Decathlon being used and highlighted on the first page of the DeepSeek-V3.2 paper! We have also added some new models to our benchmark (https://t.co/DkByeP6rii), and now we finally have DeepSeek-V3.2 as the first open-source model scoring >35 - a great achievement!
Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale - Reasoning-first models built for agents! DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API. DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now. Tech
0
5
49
As a longtime fan of DeepSeek, I am excited to see DeepSeek-V3.2 progress so fast on our Tool Decathlon benchmark! Update: we have deployed the Toolathlon eval as a public service, so you can now evaluate on Toolathlon without setting up anything:
github.com
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution - hkust-nlp/Toolathlon
Launching DeepSeek-V3.2 & DeepSeek-V3.2-Speciale - Reasoning-first models built for agents! DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API. DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now. Tech
0
4
94
New Blog Alert: Is AdamW overkill for RLVR? We found that vanilla SGD is 1. as performant as AdamW, and 2. naturally 36x more parameter-efficient (much more than a rank-1 LoRA). Looks like a "free lunch". Maybe it's time to rethink the optimizers for RLVR.
16
57
472
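A hypothetical way to probe the claims in the blog-alert tweet above: swap AdamW for vanilla SGD in the policy-update step and measure how many weights actually move. The helper names, threshold, and learning rates are assumptions, not taken from the blog.

```python
import torch

def make_optimizer(model: torch.nn.Module, use_sgd: bool = True, lr: float = 1e-5):
    # Vanilla SGD keeps no per-parameter optimizer state, while AdamW keeps
    # two moment tensors per parameter, so SGD is also cheaper in memory.
    if use_sgd:
        return torch.optim.SGD(model.parameters(), lr=lr)
    return torch.optim.AdamW(model.parameters(), lr=lr)

def fraction_updated(model: torch.nn.Module, reference_state: dict,
                     tol: float = 1e-6) -> float:
    """Fraction of parameters that moved by more than `tol` since the snapshot
    in `reference_state` (built as {name: p.detach().clone()})."""
    moved, total = 0, 0
    for name, p in model.named_parameters():
        delta = (p.detach() - reference_state[name]).abs()
        moved += (delta > tol).sum().item()
        total += delta.numel()
    return moved / max(total, 1)
```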
Excited to share ToolOrchestra, an end-to-end RL training framework for orchestrating tools and agentic workflows. Everyone's building agent workflows these days, connecting tools, APIs, and LLMs like LEGO. But here are our findings: Just prompting the agent workflow
25
68
312
We use latent continuous thoughts for retrieval, optimized via the downstream NTP loss and unified under one LLM backbone. Since representations are shared, documents can be precomputed, eliminating two-stage RAG. We match raw-text performance but with a much shorter context budget.
Happy to introduce my internship work at @Apple. We introduce CLaRa: Continuous Latent Reasoning, an end-to-end training framework that jointly trains retrieval and generation! https://t.co/jEapFfeD7D
#RAG #LLMs #Retrieval #Reasoning #AI
1
9
31
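A rough sketch of the precomputation idea described in the CLaRa tweets above: because retrieval and generation share one backbone, each document can be encoded once into a latent vector and retrieved by similarity at query time. All module names here are hypothetical, not CLaRa's actual API.

```python
import torch

class LatentIndex:
    """Precompute one latent vector per document with a shared encoder,
    then retrieve by similarity in that latent space (illustrative only)."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn  # shared backbone: text -> 1-D latent tensor
        self.doc_latents = None

    @torch.no_grad()
    def build(self, documents: list[str]):
        self.doc_latents = torch.stack([self.encode_fn(d) for d in documents])

    @torch.no_grad()
    def retrieve(self, query: str, k: int = 4) -> torch.Tensor:
        q = self.encode_fn(query)
        scores = self.doc_latents @ q          # dot-product similarity
        top = scores.topk(k).indices
        # The selected latents are fed to the generator in place of raw
        # document text, which is what shrinks the context budget.
        return self.doc_latents[top]
```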
Gemini-3-Pro improves over Gemini-2.5-Pro from 10.5% to 36.4% on Toolathlon! Only one step away from Claude-4.5-Sonnet now - very impressive.
We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. - 32 applications and 600+ tools based on real-world software environments - Execution-based, reliable evaluation - Realistic, covering
1
7
32
Releasing a new "Agentic Reviewer" for research papers. I started coding this as a weekend project, and @jyx_su made it much better. I was inspired by a student who had a paper rejected 6 times over 3 years. Their feedback loop -- waiting ~6 months for feedback each time -- was
238
1K
6K
While asynchronous RL is heating up, our algo folks walked in and said: "Synchronous/on-policy guarantees OR high efficiency? No, we want BOTH." So we dropped Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning. Read here:
arxiv.org
Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which...
16
88
696
Building a generative verifier for proof verification, but it has not been very successful - models tend to hack using surface-level features. How can we assign a non-hackable reward for this?
1
7
67
New Work! Is RL black-box weight tinkering? No. We provably show RLVR follows a consistent direction - always updating the same off-principal regions while preserving the model's core spectra. It is a different optimization regime than SFT, so SFT-era PEFT tricks can misfire (like PiSSA, the
7
42
257
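An illustrative diagnostic for the claim in the tweet above: check how much of an RLVR weight update lies outside the pretrained matrix's top-r principal subspace, and how much the leading singular values shift. This is my sketch of such a probe, not the paper's exact analysis.

```python
import torch

def off_principal_fraction(W_before: torch.Tensor, W_after: torch.Tensor, r: int = 16):
    """Return (fraction of the update outside the top-r left singular
    subspace, max relative shift of the top-r singular values)."""
    U, S_before, Vh = torch.linalg.svd(W_before, full_matrices=False)
    delta = W_after - W_before
    U_r = U[:, :r]
    # Project the update onto the span of the top-r principal directions.
    delta_principal = U_r @ (U_r.T @ delta)
    off_frac = 1.0 - delta_principal.norm() ** 2 / (delta.norm() ** 2 + 1e-12)
    S_after = torch.linalg.svdvals(W_after)
    spectra_shift = (S_after[:r] - S_before[:r]).abs() / S_before[:r]
    return off_frac.item(), spectra_shift.max().item()
```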
Is RL bounded by finite data? Introducing RLVE: RL with Adaptive Verifiable Environments. We scale RL with data procedurally generated from 400 environments that dynamically adapt to the trained model - finding supervision signals right at the LM capability frontier and scaling them up. More in the thread.
12
115
472
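A minimal sketch of the "adapt to the trained model" idea in the RLVE tweet above: nudge each procedurally generated environment's difficulty so the model's success rate stays near its capability frontier. The controller and thresholds here are hypothetical, not RLVE's actual mechanism.

```python
def adapt_difficulty(difficulty: int, recent_success_rate: float,
                     low: float = 0.3, high: float = 0.8) -> int:
    """Make an environment harder when the model solves most episodes and
    easier when it solves almost none, keeping supervision at the frontier."""
    if recent_success_rate > high:
        return difficulty + 1
    if recent_success_rate < low:
        return max(1, difficulty - 1)
    return difficulty
```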
1/ Great thread by @IdanShenfeld on Policy Mirror Descent (PMD) and its likely use in models like Kimi K2. It's a powerful technique for stabilizing RL. We'd like to highlight our NeurIPS 2019 work which was one of the first to frame policy optimization as a mirror descent
Everyone's talking about Kimi K2 Thinking and its impressive performance. No full report yet, but judging from the Kimi K2/1.5 reports, it likely uses Policy Mirror Descent - an RL trick that's quietly becoming standard in frontier labs. Let's break down what it is:
1
7
53
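For reference, the textbook form of the Policy Mirror Descent update discussed in the two tweets above: a KL term keeps each new policy close to the previous one, which is the stabilizing effect being described. This is the generic formulation, not necessarily the exact variant used in Kimi K2.

```latex
% Generic PMD step and its closed-form solution.
\pi_{k+1}(\cdot \mid s)
  = \arg\max_{\pi}\;
    \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ A^{\pi_k}(s, a) \right]
    - \frac{1}{\eta}\,\mathrm{KL}\!\left( \pi(\cdot \mid s) \,\middle\|\, \pi_k(\cdot \mid s) \right),
\qquad
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\, e^{\eta\, A^{\pi_k}(s, a)}.
```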
Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch all mistakes. We can go prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a
9
58
349
UPDATE to the "1 bit per episode" analysis (inspired by @johnschulman's post at @thinkymachines): After discussion with @mgostIH, I need to point out that the limit only applies to a *scalar advantage*! REINFORCE with per-timestep advantages can learn O(T) bits when rewards are
Inspired by @thinkymachines's "#LoRA Without Regret" post, I formalized their insight that policy gradient learns ~1 bit per episode via a Bayesian #RL formulation. I prove this is a hard information-theoretic ceiling and extend the analysis to actor-critic methods. Full writeup
1
8
18
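An informal restatement of the counting argument in the two tweets above, treating the sign of each advantage as the feedback signal per episode; this is my paraphrase under that assumption, not the author's exact derivation.

```latex
% One scalar advantage per episode conveys at most ~1 bit of feedback,
% while T per-timestep advantages can convey on the order of T bits.
I_{\text{scalar}} \le H\big(\operatorname{sign}(A)\big) \le 1 \ \text{bit},
\qquad
I_{\text{per-step}} \le \sum_{t=1}^{T} H\big(\operatorname{sign}(A_t)\big) \le T \ \text{bits}.
```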