Weijie Su
@weijie444
Followers
6K
Following
4K
Media
55
Statuses
865
Associate Professor @Wharton & CS Penn. coDir @Penn Research #MachineLearning. PhD @Stanford. #MachineLearning #DeepLearning #Statistics #Privacy #Optimization.
Philadelphia, PA
Joined September 2011
Back from NeurIPS with one fun observation: Two different communities--optimization and deep learning theory--were both talking about Muon (aka spectral GD). https://t.co/YEG65gGLqG Lots of emerging (and sometimes contradictory) takes. I’ve got two perspectives sketched in the
Why and how does gradient/matrix orthogonalization work in Muon for training #LLMs? We introduce an isotropic curvature model to explain it. Take-aways: 1. Orthogonalization is a good idea, "on the right track". 2. But it might not be optimal. [1/n]
8
23
221
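The gradient/matrix orthogonalization discussed above can be sketched in a few lines. This is a minimal NumPy illustration, not Muon itself: it computes the exact orthogonal factor UVᵀ from an SVD (setting all singular values to 1), whereas Muon in practice approximates this step with a Newton–Schulz iteration on the momentum matrix.

```python
import numpy as np

def orthogonalize(g):
    """Map a matrix G with SVD U S V^T to U V^T, i.e. replace all
    singular values by 1. This is the "spectral sign" of G that
    Muon-style updates move along (here computed exactly via SVD)."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))   # a stand-in "gradient" matrix
O = orthogonalize(G)
print(np.allclose(O.T @ O, np.eye(3)))  # columns are orthonormal
```

The exact-SVD version is only a sketch for clarity; at LLM scale an iterative polynomial approximation is used because full SVDs of weight-sized matrices are too expensive per step.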
Our ICML 2026 Policy for LLM use in Reviewing
Announcing the ICML 2026 policy for LLMs in reviewing! Reviewers and authors each pick either conservative or permissive LLM use, and will be matched accordingly. Importantly: authors of papers that choose the conservative option must follow the conservative policy when they review.
3
3
18
Doctoral student @HeWeiqing86254 presented his @NeurIPSConf 2025 research on using statistical tests to help detect AI-generated text. This paper was co-authored by Weiqing He, Xiang Li & Tianqi Shang, along with Profs. @lishenlc, @weijie444 & @DrQiLong. https://t.co/pVUqk8URpJ
0
3
7
Of course, citation analysis is tricky. There are many confounders: • Early arXiv visibility 🗓️ • Author fame 🌟 • "Hot" topics 🔥 We tried our best to control for these factors. However, given that top-ranked papers consistently receive ~2x the citations of lower-ranked
0
1
1
Main results of *How to Find Fantastic AI Papers: Self-Rankings as a Powerful Predictor of Scientific Impact Beyond Peer Review* are shown in Figure 2: We grouped papers by how authors privately ranked them. Analysis based on our ICML 2023 ranking experiment
0
1
2
Just landed in SD for #NeurIPS2025. With 5K accepted papers, how to find *Fantastic* AI papers? Solution: ask the authors to rank their own papers Results: Papers ranked #1 by authors received 2x more citations than those they ranked last Paper: https://t.co/oboZXfjdEC
5
8
64
A bit of tech details: We want to maximize the weighted sum (\sum_t \text{freq}(t)\times |t|). This leads to viewing tokenization as a **graph-partitioning problem**, where characters form a weighted graph and merges correspond to partitions that maximize this objective.
0
1
12
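The weighted-sum objective in the tweet above is simple to state in code. A minimal sketch, assuming `corpus_tokens` is the token sequence a candidate tokenization produces over a corpus (the function name and segmentation input are illustrative, not from the paper):

```python
from collections import Counter

def weighted_length_objective(corpus_tokens):
    """Compute sum over distinct tokens t of freq(t) * |t|,
    where freq(t) is how often t appears in the tokenized corpus.
    Note this sum equals the total number of characters covered,
    so for a fixed token budget it rewards longer tokens."""
    freq = Counter(corpus_tokens)
    return sum(f * len(t) for t, f in freq.items())

print(weighted_length_objective(["the", "cat", "the"]))  # 2*3 + 1*3 = 9
```

Viewed this way, choosing which character spans to merge into tokens is a partition of the character graph, and the objective scores each partition by the characters its tokens absorb.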
Here are the comparisons between the Length-MAX tokenizer and BPE:
2
1
9
A new tokenizer is introduced for LLMs: https://t.co/Zuerv1jsZ4 Idea: Instead of merging tokens by frequency (BPE), optimize the tokenizer directly for maximizing average token length, yielding longer, more efficient tokens. Results: 14–18% fewer tokens, faster training &
15
68
453
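To see how a length-oriented vocabulary is used at encode time, here is a hedged sketch of greedy longest-match segmentation: at each position, take the longest vocabulary token that matches, falling back to a single character. This is a generic illustration of longest-token encoding, not necessarily the exact algorithm of the tokenizer announced above.

```python
def longest_match_tokenize(text, vocab):
    """Greedy left-to-right segmentation: always take the longest
    vocab token matching at the current position; fall back to one
    character if nothing in the vocab matches."""
    out, i = [], 0
    max_len = max(map(len, vocab))
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                out.append(piece)
                i += length
                break
    return out

print(longest_match_tokenize("thecat", {"the", "cat", "th", "t"}))
# ['the', 'cat']
```

With longer tokens in the vocabulary, the same text compresses into fewer pieces, which is exactly the "14–18% fewer tokens" effect the thread reports.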
Will present our recent work at NeurIPS with wonderful students and faculty @lihua_lei_stat on inference under data feedback loops https://t.co/f8ElEKSslE, characterizing the exact limit distribution of repeated training without non-asymptotic error compounding.
0
3
11
Heading to SD for #NeurIPS2025 from Dec 4 to 7. Happy to meet and chat.
4
4
54
Honored to follow in the footsteps of so many other great researchers at Penn that I admire.
Congratulations to Aaron Roth (@Aaroth), the Henry Salvatori Professor of Computer & Cognitive Science (@cis_penn), for receiving the 2025-26 George H. Heilmeier Faculty Award for Excellence in Research. Roth has been recognized for his fundamental contributions "to formalizing,
12
5
131
So a week ago, I was complaining gpt 5 doesn't write latex. Gemini 3 is much worse. Basically nothing renders
129
29
879
Two super talented student collaborators @yuhuang42 and @Zixin_Wen developed new theory for length-generalizable CoT reasoning!
Excited to share our recent work! We provide a mechanistic understanding of long CoT reasoning in state-tracking: when do transformers length-generalize strongly, when they stall, and how recursive self-training pushes the boundary. 🧵(1/8)
0
2
33
We're excited to announce the call for papers for #ICML 2026: https://t.co/RDT3zVZDYX See you in Seoul next summer!
🎉ICML 2026 Call for Papers (& Position Papers) has arrived!🎉 A few key changes this year: - Attendance for authors of accepted papers is optional - Originally submitted version of accepted papers will be made public - Cap on # of papers one can be reciprocal reviewer for ...
0
3
21
This suggests that one really should treat the gradient as a matrix for deep learning optimization, and Muon is effective. However, the 'ultimate' optimal method should not involve exact orthogonalization. [5/n]
1
0
5
Theorem 2: when the curvature has a kink ('takes off' suddenly), then matrix orthogonalization is OPTIMAL! This suggests Muon is optimal by assuming an extreme case of the curvature. Vice versa, a kink is necessary: if orthogonalization is optimal, then the curvature must take off.
0
0
8