Tianyuan Zhang
@tianyuanzhang99
Followers
2K
Following
3K
Media
20
Statuses
184
PhDing at @MIT, towards general intelligence and lifelong machines. M.S. from CMU, B.S. from PKU.
Boston
Joined September 2017
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch
5
88
425
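The chunked update the tweet describes can be sketched in a few lines: rather than updating a fast-weight memory once per token, fit it with one gradient step per large chunk of (key, value) pairs, which keeps the work GPU-friendly. Below is a linear-memory toy in NumPy; the actual LaCT memory is nonlinear, and `lact_update`, the shapes, and the learning rate are illustrative assumptions, not the paper's code:

```python
import numpy as np

def lact_update(W, K, V, lr=0.2):
    # One large-chunk test-time-training step: fit the fast-weight
    # memory W with a single gradient step over the whole chunk of
    # (key, value) rows at once, instead of one update per token.
    pred = K @ W.T                    # (chunk, d_out): what W recalls now
    grad = (pred - V).T @ K / len(K)  # gradient of the mean squared recall error
    return W - lr * grad

def lact_read(W, Q):
    # Recall stored values for a chunk of queries Q.
    return Q @ W.T
```

Repeating `lact_update` over incoming chunks drives the recall error down; swapping the linear map for a small network gives the nonlinear memory the tweet refers to.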
Totally agree, pretraining only works when the marginal cost of data is nearly zero.
Tesla
- collects 4.3M hours of driving data
- every day
- for free
- to train a 2DoF system (steering + throttle).
- yet full autonomy remains unsolved.
Frontier robotics startups/labs
- collect or purchase 0.01M–1M hours of data
- every X months
- for millions of dollars
- to
4
6
226
The Representation Autoencoders (RAE) work by @sainingxie's team is fascinating — a brilliant demonstration that high-dimensional diffusion is indeed feasible. In our latest work on semantic encoders, we align a pretrained foundation encoder (e.g., DINOv2) as a visual tokenizer,
We found that visual foundation encoders can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.
2
22
233
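A minimal sketch of the alignment idea in these two posts: keep the foundation encoder frozen and train the tokenizer so its latents point in the same direction as the encoder's features. The cosine-distance objective and the function name below are my assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def alignment_loss(z, f):
    # Cosine-distance alignment between tokenizer latents z and frozen
    # foundation-encoder features f, both shaped (num_tokens, dim).
    # Zero when every latent points in the same direction as its target.
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    fn = f / np.linalg.norm(f, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(zn * fn, axis=1)))
```

Because the loss is scale-invariant, the tokenizer is free to choose its own latent magnitudes while inheriting the semantic directions of the frozen encoder.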
Congrats! Getting such results via a completely new route!
Our new editing model just entered the top-3 on the image editing leaderboards, ahead of GPT-Image, Qwen-Edit, and Flux-Kontext 🚀 We’re taking a very different research path than most—starting with fine-grained regional editing, and aiming toward image generation that feels as
0
0
4
We found that visual foundation encoders can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.
7
72
528
Oh man! How did I miss this blog over the summer 😇 It derives the derivative of the Muon optimizer. Would be very interesting to try in test-time training.
https://t.co/EGIwz9VeME discusses the derivative calculation of the msign operator. If you are interested in the combination of “TTT + Muon”, like https://t.co/u9qW6lWqBH, this might be helpful to you.
0
1
20
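For context on the msign operator the thread mentions: Muon orthogonalizes its update by approximating the matrix sign, i.e. the orthogonal polar factor UVᵀ of the gradient, with a Newton–Schulz iteration. A sketch using the textbook cubic iteration (Muon itself uses a tuned quintic variant, and the step count here is a conservative guess):

```python
import numpy as np

def msign(G, steps=20):
    # Approximate the matrix sign of G, i.e. the orthogonal factor U V^T
    # of its polar decomposition, via the classic cubic Newton-Schulz
    # iteration. Normalizing by the spectral norm first puts every
    # singular value in (0, 1], where the iteration converges to 1.
    X = G / np.linalg.norm(G, ord=2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

The result has approximately orthonormal columns, which is exactly the property the derivative calculation in the linked blog differentiates through.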
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here! 🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. at 32K+ context!) 🔹Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &
9
54
552
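For readers new to the architecture above, the gated delta rule at the core of Gated DeltaNet can be written as a per-token recurrence: S_t = α_t (S_{t−1} − β_t (S_{t−1} k_t) k_tᵀ) + β_t v_t k_tᵀ, with output o_t = S_t q_t. The loop below is a readability sketch, not the chunked hardware-efficient kernel used in practice, and the key normalization is an assumption carried over from DeltaNet:

```python
import numpy as np

def gated_deltanet(q, k, v, alpha, beta):
    # Per-token gated delta rule: alpha decays the whole memory, while
    # beta controls how strongly the old value at key k_t is erased and
    # the new value v_t written in its place.
    T, d = k.shape
    S = np.zeros((v.shape[1], d))   # fast-weight memory state
    out = np.zeros_like(v)
    for t in range(T):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-6)
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

With alpha = beta = 1 and orthonormal keys this reduces to an exact key-value store; the gates are what let the model forget.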
part science, part empiricism, part magic. All driven by extreme curiosity!!
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: https://t.co/0EAi2KQMMx
3
1
39
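In its simplest form, the sink mechanism from the quoted deep-dive gives the attention softmax one extra logit that belongs to no token, so the weights over real tokens can sum to less than 1 when nothing is worth attending to. The function name and the scalar parameterization below are my simplifications:

```python
import numpy as np

def attn_with_sink(scores, sink_logit=0.0):
    # Softmax over token scores plus one extra "sink" logit. The sink
    # absorbs probability mass, so the returned weights over real
    # tokens sum to less than 1.
    z = np.concatenate([scores, [sink_logit]])
    z = z - z.max()                  # standard max-subtraction for stability
    p = np.exp(z) / np.exp(z).sum()
    return p[:-1]                    # drop the sink's own weight
```

Setting the sink logit very negative recovers ordinary softmax attention.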
Model and training code for LaCT on language modeling, AR video generation, and novel view synthesis are released, along with a TTT layer implementation that supports sequence parallelism. Both object-centric and scene-level view synthesis checkpoints are released 🤓— come play!
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch
3
19
116
Compression is the heart of intelligence. From Occam to Kolmogorov—shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, token budget T & target quality 𝜖, KARL finds the smallest t≤T to reconstruct it within 𝜖 🧵
14
62
356
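The "smallest t ≤ T within 𝜖" objective in the thread is well posed because reconstruction error should not increase as more tokens are used, so the minimal budget can be found by bisection over a monotone error curve. The `error` callback below is hypothetical; how KARL actually arrives at t is described in the paper:

```python
def smallest_budget(error, T, eps):
    # Binary search for the smallest token count t <= T with
    # error(t) <= eps, assuming error is non-increasing in t.
    if error(T) > eps:
        return None          # even the full budget cannot reach eps
    lo, hi = 1, T
    while lo < hi:
        mid = (lo + hi) // 2
        if error(mid) <= eps:
            hi = mid         # mid works; try smaller budgets
        else:
            lo = mid + 1     # mid fails; need more tokens
    return lo
```

For instance, with a toy error curve error(t) = 1/t, budget T = 32, and 𝜖 = 0.1, the search settles on t = 10.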
Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust
18
95
427
🚀 Introducing UniRelight, a general-purpose relighting framework powered by video diffusion models. 🌟UniRelight jointly models the distribution of scene intrinsics and illumination, enabling high-quality relighting and intrinsic decomposition from a single image or video.
9
49
165
"Generalization means being able to solve problems that the system hasn't been prepared for." Our latest work in #RSS2025 can automatically invent neural networks as state abstractions, which help robots generalize. Check it out here: https://t.co/RkoR5MRRJg
5
26
123
Thanks to Songlin and Xinyu for hosting. Here are the recording and slides.
@tianyuanzhang99 Recording: https://t.co/VMbleDdM6f Slides:
1
3
33
Happening in 5 min
Test-time training (TTT) is an elegant framework for adapting context to model weights. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.
0
1
18
Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
29
142
871
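"Unrolling transformers with KV caching" during training means generation looks exactly like inference: each step appends its key/value to a cache and attends over everything generated so far. A toy single-head cache (names and shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

class KVCache:
    # Toy single-head KV cache: each autoregressive step appends its
    # key/value pair, then attends over the whole cache, mimicking
    # inference-time behavior inside the training rollout.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        s = K @ q / np.sqrt(len(q))   # scaled dot-product scores
        w = np.exp(s - s.max())
        w = w / w.sum()               # softmax over cached positions
        return w @ V                  # attention output for this step
```

Training on such rollouts, rather than on teacher-forced clean contexts, is what closes the train/inference gap the announcement highlights.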