Tianyuan Zhang Profile
Tianyuan Zhang

@tianyuanzhang99

Followers: 1K · Following: 2K · Media: 19 · Statuses: 167

PhDing at @MIT, towards general intelligence and lifelong machine learning. M.S. from CMU, B.S. from PKU.

Boston
Joined September 2017
@tianyuanzhang99
Tianyuan Zhang
1 month
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch
6
75
397
@tianyuanzhang99
Tianyuan Zhang
11 days
I feel we need both. Compression and sparsity are orthogonal, sometimes even the opposite.
@wenhaocha1
Wenhao Chai
11 days
Take a look at this blog introducing sparse attention and its implementation, which I think is currently more promising than compression-based methods for long-context modeling.
2
0
26
@tianyuanzhang99
Tianyuan Zhang
17 days
RT @Haoyu_Xiong_: Your bimanual manipulators might need a Robot Neck 🤖🦒. Introducing Vision in Action: Learning Active Perception from Huma….
0
84
0
@tianyuanzhang99
Tianyuan Zhang
17 days
RT @Kai__He: 🚀 Introducing UniRelight, a general-purpose relighting framework powered by video diffusion models. 🌟UniRelight jointly model….
0
42
0
@tianyuanzhang99
Tianyuan Zhang
18 days
RT @Bw_Li1024: "Generalization means being able to solve problems that the system hasn't been prepared for.". Our latest work in #RSS2025 c….
0
25
0
@tianyuanzhang99
Tianyuan Zhang
25 days
Just arrived in Nashville for CVPR! Looking forward to chatting about any topics!
0
0
13
@tianyuanzhang99
Tianyuan Zhang
26 days
Thanks Songlin and Xinyu for hosting. Here are the recording and slides.
@SonglinYang4
Songlin Yang
27 days
@tianyuanzhang99 Recording: Slides:
1
3
33
@tianyuanzhang99
Tianyuan Zhang
27 days
Happening in 5 min.
@SonglinYang4
Songlin Yang
27 days
Test-time training (TTT) is an elegant framework for adapting context to model weights. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.
[Image attached]
0
0
17
@tianyuanzhang99
Tianyuan Zhang
27 days
RT @xunhuang1995: Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for t….
0
120
0
@tianyuanzhang99
Tianyuan Zhang
1 month
RT @SonglinYang4: Check out log-linear attention—our latest approach to overcoming the fundamental limitation of RNNs’ constant state size,….
0
50
0
@tianyuanzhang99
Tianyuan Zhang
1 month
RT @baifeng_shi: Finally! We just released the models and code for PS3 & VILA-HD, a vision encoder **pre-trained at 4K resolution** and the….
0
27
0
@tianyuanzhang99
Tianyuan Zhang
1 month
RT @HanGuo97: We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between? I….
0
191
0
@tianyuanzhang99
Tianyuan Zhang
1 month
Intelligence needs long-context memories! We hope this work will inspire and accelerate future research in this field. 🙏 Huge shoutout to my amazing co-authors and collaborators @HaoTan5, @Sai__Bi, @YicongHong, @KaiZhang9546, @fujun_luan, @SonglinYang4, Kalyan Sunkavalli,
0
1
16
@tianyuanzhang99
Tianyuan Zhang
1 month
📌 Autoregressive Video Generation: We test LaCT at scale, distilling a 14B-parameter video diffusion transformer (WAN-T2V) into an AR video diffusion model by replacing full attention with LaCT + SWA. (The generated videos in this thread all come from this model.) 8/9
3
1
20
@tianyuanzhang99
Tianyuan Zhang
1 month
📌 Language Models: Compared to linear memory models like GLA & DeltaNet, LaCT delivers:
🔟 5-10× larger nonlinear state
⏱ Comparable training wall-clock time
📉 Similar or better loss per token — especially at the last 2K tokens in a sequence
🔍 Similar or better retrieval
[Image attached]
1
0
19
@tianyuanzhang99
Tianyuan Zhang
1 month
📌 Novel View Synthesis aims to render images of a static scene from previously unseen viewpoints, given a set of input images. LaCT handles up to 1M tokens, outperforming 3D Gaussian Splatting with up to 128 input images at 960×536 resolution (patch size 8×8 → ~1M tokens) on DL3DV
1
1
19
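A quick sanity check of the ~1M-token figure quoted above, using only the numbers in the tweet (128 views at 960×536, non-overlapping 8×8 patches; a back-of-the-envelope sketch that ignores any extra special tokens):

```python
# Token count implied by the setup above (assumption: one token per 8x8 patch).
num_views = 128
height, width = 960, 536
patch = 8

tokens_per_view = (height // patch) * (width // patch)  # 120 * 67 = 8,040
total_tokens = num_views * tokens_per_view              # 1,029,120, i.e. ~1M

print(f"{tokens_per_view} tokens/view, {total_tokens} tokens total")
```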
@tianyuanzhang99
Tianyuan Zhang
1 month
We test LaCT on 3 diverse tasks:
🖼️ Novel View Synthesis (image sets)
📝 Language Modeling (1D sequences)
🎥 Video Diffusion (sequence of images)
Let’s look at each ⬇️ 5/9
[Image attached]
2
1
17
@tianyuanzhang99
Tianyuan Zhang
1 month
We do the opposite! 🧠 Update fast weights using extremely large chunks (2048–1M tokens). This simple idea has profound implications:
🚀 Parallelism & compute intensity → 10× FLOPs utilization
🦣 Scaling of state size → up to 40% of model params in our experiments
🛠️ Simplicity → no
[Image attached]
1
3
22
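A minimal sketch of the large-chunk update described above, assuming a plain linear fast-weight matrix and a simple reconstruction loss (the paper's memory is nonlinear and its exact update rule may differ); the point is that one update per huge chunk turns almost all of the work into dense matmuls:

```python
import torch

def large_chunk_ttt_step(W, K, V, lr=1.0):
    """One fast-weight update over an entire large chunk (illustrative only).
    W: (d, d) fast weights; K, V: (chunk_len, d) keys/values of one chunk."""
    pred = K @ W                             # read with the current fast weights
    grad = K.t() @ (pred - V) / K.shape[0]   # gradient of 0.5 * ||K @ W - V||^2
    return W - lr * grad                     # a single dense update per chunk

d, chunk_len = 256, 4096
W = torch.zeros(d, d)
K, V = torch.randn(chunk_len, d), torch.randn(chunk_len, d)
W = large_chunk_ttt_step(W, K, V)            # all work is large, GPU-friendly matmuls
```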
@tianyuanzhang99
Tianyuan Zhang
1 month
📚 TTT (Sun et al.) is a new way to design more powerful recurrent models. It proposes adapting a model’s fast weights during inference to store in-context information or learn in context. It opens a vast design space for new RNNs! But prior TTT methods suffer from low GPU
1
1
21
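For contrast with the large-chunk variant, here is a toy token-by-token fast-weight update in the spirit of prior TTT methods (a hypothetical rule, not any specific paper's); the strictly sequential rank-1 updates are why small-chunk TTT tends to leave a GPU underutilized:

```python
import torch

def per_token_ttt(K, V, lr=1.0):
    """Token-by-token fast-weight adaptation (illustrative sketch).
    Each step is a tiny matvec plus a rank-1 update, executed sequentially."""
    d = K.shape[1]
    W = torch.zeros(d, d)                    # fast weights start empty
    outs = []
    for k, v in zip(K, V):                   # strictly sequential over tokens
        outs.append(k @ W)                   # read the memory before updating it
        err = k @ W - v                      # prediction error on this token
        W = W - lr * torch.outer(k, err)     # rank-1 gradient step on 0.5*||kW - v||^2
    return torch.stack(outs), W

out, W = per_token_ttt(torch.randn(64, 128), torch.randn(64, 128))
```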
@tianyuanzhang99
Tianyuan Zhang
1 month
The core idea behind LaCT (Large-Chunk Test-Time Training) is simple:
1. Use extremely large online chunk sizes (2K–1M tokens) for TTT to ensure high GPU utilization.
2. Use window attention for local memory, and test-time training (TTT) for non-local memory!
2/9
1
3
28
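A condensed sketch of how the two ingredients above could fit together, assuming a linear fast-weight matrix as a stand-in for the paper's nonlinear memory and applying the attention window only within each chunk (the official LaCT implementation will differ in both respects):

```python
import torch
import torch.nn.functional as F

def lact_style_layer(Q, K, V, chunk=4096, window=512, lr=1.0):
    """Window attention for local memory + large-chunk fast weights for
    non-local memory (illustrative only). Q, K, V: (seq_len, d)."""
    seq_len, d = Q.shape
    W = torch.zeros(d, d)                    # fast weights carried across chunks
    outs = []
    for s in range(0, seq_len, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        nonlocal_read = q @ W                # read memory written by earlier chunks
        # Local memory: causal attention restricted to a sliding window.
        scores = (q @ k.t()) / d ** 0.5
        idx = torch.arange(q.shape[0])
        mask = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
        local = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v
        outs.append(local + nonlocal_read)
        # Non-local memory: one big-chunk TTT update (gradient of 0.5*||kW - v||^2).
        W = W - lr * k.t() @ (k @ W - v) / k.shape[0]
    return torch.cat(outs)

x = lambda: torch.randn(8192, 64)
out = lact_style_layer(x(), x(), x())        # -> (8192, 64)
```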