Tianyuan Zhang
@tianyuanzhang99
Followers
2K
Following
3K
Media
20
Statuses
184
PhDing at @MIT, towards general intelligence and lifelong machines. M.S. from CMU, B.S. from PKU.
Boston
Joined September 2017
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch
5
88
425
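The chunked update the tweet describes can be sketched in a few lines: rather than updating a fast-weight memory once per token, fit it with one gradient step per large chunk of (key, value) pairs, which keeps the work GPU-friendly. Below is a linear-memory toy in NumPy; the actual LaCT memory is nonlinear, and `lact_update`, the shapes, and the learning rate are illustrative assumptions, not the paper's code:

```python
import numpy as np

def lact_update(W, K, V, lr=0.2):
    # One large-chunk test-time-training step: fit the fast-weight
    # memory W with a single gradient step over the whole chunk of
    # (key, value) rows at once, instead of one update per token.
    pred = K @ W.T                    # (chunk, d_out): what W recalls now
    grad = (pred - V).T @ K / len(K)  # gradient of the mean squared recall error
    return W - lr * grad

def lact_read(W, Q):
    # Recall stored values for a chunk of queries Q.
    return Q @ W.T
```

Repeating `lact_update` over incoming chunks drives the recall error down; swapping the linear map for a small network gives the nonlinear memory the tweet refers to.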
Totally agree, pretraining only works when the marginal cost of data is nearly zero.
Tesla
- collects 4.3M hours of driving data
- every day
- for free
- to train a 2DoF system (steering + throttle).
- yet full autonomy remains unsolved.
Frontier robotics startups/labs
- collect or purchase 0.01M–1M hours of data
- every X months
- for millions of dollars
- to
4
6
226
The Representation Autoencoders (RAE) work by @sainingxie's team is fascinating — a brilliant demonstration that high-dimensional diffusion is indeed feasible. In our latest work on semantic encoders, we align a pretrained foundation encoder (e.g., DINOv2) as a visual tokenizer,
We found that visual foundation encoders can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.
2
22
233
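A minimal sketch of the alignment idea in these two posts: keep the foundation encoder frozen and train the tokenizer so its latents point in the same direction as the encoder's features. The cosine-distance objective and the function name below are my assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def alignment_loss(z, f):
    # Cosine-distance alignment between tokenizer latents z and frozen
    # foundation-encoder features f, both shaped (num_tokens, dim).
    # Zero when every latent points in the same direction as its target.
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    fn = f / np.linalg.norm(f, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(zn * fn, axis=1)))
```

Because the loss is scale-invariant, the tokenizer is free to choose its own latent magnitudes while inheriting the semantic directions of the frozen encoder.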
Congrats! Getting such results via a completely new route!
Our new editing model just entered the top-3 on the image editing leaderboards, ahead of GPT-Image, Qwen-Edit, and Flux-Kontext 🚀 We’re taking a very different research path than most—starting with fine-grained regional editing, and aiming toward image generation that feels as
0
0
4
We found that visual foundation encoders can be aligned to serve as tokenizers for latent diffusion models in image generation! Our new paper introduces a new tokenizer training paradigm that produces a semantically rich latent space, improving diffusion model performance🚀🚀.
7
72
528
Oh man! How did I miss this blog over the summer 😇 It derives the derivative of the Muon optimizer. Would be very interesting to try in test-time training.
https://t.co/EGIwz9VeME discusses the derivative calculation of the msign operator. If you are interested in the combination of “TTT + Muon”, like https://t.co/u9qW6lWqBH, this might be helpful to you.
0
1
20
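For context on the msign operator the thread mentions: Muon orthogonalizes its update by approximating the matrix sign, i.e. the orthogonal polar factor UVᵀ of the gradient, with a Newton–Schulz iteration. A sketch using the textbook cubic iteration (Muon itself uses a tuned quintic variant, and the step count here is a conservative guess):

```python
import numpy as np

def msign(G, steps=20):
    # Approximate the matrix sign of G, i.e. the orthogonal factor U V^T
    # of its polar decomposition, via the classic cubic Newton-Schulz
    # iteration. Normalizing by the spectral norm first puts every
    # singular value in (0, 1], where the iteration converges to 1.
    X = G / np.linalg.norm(G, ord=2)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

The result has approximately orthonormal columns, which is exactly the property the derivative calculation in the linked blog differentiates through.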
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
🚀 Introducing Qwen3-Next-80B-A3B — the FUTURE of efficient LLMs is here! 🔹 80B params, but only 3B activated per token → 10x cheaper training, 10x faster inference than Qwen3-32B (esp. at 32K+ context!) 🔹Hybrid Architecture: Gated DeltaNet + Gated Attention → best of speed &
9
54
552
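For readers new to the architecture above, the gated delta rule at the core of Gated DeltaNet can be written as a per-token recurrence: S_t = α_t (S_{t−1} − β_t (S_{t−1} k_t) k_tᵀ) + β_t v_t k_tᵀ, with output o_t = S_t q_t. The loop below is a readability sketch, not the chunked hardware-efficient kernel used in practice, and the key normalization is an assumption carried over from DeltaNet:

```python
import numpy as np

def gated_deltanet(q, k, v, alpha, beta):
    # Per-token gated delta rule: alpha decays the whole memory, while
    # beta controls how strongly the old value at key k_t is erased and
    # the new value v_t written in its place.
    T, d = k.shape
    S = np.zeros((v.shape[1], d))   # fast-weight memory state
    out = np.zeros_like(v)
    for t in range(T):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-6)
        S = alpha[t] * (S - beta[t] * np.outer(S @ kt, kt)) + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]
    return out
```

With alpha = beta = 1 and orthonormal keys this reduces to an exact key-value store; the gates are what let the model forget.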
part science, part empiricism, part magic. All driven by extreme curiosity!!
I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: https://t.co/0EAi2KQMMx
3
1
39
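In its simplest form, the sink mechanism from the quoted deep-dive gives the attention softmax one extra logit that belongs to no token, so the weights over real tokens can sum to less than 1 when nothing is worth attending to. The function name and the scalar parameterization below are my simplifications:

```python
import numpy as np

def attn_with_sink(scores, sink_logit=0.0):
    # Softmax over token scores plus one extra "sink" logit. The sink
    # absorbs probability mass, so the returned weights over real
    # tokens sum to less than 1.
    z = np.concatenate([scores, [sink_logit]])
    z = z - z.max()                  # standard max-subtraction for stability
    p = np.exp(z) / np.exp(z).sum()
    return p[:-1]                    # drop the sink's own weight
```

Setting the sink logit very negative recovers ordinary softmax attention.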
Model and training code for LaCT on language modeling, AR video generation, and novel view synthesis are released, along with a TTT layer implementation that supports sequence parallelism. Both object-centric and scene-level view synthesis checkpoints are released 🤓— come play!
Bored of linear recurrent memories (e.g., linear attention) and want a scalable, nonlinear alternative? Our new paper “Test-Time Training Done Right” proposes LaCT (Large Chunk Test-Time Training) — a highly efficient, massively scalable nonlinear memory with: 💡 Pure PyTorch
3
19
116
Compression is the heart of intelligence. From Occam to Kolmogorov—shorter programs = smarter representations. Meet KARL: Kolmogorov-Approximating Representation Learning. Given an image, token budget T & target quality 𝜖, KARL finds the smallest t≤T to reconstruct it within 𝜖 🧵
14
62
356
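The "smallest t ≤ T within 𝜖" objective in the thread is well posed because reconstruction error should not increase as more tokens are used, so the minimal budget can be found by bisection over a monotone error curve. The `error` callback below is hypothetical; how KARL actually arrives at t is described in the paper:

```python
def smallest_budget(error, T, eps):
    # Binary search for the smallest token count t <= T with
    # error(t) <= eps, assuming error is non-increasing in t.
    if error(T) > eps:
        return None          # even the full budget cannot reach eps
    lo, hi = 1, T
    while lo < hi:
        mid = (lo + hi) // 2
        if error(mid) <= eps:
            hi = mid         # mid works; try smaller budgets
        else:
            lo = mid + 1     # mid fails; need more tokens
    return lo
```

For instance, with a toy error curve error(t) = 1/t, budget T = 32, and 𝜖 = 0.1, the search settles on t = 10.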
Your bimanual manipulators might need a Robot Neck 🤖🦒 Introducing Vision in Action: Learning Active Perception from Human Demonstrations ViA learns task-specific, active perceptual strategies—such as searching, tracking, and focusing—directly from human demos, enabling robust
18
95
427
🚀 Introducing UniRelight, a general-purpose relighting framework powered by video diffusion models. 🌟UniRelight jointly models the distribution of scene intrinsics and illumination, enabling high-quality relighting and intrinsic decomposition from a single image or video.
9
49
165
"Generalization means being able to solve problems that the system hasn't been prepared for." Our latest work in #RSS2025 can automatically invent neural networks as state abstractions, which help robots generalize. Check it out here: https://t.co/RkoR5MRRJg
5
26
123
Thanks to Songlin and Xinyu for hosting. Here are the recording and slides.
@tianyuanzhang99 Recording: https://t.co/VMbleDdM6f Slides:
1
3
33
Happening in 5 min
Test-time training (TTT) is an elegant framework for adapting context to model weights. In today’s ASAP seminar (2pm Eastern Time), @tianyuanzhang99 presents Large Chunk TTT (LaCT) — a simple, efficient method combining TTT with chunked attention to unlock new opportunities.
0
1
18
Real-time video generation is finally real — without sacrificing quality. Introducing Self-Forcing, a new paradigm for training autoregressive diffusion models. The key to high quality? Simulate the inference process during training by unrolling transformers with KV caching.
29
142
871
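"Unrolling transformers with KV caching" during training means generation looks exactly like inference: each step appends its key/value to a cache and attends over everything generated so far. A toy single-head cache (names and shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

class KVCache:
    # Toy single-head KV cache: each autoregressive step appends its
    # key/value pair, then attends over the whole cache, mimicking
    # inference-time behavior inside the training rollout.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        s = K @ q / np.sqrt(len(q))   # scaled dot-product scores
        w = np.exp(s - s.max())
        w = w / w.sum()               # softmax over cached positions
        return w @ V                  # attention output for this step
```

Training on such rollouts, rather than on teacher-forced clean contexts, is what closes the train/inference gap the announcement highlights.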