Tian Jin
@tjingrant
Followers 591 · Following 340 · Media 18 · Statuses 218
PhD student @MIT_CSAIL, previously @IBMResearch, @haverfordedu.
Cambridge, Massachusetts
Joined March 2015
Introducing Learned Asynchronous Decoding w/ friends from MIT/Google! LLM responses often have chunks of tokens that are semantically independent. We train LLMs to identify and decode them in parallel, speeding up inference by 1.46x geomean (AlpacaEval) w/ only 1.3% quality loss.
4 replies · 15 reposts · 70 likes
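A minimal control-flow sketch of the idea behind this announcement (the stub functions and names are mine, not the paper's API): once the model marks spans of the response as mutually independent, they can be decoded concurrently instead of strictly left to right.

```python
# Minimal control-flow sketch of learned asynchronous decoding; names are
# illustrative and generation is mocked so the sketch runs standalone.
from concurrent.futures import ThreadPoolExecutor

def generate_until_marker(prompt: str) -> tuple[str, list[str]]:
    """Stub for an LLM call that returns a prefix plus the independent
    sub-requests it identified (real systems learn special markers for this)."""
    return "Here are three fun facts:", ["fact about A", "fact about B", "fact about C"]

def generate_span(sub_request: str) -> str:
    """Stub for decoding one independent span; in practice these spans are
    decoded in parallel as extra sequences in the same batch."""
    return f"<answer to '{sub_request}'>"

def async_decode(prompt: str) -> str:
    prefix, spans = generate_until_marker(prompt)
    # Independent spans share no dependencies, so they can be decoded
    # concurrently; wall-clock time becomes max(span) instead of sum(spans).
    with ThreadPoolExecutor(max_workers=len(spans)) as pool:
        results = list(pool.map(generate_span, spans))
    return " ".join([prefix, *results])

print(async_decode("Tell me three fun facts."))
```

The win comes from the join: decoding time for the marked spans drops from roughly the sum of their lengths to roughly the longest one.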
Explore Eigen Banana, our post-trained image-edit model with lightning-fast speed! ⚡️
Releasing open-source Eigen-Banana-Qwen-Image-Edit: 4-second ⚡ instruction-based image edits trained on Pico-Banana-400K. Super fast with high image-editing quality. Open-source LoRA for Diffusers/DiffSynth-Studio + enterprise stack (EigenTrain/Inference/Deploy). Feel free
0 replies · 2 reposts · 14 likes
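A hypothetical usage sketch for a release like this, assuming a Diffusers-compatible pipeline: the repo IDs, the pipeline class, and even the pipeline's LoRA support are my assumptions here, not the project's documented API.

```python
# Hypothetical Diffusers-style usage sketch; model and LoRA repo IDs are
# assumptions, not the project's documented identifiers.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",            # assumed base-model repo ID
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("eigen-ai/eigen-banana-qwen-image-edit")  # assumed LoRA repo ID

source = Image.open("photo.jpg")       # any local image to edit
edited = pipe(prompt="make the sky look like sunset", image=source).images[0]
edited.save("edited.jpg")
```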
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy's nanochat
2. General QAT recipe with MXFP4 forward/MXFP8 backward GEMMs
3. Transformers/vLLM integrations
1 reply · 38 reposts · 157 likes
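For readers unfamiliar with QAT, here is a minimal PyTorch sketch of the general recipe: a toy symmetric int4 fake-quantizer with a straight-through estimator stands in for the MXFP4/MXFP8 microscaling GEMMs that QuTLASS implements as real kernels.

```python
# Minimal QAT sketch: a fake-quantized linear layer with a straight-through
# estimator (STE). Toy per-tensor int4 stands in for MXFP4/MXFP8 formats.
import torch
import torch.nn as nn

def fake_quant_int4(x: torch.Tensor) -> torch.Tensor:
    scale = x.abs().amax().clamp(min=1e-8) / 7.0   # symmetric int4 range [-7, 7]
    q = (x / scale).round().clamp(-7, 7) * scale   # quantize-dequantize
    return x + (q - x).detach()                    # STE: gradients flow as identity

class QATLinear(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both activations and weights see quantization noise during training.
        return nn.functional.linear(fake_quant_int4(x), fake_quant_int4(self.weight), self.bias)

layer = QATLinear(16, 16)
loss = layer(torch.randn(4, 16)).pow(2).mean()
loss.backward()                                    # backward works thanks to the STE
```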
Asynchronous generation + text diffusion + token planning! From the awesome @tjingrant @danielmisrael
2 replies · 2 reposts · 10 likes
Diffusion 🤝 Autoregressive: fast, high-quality generation
"An hour of planning can save you 10 hours of doing." โจ๐ Planned Diffusion ๐ โจ makes a plan before parallel dLLM generation. Planned Diffusion runs 1.2-1.8ร faster than autoregressive and an order of magnitude faster than diffusion, while staying within 0.9โ5% AR quality.
0 replies · 2 reposts · 2 likes
Plan autoregressively, denoise in parallel!
"An hour of planning can save you 10 hours of doing." โจ๐ Planned Diffusion ๐ โจ makes a plan before parallel dLLM generation. Planned Diffusion runs 1.2-1.8ร faster than autoregressive and an order of magnitude faster than diffusion, while staying within 0.9โ5% AR quality.
0 replies · 2 reposts · 5 likes
Earlier this year, we introduced the idea of learned asynchronous decoding. Now we've brought it to diffusion!
"An hour of planning can save you 10 hours of doing." โจ๐ Planned Diffusion ๐ โจ makes a plan before parallel dLLM generation. Planned Diffusion runs 1.2-1.8ร faster than autoregressive and an order of magnitude faster than diffusion, while staying within 0.9โ5% AR quality.
0 replies · 3 reposts · 15 likes
"An hour of planning can save you 10 hours of doing." โจ๐ Planned Diffusion ๐ โจ makes a plan before parallel dLLM generation. Planned Diffusion runs 1.2-1.8ร faster than autoregressive and an order of magnitude faster than diffusion, while staying within 0.9โ5% AR quality.
7 replies · 46 reposts · 311 likes
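A control-flow sketch of the two-phase idea, with stub functions and illustrative names rather than the paper's implementation:

```python
# Sketch of Planned Diffusion's two phases: a short autoregressive pass
# drafts a plan tagging independent spans, then a diffusion LM denoises
# all spans in parallel. Stubs keep the sketch standalone.

def autoregressive_plan(prompt: str) -> list[str]:
    # Stub: real models emit learned control tokens that delimit the
    # independent spans of the response (and, e.g., their lengths).
    return ["<span: summary>", "<span: pros>", "<span: cons>"]

def parallel_denoise(spans: list[str]) -> list[str]:
    # Stub: all spans start fully masked and are refined together, so
    # latency scales with diffusion steps rather than total response length.
    return [f"<text generated for {s}>" for s in spans]

plan = autoregressive_plan("Compare diffusion and autoregressive LLMs.")
print(" ".join(parallel_denoise(plan)))
```

The plan is cheap because it is short; the bulk of the tokens are then produced in a fixed number of parallel denoising steps.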
Super bullish on intra-layer hybridization for LLMs. These are the reasons why.
3 replies · 154 reposts · 574 likes
NYC open-source AI infra contributors: we've launched a community research hub above Grand Central where GPUs go brrr 🔥🗽 A place to hack, benchmark, and collaborate: vLLM, SGLang, kernels, inference optimizations all welcome. Open space. Open source. Weekends too. Huge
7 replies · 10 reposts · 89 likes
Excited to share our work at Bytedance Seed! Knapsack RL: Unlocking Exploration of LLMs via Budget Allocation. Exploration in LLM training is crucial but expensive. Uniform rollout allocation is wasteful:
Easy tasks → always solved → 0 gradient
Hard tasks →
13 replies · 102 reposts · 642 likes
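A toy sketch of the budget-allocation intuition, based on my reading of the abstract rather than the paper's actual algorithm: in GRPO-style training, a task whose rollouts all pass or all fail yields zero advantage, so rollouts are most valuable where outcomes are likely to be mixed.

```python
# Greedy knapsack-style heuristic (illustrative, not the paper's method):
# spend a fixed rollout budget where the marginal chance of a *mixed*
# success/failure group, and hence a nonzero gradient, is largest.
import heapq

def mixed_prob(p: float, n: int) -> float:
    """Probability that n rollouts on a task with success rate p contain
    at least one success and one failure (i.e., a useful gradient)."""
    return 0.0 if n == 0 else 1.0 - p**n - (1.0 - p) ** n

def allocate(success_rates: list[float], budget: int) -> list[int]:
    alloc = [0] * len(success_rates)
    # Max-heap on marginal gain of giving a task one more rollout.
    heap = [(-(mixed_prob(p, 1) - mixed_prob(p, 0)), i) for i, p in enumerate(success_rates)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        p, n = success_rates[i], alloc[i]
        heapq.heappush(heap, (-(mixed_prob(p, n + 1) - mixed_prob(p, n)), i))
    return alloc

# Easy (p=0.95), medium (p=0.5), and hard (p=0.05) tasks, 12 rollouts total:
print(allocate([0.95, 0.5, 0.05], budget=12))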
Introducing LLM.Q: quantized LLM training in pure CUDA/C++! With LLM.Q, you can train your own LLM with natively quantized matmuls on consumer GPUs and single workstations. No datacenter required. Inspired by @karpathy's llm.c, but natively quantized.
3 replies · 16 reposts · 141 likes
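A conceptual sketch of what a natively quantized matmul does, simplified to per-tensor int8 in NumPy (LLM.Q itself ships quantized CUDA/C++ kernels, not Python): quantize the inputs, multiply in integer space, and rescale once at the end.

```python
# Per-tensor int8 quantized matmul sketch: integer accumulation with a
# single float rescale, the core trick behind quantized training GEMMs.
import numpy as np

def quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def quantized_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    # int32 accumulation avoids overflow; one float rescale at the end.
    return qa.astype(np.int32) @ qb.astype(np.int32) * (sa * sb)

a, b = np.random.randn(8, 16), np.random.randn(16, 4)
print(np.abs(quantized_matmul(a, b) - a @ b).max())  # small quantization error
```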
Adaptive Parallel Decoding (APD) has been accepted as a spotlight paper at @NeurIPSConf! I thank my collaborators, reviewers, and program organizers for this honor. A thread for those interested 🧵 (1/n)
11 replies · 23 reposts · 170 likes
Congrats Xinyu!
Excited to share that #Multiverse has been accepted to #NeurIPS 2025! Couldn't have done it without such incredible collaborators. Thank you!!
0 replies · 0 reposts · 1 like
Excited to share what friends and I have been working on at @Standard_Kernel. We've raised from General Catalyst (@generalcatalyst), Felicis (@felicis), and a group of exceptional angels. We have some great H100 BF16 kernels in pure CUDA+PTX, featuring:
- Matmul 102%-105% perf
52 replies · 92 reposts · 993 likes
Excited to announce QuTLASS v0.1.0! QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS. The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!) [1/N]
3 replies · 35 reposts · 220 likes
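A toy illustration of the microscaling idea: small contiguous blocks of values each share one scale factor, and elements snap to a tiny FP4 grid. Real NVFP4 pairs an E2M1 element format with per-block FP8 scales; this sketch shows only the blockwise-scale mechanics.

```python
# Blockwise (microscaling) quantization sketch: each block of 16 values
# shares one scale, and values snap to the E2M1 (FP4) magnitude grid.
import numpy as np

FP4_GRID = np.array([0, .5, 1, 1.5, 2, 3, 4, 6])      # E2M1 positive magnitudes
LEVELS = np.concatenate([-FP4_GRID[::-1], FP4_GRID])   # symmetric code set

def microscale_quantize(x: np.ndarray, block: int = 16) -> np.ndarray:
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        chunk = x[i:i + block]
        scale = np.abs(chunk).max() / 6.0 + 1e-12       # map block max to top code
        # Snap each scaled value to the nearest representable FP4 level.
        idx = np.abs(chunk[:, None] / scale - LEVELS[None, :]).argmin(axis=1)
        out[i:i + block] = LEVELS[idx] * scale
    return out

x = np.random.randn(64).astype(np.float32)
print(np.abs(microscale_quantize(x) - x).mean())        # blockwise quantization error
```

Sharing one scale per small block keeps the 4-bit elements accurate on tensors whose dynamic range varies locally, which is the point of microscaling formats.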