Daniel Israel (@danielmisrael)
1K Followers · 196 Following · 17 Media · 63 Statuses
PhD student @UCLA working on faster LLM inference algorithms.
Joined October 2011
Adaptive Parallel Decoding (APD) has been accepted as a spotlight paper at @NeurIPSConf! I thank my collaborators, reviewers, and program organizers for this honor. A thread for those interested 🧵 (1/n)
"An hour of planning can save you 10 hours of doing." β¨π Planned Diffusion π β¨ makes a plan before parallel dLLM generation. Planned Diffusion runs 1.2-1.8Γ faster than autoregressive and an order of magnitude faster than diffusion, while staying within 0.9β5% AR quality.
7
45
308
Special thanks to all of my collaborators! @tjingrant @ellieyhc @guyvdb @adityagrover_ @suvinay @mcarbin Website: https://t.co/P78FEsWaSl Paper: https://t.co/J8EfYNKseY Github: github.com/planned-diffusion/planned-diffusion
Adjusting the number of diffusion steps w.r.t. the planned output length produces a smooth trade-off between quality and speed. More steps → higher quality, more latency; fewer steps → faster, lower quality. Planned Diffusion provides inference-time knobs to easily configure the tradeoff. (5/6)
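A minimal sketch of that knob (my own illustration, not code from the paper): with a planned span of N masked tokens and S denoising steps, roughly N/S tokens are committed per forward pass, so fewer steps buy more parallelism per step at the cost of quality.

```python
# Illustrative only: the "diffusion steps" knob for one planned span.
# span_len and num_steps are hypothetical names, not the paper's API.
def tokens_per_step(span_len: int, num_steps: int) -> float:
    """Average number of tokens committed per forward pass."""
    return span_len / num_steps

for steps in (4, 8, 16, 32):
    print(f"{steps:>2} steps -> {tokens_per_step(64, steps):5.1f} tokens/step")
```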
We evaluated Planned Diffusion against decoding with AR/diffusion models on AlpacaEval. Planned Diffusion is faster than AR and diffusion, even with aggressively parallel dLLM configurations (Fast-dLLM), while maintaining comparable quality. (4/6)
We trained on an annotated dataset with custom control tags that describe the plan for each example and used a custom attention mask for AR/diffusion hybrid models. During inference, our interpreter parses these control tags and alternates between AR and diffusion modes. (3/6)
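A rough sketch of the interpreter idea under an assumed tag syntax (the paper's actual control tags may differ): the AR model first emits a short plan announcing the spans to generate, and the interpreter parses it so each span can then be denoised in parallel by the diffusion mode.

```python
import re

# Hypothetical control-tag format, e.g. a plan emitted autoregressively as:
#   "<plan><span len=12/><span len=30/></plan>"
PLAN_RE = re.compile(r"<span len=(\d+)/>")

def parse_plan(plan_text: str) -> list[int]:
    """Extract planned span lengths so each span can be filled in
    parallel by the diffusion half of the hybrid model."""
    return [int(n) for n in PLAN_RE.findall(plan_text)]

print(parse_plan("<plan><span len=12/><span len=30/></plan>"))  # [12, 30]
```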
✨ Planned Diffusion ✨ is a single hybrid model alternating between autoregressive planning and diffusion denoising for faster text generation. On AlpacaEval, planned diffusion achieved 1.27-1.81x speedup over autoregressive generation with only 0.87-5.4% quality drop!
"An hour of planning can save you 10 hours of doing." β¨π Planned Diffusion π β¨ makes a plan before parallel dLLM generation. Planned Diffusion runs 1.2-1.8Γ faster than autoregressive and an order of magnitude faster than diffusion, while staying within 0.9β5% AR quality.
If you are interested in KV cache compression or LLM efficiency, I think this is a must-read. Your KV eviction policy is not as robust as you think, but it can be fixed in a simple way. @itisalex3 was great to work with, and I highly recommend him to any potential PhD advisor.
What happens when we compress the KV cache of prompts with multiple instructions? Existing compression methods can lead to some instructions being ignored. We propose simple changes to KV cache eviction that fix this problem, alongside other pitfalls to be aware of.
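To make the failure mode concrete, here is a generic score-based eviction sketch (my own toy, not the method from the paper): keeping only the keys with the highest accumulated attention can starve earlier instructions whenever recent tokens dominate the attention mass.

```python
import numpy as np

def evict_topk(attn: np.ndarray, keep: int) -> np.ndarray:
    """attn: (num_queries, num_keys) attention weights.
    Return indices of the `keep` keys with the highest total attention;
    everything else would be dropped from the KV cache."""
    totals = attn.sum(axis=0)
    return np.sort(np.argsort(totals)[-keep:])

rng = np.random.default_rng(0)
attn = rng.random((8, 16))
attn[:, 12:] += 1.0              # queries attend mostly to the most recent keys
print(evict_topk(attn, keep=6))  # mostly recent indices survive eviction
```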
(1/n) We are excited to announce LaViDa-O, a state-of-the-art unified diffusion LM for image understanding, generation, and editing. Building on our NeurIPS Spotlight submission LaViDa, LaViDa-O offers up to a 6.8x speedup compared with AR models, with high output quality.
Special thanks to my advisors @guyvdb and @adityagrover_. Also huge thanks to @OliverBroadrick, who played a crucial role in brainstorming. (7/n) Github: https://t.co/AFZ1GNTXIH Paper: arxiv.org
We see massive speedups with APD on benchmarks and very little performance degradation. APD offers three parameters (R, W, M) to trade off and capture the Pareto frontier. We can get as much as a 10x increase in throughput over vanilla diffusion generation. (6/n)
APD uses a small autoregressive model to capture the joint dependencies that a diffusion model cannot in a single forward pass. The algorithm bears some similarities to drafting and verifying in speculative decoding. Please see the full paper for more details. (5/n)
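For intuition, here is a simplified draft-and-verify sketch (greedy agreement with the small AR model; this is not the exact APD acceptance rule, which is in the paper): the diffusion model proposes a block of tokens in parallel, and the AR model keeps the longest prefix it agrees with.

```python
from typing import Callable, List

def accept_prefix(proposed: List[int],
                  ar_next: Callable[[List[int]], int],
                  context: List[int]) -> List[int]:
    """Keep the longest prefix of the parallel proposal that the small AR
    model would also have produced token by token; the rest is discarded
    and regenerated in the next round."""
    accepted: List[int] = []
    for tok in proposed:
        if ar_next(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted

# Stand-in "AR model" that always predicts the previous token plus one.
ar_next = lambda seq: seq[-1] + 1
print(accept_prefix([4, 5, 9, 7], ar_next, context=[1, 2, 3]))  # [4, 5]
```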
Empirically, increasing the number of tokens sampled in parallel causes performance on downstream tasks to go down. This can be seen in the open-source dLLMs Dream and LLaDA. While the tradeoff between speed and quality will always exist, it doesn't need to be so clear-cut. (4/n)
dLLMs can sample in parallel, but sampling from the conditional marginals independently will cause fundamental issues when trying to model the joint distribution. See the below example (from the paper Discrete Copula Diffusion). (3/n)
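A toy numeric illustration of the issue (my own example, not the figure from Discrete Copula Diffusion): the joint puts all its mass on two coherent token pairs, the per-position marginals are uniform, and sampling the marginals independently produces an incoherent pair about half the time.

```python
import random

# Only these two-token continuations are coherent under the joint.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# Per-position marginals implied by that joint.
marginal_1 = {"New": 0.5, "Los": 0.5}
marginal_2 = {"York": 0.5, "Angeles": 0.5}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Unmasking both positions in one step samples each marginal independently,
# so "New Angeles" / "Los York" appear roughly half the time.
trials = 10_000
coherent = sum((sample(marginal_1), sample(marginal_2)) in joint for _ in range(trials))
print(f"fraction of coherent pairs: {coherent / trials:.2f}")  # ~0.50
```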
This is a paper about speeding up diffusion language models (dLLMs). Here's a quick summary of the differences between diffusion and autoregressive LLMs. (2/n)
Thanks AK for sharing our work! Unlike autoregressive LLMs, diffusion LLMs can be conditioned on future reasoning hints during generation through inpainting 🧩, enabling guided exploration toward correct solutions. We show that applying inpainting-guided exploration in RL
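A bare-bones sketch of the inpainting idea (toy stand-ins for the model; not the paper's implementation): hint tokens are clamped at chosen future positions and the denoiser only ever fills the remaining masked positions, so every sample stays consistent with the hint.

```python
import random

MASK = "<mask>"

def inpaint(length, hints, steps=4, vocab=("a", "b", "c")):
    """Toy masked-diffusion loop: `hints` maps positions to clamped tokens.
    Random choices stand in for the model's predictions; only masked,
    non-hint positions ever get filled."""
    seq = [hints.get(i, MASK) for i in range(length)]
    for step in range(steps):
        for i, tok in enumerate(seq):
            if tok == MASK and random.random() < (step + 1) / steps:
                seq[i] = random.choice(vocab)
    # Fill any still-masked positions in a final pass.
    return [tok if tok != MASK else random.choice(vocab) for tok in seq]

print(inpaint(8, hints={5: "HINT"}))  # position 5 always carries the hint
```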
Introducing PhysiX: One of the first large-scale foundation models for physics simulations! PhysiX is a 4.5B-parameter model that unifies a wide range of physical systems, from fluid dynamics to reaction-diffusion, outperforming specialized, state-of-the-art models.
(1/6) Our work Reflect-DiT was accepted to #ICCV2025! Reflect-DiT allows the model to reflect on its past generations and textual feedback to self-correct and improve, extending reasoning to text-to-image generation.
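A high-level sketch of that reflect-and-retry loop with hypothetical callables (`generate` and `critique` are placeholders, not the Reflect-DiT API): each round conditions the next generation on earlier attempts plus textual feedback about them.

```python
def reflect_generate(prompt, generate, critique, rounds=3):
    """generate(prompt, history) -> image; critique(prompt, image) -> feedback.
    Both are hypothetical stand-ins for the model and the feedback source."""
    history = []                       # list of (image, feedback) pairs
    image = generate(prompt, history)
    for _ in range(rounds - 1):
        feedback = critique(prompt, image)
        history.append((image, feedback))
        image = generate(prompt, history)
    return image

# Trivial stubs just to show the control flow.
print(reflect_generate("a red cube",
                       generate=lambda p, h: f"image#{len(h)}",
                       critique=lambda p, img: "make it redder"))
```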