@armandjoulin
Armand Joulin
7 months
Using parallel decoding to speed up LLM inference: ✅ no need for a second model ✅ no finetuning ✅ negligible memory overhead
@giomonea
Giovanni Monea
7 months
🎉 Unveiling PaSS: Parallel Speculative Sampling 🚀 Need faster LLM decoding? 🔗 Check out our new 1-model speculative sampling algorithm based on parallel decoding with look-ahead tokens: 🤝 In collaboration with @armandjoulin and @EXGRV
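For readers wanting the mechanics behind the announcement: below is a minimal greedy-variant sketch of one-model speculative decoding with look-ahead tokens. Everything in it (toy_model, pass_step, VOCAB, LOOKAHEAD) is an illustrative assumption, not the PaSS reference code.

```python
# A toy sketch of single-model speculative decoding with look-ahead
# tokens (greedy variant). All names here are hypothetical stand-ins,
# not the actual PaSS implementation.

VOCAB = "abcdefgh"
LOOKAHEAD = "<la>"  # placeholder; PaSS trains embeddings for such tokens

def toy_model(tokens):
    # Stand-in for one LLM forward pass: one next-token prediction per
    # input position (a deterministic toy rule, not a real model).
    return [VOCAB[(ord(t[0]) * 7 + i) % len(VOCAB)]
            for i, t in enumerate(tokens)]

def pass_step(prompt, k=3):
    # Drafting: a single forward pass over prompt + k look-ahead tokens
    # yields k+1 candidate continuations at once.
    drafted = toy_model(prompt + [LOOKAHEAD] * k)[-(k + 1):]

    # Verification: one forward pass over prompt + drafted tokens gives
    # the model's true prediction at every drafted position in parallel.
    verified = toy_model(prompt + drafted[:-1])[-(k + 1):]

    # Accept the longest agreeing prefix; the first mismatch is replaced
    # by the verified token, so each step still emits at least one token
    # that matches ordinary (non-speculative) greedy decoding.
    out = []
    for d, v in zip(drafted, verified):
        out.append(v)
        if d != v:
            break
    return prompt + out

seq = list("hello")
for _ in range(4):
    seq = pass_step(seq)
print("".join(seq))
```

Per the thread's claims, the appeal of the single-model setup is that drafting and verification reuse the same weights, so the only extra state is the handful of look-ahead token embeddings.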