@armandjoulin
Armand Joulin
7 months
Using parallel decoding to speed up LLM inference: ✅ no need for a second model ✅ no finetuning ✅ negligible memory overhead
@giomonea
Giovanni Monea
7 months
🎉 Unveiling PaSS: Parallel Speculative Sampling 🚀 Need faster LLM decoding? 🔗 Check out our new 1-model speculative sampling algorithm based on parallel decoding with look-ahead tokens: 🤝 In collaboration with @armandjoulin and @EXGRV
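For readers wanting the mechanics behind the announcement: below is a minimal greedy-variant sketch of one-model speculative decoding with look-ahead tokens. Everything in it (toy_model, pass_step, VOCAB, LOOKAHEAD) is an illustrative assumption, not the PaSS reference code.

```python
# A toy sketch of single-model speculative decoding with look-ahead
# tokens (greedy variant). All names here are hypothetical stand-ins,
# not the actual PaSS implementation.

VOCAB = "abcdefgh"
LOOKAHEAD = "<la>"  # placeholder; PaSS trains embeddings for such tokens

def toy_model(tokens):
    # Stand-in for one LLM forward pass: one next-token prediction per
    # input position (a deterministic toy rule, not a real model).
    return [VOCAB[(ord(t[0]) * 7 + i) % len(VOCAB)]
            for i, t in enumerate(tokens)]

def pass_step(prompt, k=3):
    # Drafting: a single forward pass over prompt + k look-ahead tokens
    # yields k+1 candidate continuations at once.
    drafted = toy_model(prompt + [LOOKAHEAD] * k)[-(k + 1):]

    # Verification: one forward pass over prompt + drafted tokens gives
    # the model's true prediction at every drafted position in parallel.
    verified = toy_model(prompt + drafted[:-1])[-(k + 1):]

    # Accept the longest agreeing prefix; the first mismatch is replaced
    # by the verified token, so each step still emits at least one token
    # that matches ordinary (non-speculative) greedy decoding.
    out = []
    for d, v in zip(drafted, verified):
        out.append(v)
        if d != v:
            break
    return prompt + out

seq = list("hello")
for _ in range(4):
    seq = pass_step(seq)
print("".join(seq))
```

Per the thread's claims, the appeal of the single-model setup is that drafting and verification reuse the same weights, so the only extra state is the handful of look-ahead token embeddings.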