@neilhoulsby
Neil Houlsby
9 months
(V-)MoEs working well at the small end of the scale spectrum. Often we actually saw the largest gains with small models, so it's a promising direction. Also, the authors seem to have per-image routing working well, which is nice.
@_akhaliq
AK
9 months
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts

paper page:

Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a…
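To make the mechanism in the tweet concrete, here is a minimal sketch of a sparse MoE MLP block with per-image top-k routing, the idea Neil's reply highlights: one routing decision per image, so only a subset of experts is activated for each input. This is an illustrative sketch under assumed hyperparameters (dim, hidden_dim, num_experts, k) and a made-up class name, not the implementation from the Mobile V-MoEs paper.

```python
# Minimal sketch of a sparse MoE MLP block with per-image top-k routing.
# All hyperparameters and names here are illustrative assumptions, not
# values taken from the Mobile V-MoEs paper.
import torch
import torch.nn as nn


class PerImageMoE(nn.Module):
    def __init__(self, dim=192, hidden_dim=384, num_experts=4, k=1):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)  # produces routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        # Per-image routing: a single routing decision per image, based on the
        # mean token representation, so every token of an image goes to the
        # same expert(s). Per-token routing would instead route each token
        # independently.
        logits = self.router(x.mean(dim=1))                     # (batch, num_experts)
        weights, idx = torch.topk(logits.softmax(-1), self.k)   # (batch, k)
        out = torch.zeros_like(x)
        for j in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, j] == e                           # images routed to expert e
                if mask.any():
                    # Only the selected experts run, so compute scales with k,
                    # not with the total number of experts (hence "sparse").
                    out[mask] += weights[mask, j, None, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    moe = PerImageMoE()
    tokens = torch.randn(8, 196, 192)  # 8 images, 196 patch tokens each
    print(moe(tokens).shape)           # torch.Size([8, 196, 192])
```

With k much smaller than num_experts, total parameters grow with the number of experts while per-image compute stays roughly constant, which is the "decouple model size from inference efficiency" point in the abstract.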

Replies

@MrCatid
catid (e/acc)
9 months
@neilhoulsby @_akhaliq Does a 5% gain seem worth the incredible amount of additional complexity?