@kellerjordan0
Random features are still quite powerful: you can get near-optimal accuracy on ResNet-18/CIFAR-10 (95%) by training only the first few layers + the last layer, roughly 15% of the weights. In this setting the gradients are only large for the first and last layers.
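A rough PyTorch sketch of that setup (which layers count as “the first few” is my guess; the script prints the actual trainable fraction):

```python
import torch
import torchvision

# Build a ResNet-18 for CIFAR-10 and freeze everything by default.
model = torchvision.models.resnet18(num_classes=10)
for p in model.parameters():
    p.requires_grad = False

# Unfreeze an early prefix plus the classifier head.
# (Exactly which prefix to train is an assumption.)
for module in (model.conv1, model.bn1, model.layer1, model.fc):
    for p in module.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable / total:.1%} of the weights")

# Only hand the unfrozen parameters to the optimizer.
opt = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.1, momentum=0.9, weight_decay=5e-4,
)
```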
I’ll be at
#NeurIPS2023
presenting two works at UniReps! I’ll be there all week, so I’m looking to meet up to find collaborators, PhD positions, or just talk about new research. Feel free to reach out.
Links to the papers below ⬇️
@EvMill
Why not just zero/mask off the output if a head is deemed “unconfident”? This would encourage expert attention heads that only activate when needed, similar to how ReLU creates modularity in MLPs. Softmax would still be needed for confident heads in this case, though.
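One hedged sketch of what that gate could look like, taking a head’s confidence as its peak attention weight averaged over queries (both the measure and the threshold are my assumptions):

```python
import torch

def gate_unconfident_heads(attn_weights, head_out, tau=0.5):
    """Zero a head's output when it is 'unconfident'.

    attn_weights: (batch, heads, q_len, k_len) post-softmax weights
    head_out:     (batch, heads, q_len, d_head) per-head outputs
    tau:          confidence threshold (hypothetical hyperparameter)
    """
    # Confidence = each head's max attention weight, averaged over queries.
    conf = attn_weights.max(dim=-1).values.mean(dim=-1)   # (batch, heads)
    gate = (conf > tau).to(head_out.dtype)[..., None, None]
    return head_out * gate                                # ReLU-style masking
```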
@finbarrtimbers
Always wondered this. A 14B-param model still has lots of activations.
In the pruning literature, removing individual weights works significantly better than removing whole neurons, because keeping a large representation matters more than the exact transformation the weights compute.
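A quick illustration of the two regimes with PyTorch’s built-in pruning utilities (the 50% amount is arbitrary):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer_a = nn.Linear(512, 512)
layer_b = nn.Linear(512, 512)

# Unstructured: drop the 50% of individual weights with smallest |w|;
# all 512 output features survive, so the representation is preserved.
prune.l1_unstructured(layer_a, name="weight", amount=0.5)

# Structured: drop 50% of whole rows (output neurons) by L2 norm;
# this shrinks the representation itself, which tends to hurt more.
prune.ln_structured(layer_b, name="weight", amount=0.5, n=2, dim=0)
```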
@0xcd16
@vsbuffalo
entirely research focused + 10 min walk to the beach + pretty big school + lack of rowdy students
All of it makes for a great research school; it only lacks the name recognition/selectivity that the other schools have.
@elan_learns
@EvMill
Yes, attention heads need modularity/to become experts in different features. Maybe there needs to be some ReLU-style component paired with softmax to disable non-confident activations. Dynamic sparsity is essential.
Gradual Fusion Transformer (GraFT) advances re-identification (ReID) in computer vision with fusion tokens. It captures features efficiently, surpasses benchmarks, and is optimized for the size-performance trade-off using neural pruning.
Just waiting for someone to open-source a foundation-model-sized supernet for LLMs. The pretraining cost is massive, yet academics could then cheaply sample the search space for their own use cases.
@finbarrtimbers
@cosminnegruseri
Look for N:M, mixed, or semi-structured sparsity; those are all names I’ve seen for this kind of sparsity. A100s can do 2:4 sparsity well, but most GPUs can’t, so people tend to stick to vanilla structured pruning instead.
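A minimal reference implementation of the 2:4 constraint, just to show the pattern (not the accelerated kernel):

```python
import torch

def mask_2_4(w):
    """In every contiguous group of 4 weights along the last dim,
    keep the 2 largest magnitudes and zero the other 2."""
    groups = w.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices   # 2 survivors per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(w.shape)

w = torch.randn(8, 16)
print(mask_2_4(w))  # exactly 2 nonzeros in every group of 4
```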
@kellerjordan0
If you keep track of the gradients during training via a running sum of their magnitudes, then with that post-training info you can retrain the same initialization with far fewer gradient updates (freeze the weights with the lowest accumulated movement).
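A toy sketch of the idea (the model, the single training step, and the 85% freeze ratio are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
grad_sums = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

# During training, after each loss.backward(), accumulate |grad|.
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
nn.functional.cross_entropy(model(x), y).backward()
for n, p in model.named_parameters():
    grad_sums[n] += p.grad.abs()

# Post-training: mark the weights that moved least for freezing.
flat = torch.cat([g.flatten() for g in grad_sums.values()])
threshold = flat.quantile(0.85)        # freeze ~85% (ratio is a guess)
freeze = {n: g <= threshold for n, g in grad_sums.items()}

# On the rerun from the same init, zero their grads after each backward().
for n, p in model.named_parameters():
    p.grad[freeze[n]] = 0.0
```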
To find these subnetworks, we leverage distilled data during the retraining stage of IMP (iterative magnitude pruning) to take advantage of the compressed representations. 5/n
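A hedged sketch of what that loop could look like; `train_epoch`, `distilled_loader`, the Linear-only pruning, and the 20% rate are stand-ins, not the paper’s actual pipeline, and rewinding to the original init is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def train_epoch(model, loader, lr=1e-2):
    # Hypothetical stand-in for the retraining stage.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()

def imp_with_distilled_data(model, distilled_loader, rounds=5, rate=0.2):
    for _ in range(rounds):
        # Globally prune 20% of the remaining weights by magnitude...
        params = [(m, "weight") for m in model.modules()
                  if isinstance(m, nn.Linear)]
        prune.global_unstructured(
            params, pruning_method=prune.L1Unstructured, amount=rate)
        # ...then retrain the surviving subnetwork on the (tiny) distilled set.
        train_epoch(model, distilled_loader)
    return model
```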
thanks!
@hadesinwinter
“I am sorry to inform you that, after significant consideration, I have to reject this rejection. Best of luck with your other applicants; see you in the fall.”
@kellerjordan0
The final blocks (besides the FC layer) have the lowest gradient movement, whereas the middle ones can still be important. I can send checkpoints later today.
For now, I have a high-performing model with 90% of the weights frozen; here’s the distribution of frozen weights 👇