Randall Balestriero

@randall_balestr

Followers: 4K · Following: 299 · Media: 185 · Statuses: 571

AI Researcher: from theory to practice (and back). Postdoc @MetaAI with @ylecun. PhD @RiceUniversity with @rbaraniuk. Masters @ENS_Ulm, @Paris_Sorbonne.

USA
Joined April 2020
@randall_balestr
Randall Balestriero
2 months
Impressed by DINOv2 perf. but don't want to spend too much $$$ on compute and wait for days to pretrain on your own data? Say no more! Data augmentation curriculum speeds up SSL pretraining (as it did for generative and supervised learning) -> FastDINOv2!
4
32
190
@randall_balestr
Randall Balestriero
2 hours
In case a full day of splines isn't enough to make you book your trip to JMM26, perhaps a 90-minute session with @ylecun (+ TBD speakers) on self-supervised learning/world models WITH math will close the deal! No matter your background, come in numbers to discuss research together!
@randall_balestr
Randall Balestriero
2 days
Interested in splines for AI theory (generalization, Grokking, explainability, generative modeling, ...)? Wait no more! We are organizing a dedicated session at JMM26 (the largest math conference): consider submitting your abstract! Deadline in 2 weeks!
0
0
5
@randall_balestr
Randall Balestriero
2 days
But this makes me wonder: if different MAE hparams make you learn different features about your input, can there exist a more universal reconstruction-based pretraining solution? Huge congrats to the MVPs @Abisulco @RahulRam3sh and Pratik from UPenn for making us wonder!
0
0
6
@randall_balestr
Randall Balestriero
2 days
While our insights come from theoretical analysis of simplified MAEs, the findings transfer to nonlinear ViT MAEs, opening new ways to select their hyper-parameters if you know a little bit about your task and pretraining data distribution!
1
0
6
@randall_balestr
Randall Balestriero
2 days
Learning by input-space reconstruction is often inefficient and hard to get right (compared to joint-embedding). While previous theory explains why in the linear/kernel setting, we now take a deep dive into MAEs specifically! Now on arXiv + #ICCV2025!
6
15
82
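For readers less familiar with the reconstruction-based pretraining discussed above, here is a minimal sketch of a masked-autoencoder-style objective. The linear encoder/decoder (echoing the simplified setting the theory analyzes), the patch dimension, and the 75% masking ratio are illustrative assumptions, not the architecture or hyper-parameters from the paper.

```python
import torch
import torch.nn as nn

# Minimal masked-autoencoder-style reconstruction objective (illustrative).
# The linear encoder/decoder, patch dimension, and masking ratio are
# assumptions for this sketch, not the paper's actual setup.
class TinyMAE(nn.Module):
    def __init__(self, patch_dim=768, latent_dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Linear(patch_dim, latent_dim)
        self.decoder = nn.Linear(latent_dim, patch_dim)

    def forward(self, patches):  # patches: (batch, num_patches, patch_dim)
        B, N, _ = patches.shape
        num_masked = int(self.mask_ratio * N)
        # Randomly select patches to hide from the encoder.
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        mask = torch.zeros(B, N, device=patches.device)
        mask.scatter_(1, idx[:, :num_masked], 1.0)
        mask = mask.bool()
        visible = patches * (~mask).unsqueeze(-1).float()
        recon = self.decoder(self.encoder(visible))
        # Reconstruction error is measured only on the masked patches.
        return ((recon - patches) ** 2).mean(dim=-1)[mask].mean()
```

A real ViT MAE would encode only the visible tokens with a transformer; the masking ratio above is exactly the kind of hyper-parameter knob the analysis helps you pick.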
@randall_balestr
Randall Balestriero
15 days
RT @NYUDataScience: Congratulations to CDS PhD Student @vlad_is_ai, Courant PhD Student Kevin Zhang, CDS Faculty Fellow @timrudner, CDS Pro….
0
7
0
@randall_balestr
Randall Balestriero
2 months
RT @rpatrik96: I am heading to @icmlconf to present our position paper with @randall_balestr @klindt_david @wielandbr on what we believe ar….
0
7
0
@randall_balestr
Randall Balestriero
2 months
Huge congrats to the amazing @JiaqiZhang82804, Juntuo Wang, Zhixin Sun and John Zou! If you are interested in follow-ups for other modalities and/or CLIP, please DM me or shoot me an email!
1
0
8
@randall_balestr
Randall Balestriero
2 months
Beyond computational benefits, we strongly believe that data-augmentation curricula are a **totally unexplored area in SSL research** that could lead to breakthroughs in robustness, transfer performance, and reduced learning of spurious correlations!
2
0
10
@randall_balestr
Randall Balestriero
2 months
Low-to-high resolution is the main driver of the pretraining speedup. This type of strategy has been used forever in other settings (GANs, supervised learning, ...) and was also tried on SimCLR. But we find that beyond speedups, it also improves robustness!
1
0
7
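A rough sketch of the low-to-high resolution curriculum described above. The epoch boundaries, resolutions, and the `dataset_factory` / `train_one_epoch` helpers are placeholders for illustration, not the FastDINOv2 recipe.

```python
import torchvision.transforms as T

# Rough sketch of a low-to-high resolution curriculum for SSL pretraining.
# Epoch boundaries, resolutions, and the helper callables are placeholders,
# not the actual FastDINOv2 schedule.
def resolution_for_epoch(epoch, total_epochs):
    # Spend the early phase at low resolution, then ramp up.
    if epoch < 0.5 * total_epochs:
        return 112
    if epoch < 0.8 * total_epochs:
        return 160
    return 224

def make_augmentation(resolution):
    return T.Compose([
        T.RandomResizedCrop(resolution, scale=(0.3, 1.0)),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
    ])

def pretrain(model, dataset_factory, train_one_epoch, total_epochs=100):
    for epoch in range(total_epochs):
        res = resolution_for_epoch(epoch, total_epochs)
        loader = dataset_factory(make_augmentation(res))  # rebuild loader at new res
        train_one_epoch(model, loader)                    # user-supplied SSL step
```

The early low-resolution epochs process far more images per GPU-hour, which is where most of the wall-clock savings come from.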
@randall_balestr
Randall Balestriero
2 months
RT @pszwnzl: Our paper "Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations" has been accepted to @ICCVConf….
0
23
0
@randall_balestr
Randall Balestriero
2 months
We also have theoretical justifications of CT: in particular, CT performs a Sobolev-space projection of your DN! This is only a first step towards provable methods improving SOTA. Huge congrats to @leonleyanghu and Matteo Gamba!
1
0
2
@randall_balestr
Randall Balestriero
2 months
But smoothing your decision boundary also means better adversarial robustness, which again holds out of the box with CT, without any training/finetuning of the original model weights!
1
0
3
@randall_balestr
Randall Balestriero
2 months
Smoothing the model's curvature improves transfer-learning performance (duh), but we actually compete with LoRA despite having far fewer trainable parameters (even against rank-1 LoRA). This holds across pretty much all datasets/models we tried!
1
0
1
@randall_balestr
Randall Balestriero
2 months
The CT implementation is super simple: just swap the activation function in your deep network with ours, and manually tune or learn the corresponding curvature parameter! This works e.g. on ResNets but also on ViTs! And the overhead is minimal (less than LoRA).
2
0
3
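A hedged sketch of the activation-swap recipe from the post above. The specific smooth activation below (a sharpness-parameterized softplus) is an illustrative stand-in, not the actual CT activation, and the frozen-backbone usage is an assumption about how one might tune only the curvature parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

# Illustrative stand-in for a curvature-parameterized activation: a softplus
# whose sharpness `beta` controls how closely it approximates ReLU.
# This is NOT the actual CT activation, only a sketch of the swap mechanism.
class SmoothActivation(nn.Module):
    def __init__(self, init_beta=10.0, learnable=True):
        super().__init__()
        beta = torch.tensor(float(init_beta))
        self.beta = nn.Parameter(beta) if learnable else beta

    def forward(self, x):
        # softplus_beta(x) = softplus(beta * x) / beta -> ReLU as beta grows.
        return F.softplus(self.beta * x) / self.beta

def swap_activations(module, init_beta=10.0, learnable=True):
    """Recursively replace every nn.ReLU with the smooth stand-in, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, SmoothActivation(init_beta, learnable))
        else:
            swap_activations(child, init_beta, learnable)
    return module

# Usage sketch: freeze the pretrained weights, train only the curvature knobs.
model = swap_activations(torchvision.models.resnet18(weights="IMAGENET1K_V1"))
for p in model.parameters():
    p.requires_grad_(False)
for m in model.modules():
    if isinstance(m, SmoothActivation) and isinstance(m.beta, nn.Parameter):
        m.beta.requires_grad_(True)
```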
@randall_balestr
Randall Balestriero
2 months
CurvatureTuning2.0! Provably steer/finetune your model's curvature without changing its parameters, in a way that LoRA provably can't:
- strong theory (Sobolev-space projection)
- strong experiments (dozens of models/datasets, manual/learnable steering)
4
9
56
@randall_balestr
Randall Balestriero
2 months
Huge congrats to MVP @thomas_M_walker, @imtiazprio and @rbaraniuk! Ping us with questions or comments!
0
0
6
@randall_balestr
Randall Balestriero
2 months
Our solution is much faster than alternatives, e.g. weight regularization, adversarial training, or even GrokFast (or rather GrokNotSoFast, apparently), and it is computationally tractable, requiring nearly no change to your existing training pipeline!
1
0
8
@randall_balestr
Randall Balestriero
2 months
Based on that theory, we found a regularizer that speeds up the convergence of the Jacobian matrices, resulting in faster Grokking and better training dynamics in general, and that works across transformer and non-transformer architectures! Play with the code:
1
0
7
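The exact regularizer lives in the linked code; as a generic illustration of how a Jacobian-based penalty plugs into a training step, here is a stochastic (Hutchinson-style) estimate of the input-output Jacobian's Frobenius norm. The probe count and the `lambda_jac` weighting are assumptions, and this is not the paper's regularizer.

```python
import torch

# Generic stochastic Jacobian penalty (Hutchinson-style probing), shown only
# to illustrate how a Jacobian-based regularizer enters a training step.
# It is NOT the specific regularizer proposed in the paper.
def jacobian_penalty(model, x, num_probes=1):
    x = x.detach().requires_grad_(True)
    out = model(x)
    penalty = 0.0
    for _ in range(num_probes):
        u = torch.randn_like(out)
        # u^T J via one backward pass; E||u^T J||^2 = ||J||_F^2 for u ~ N(0, I).
        vjp = torch.autograd.grad((out * u).sum(), x, create_graph=True)[0]
        penalty = penalty + vjp.pow(2).flatten(1).sum(dim=1).mean()
    return penalty / num_probes

# Usage sketch inside a standard step (`lambda_jac` is an assumed weighting):
# loss = criterion(model(x), y) + lambda_jac * jacobian_penalty(model, x)
```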