Annabelle Michael Carrell

@annabelle_cs

Followers: 589 · Following: 1K · Media: 10 · Statuses: 640

Cambridge machine learning PhD student. Formerly Amazon, @JohnsHopkins. 🏳️‍🌈 she/her

Baltimore, MD
Joined December 2017
@annabelle_cs
Annabelle Michael Carrell
4 months
So you want to skip our thinning proofs—but you’d still like our out-of-the-box attention speedups? I’ll be presenting the Thinformer in two ICML workshop posters tomorrow! Catch me at Es-FoMo (1-2:30, East hall A) and at LCFM (10:45-11:30 & 3:30-4:30, West 202-204)
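For readers who just want the flavor of the "out-of-the-box attention speedups": the sketch below shows where a thinned key/value summary would plug into softmax attention. This is not the Thinformer construction from the paper; uniform subsampling stands in for the coreset a thinning algorithm would build, and all names and sizes here are illustrative.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Exact softmax attention: builds the full n x n score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def thinned_attention(Q, K, V, idx):
    # Attend only to a small subset of key/value rows (indices idx).
    # Uniform subsampling stands in for the carefully chosen summary
    # a thinning algorithm would produce.
    scores = Q @ K[idx].T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V[idx]

rng = np.random.default_rng(0)
n, d, m = 2048, 64, 128
Q, K, V = rng.standard_normal((3, n, d))
idx = rng.choice(n, size=m, replace=False)
approx = thinned_attention(Q, K, V, idx)   # O(n*m) scores instead of O(n*n)
exact = full_attention(Q, K, V)
print("mean abs error:", np.abs(exact - approx).mean())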
@annabelle_cs
Annabelle Michael Carrell
4 months
Your data is low-rank, so stop wasting compute! In our new paper on low-rank thinning, we share one weird trick to speed up Transformer inference, SGD training, and hypothesis testing at scale. Come by ICML poster W-1012 Tuesday at 4:30!
0
4
7
@annabelle_cs
Annabelle Michael Carrell
4 months
If you’re not at ICML, don’t worry! You can still read our work. Our new theoretically principled algorithms beat recent baselines across multiple tasks—including Transformer approximation!
arxiv.org (link preview): The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of...
0
2
5
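To make the thinning goal in the abstract snippet concrete: summarize many points with a few representatives while keeping a kernel discrepancy small. The sketch below measures that discrepancy (squared maximum mean discrepancy under a Gaussian kernel) for a naive uniform coreset; it does not implement Kernel Halving or Compress, the algorithms that provably do better.

import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between the rows of X and Y.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    # Squared maximum mean discrepancy between the two empirical distributions.
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 2))                       # full dataset
coreset = X[rng.choice(len(X), size=32, replace=False)]  # naive thinned summary
print("MMD^2(full, coreset) =", mmd2(X, coreset))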
@annabelle_cs
Annabelle Michael Carrell
4 months
Your data is low-rank, so stop wasting compute! In our new paper on low-rank thinning, we share one weird trick to speed up Transformer inference, SGD training, and hypothesis testing at scale. Come by ICML poster W-1012 Tuesday at 4:30!
@fhuszar
Ferenc Huszár
4 months
At ICML this week? Check out @annabelle_cs's paper in collaboration with @LesterMackey and colleagues on Low-Rank Thinning! ⏰ Tue 15 Jul 4:30 - 7 p.m. PDT New theory, dataset compression, efficient attention and more:
3
3
29
@OjewaleV
Ojewale Victor
3 years
GRAD SCHOOL APPLICATION (2.0) 🧵 Got multiple fully funded PhD offers recently and realized from conversations I have been having that many people don't approach the application process intentionally. Sharing my application process doc as an example below. Open and Retweet 🔃
4
10
24
@XinyiChen2
Xinyi Chen
3 years
Optimizer tuning can be manual and resource-intensive. Can we learn the best optimizer automatically with guarantees? With @HazanPrinceton, we give new provable methods for learning optimizers using a control approach. Excited about this result! https://t.co/GTpNSdcQlm (1/n)
2
14
150
@AshokCutkosky
Ashok Cutkosky
3 years
Neural networks are non-convex and non-smooth. Unfortunately, most theoretical analysis is either convex or smooth. Should we abandon the past? No! With @bremen79 and @n0royalroad, we import prior know-how via an *online to non-convex* conversion: https://t.co/EzbaXO8dtT.
arxiv.org (link preview): We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a...
5
30
197
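A heavily simplified cartoon of the online-to-non-convex conversion the tweet describes (my paraphrase, not the paper's algorithm or guarantees): an online learner proposes each parameter increment, the stochastic gradient at the resulting iterate defines the linear loss the learner is charged with, and a random iterate is returned. Instantiating the learner as projected online gradient descent gives the SGD-like sketch below.

import numpy as np

def online_to_nonconvex_sketch(grad, x0, steps=300, lr=0.05, radius=0.5, seed=0):
    # Cartoon of the conversion: an online learner (projected online gradient
    # descent on linear losses <g_t, .>) proposes each increment delta; the
    # iterate moves by delta, and the stochastic gradient at the new iterate
    # becomes the next linear loss. A uniformly random iterate is returned.
    rng = np.random.default_rng(seed)
    x, delta = x0.copy(), np.zeros_like(x0)
    iterates = []
    for _ in range(steps):
        x = x + delta                 # apply the online learner's play
        g = grad(x, rng)              # stochastic gradient defines the loss
        delta = delta - lr * g        # online gradient descent step
        norm = np.linalg.norm(delta)
        if norm > radius:             # project back onto ||delta|| <= radius
            delta *= radius / norm
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]

# Toy non-smooth, non-convex objective: f(x) = ||x||_1 + 0.1 sin(sum(x)).
def noisy_grad(x, rng):
    return np.sign(x) + 0.1 * np.cos(x.sum()) + 0.01 * rng.standard_normal(x.shape)

x_out = online_to_nonconvex_sketch(noisy_grad, x0=np.ones(10))
print("||x_out||_1 =", np.abs(x_out).sum())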
@PreetumNakkiran
Preetum Nakkiran
3 years
My favorite non-ML paper I read this year is probably "Bayesian Persuasion" (2011), which I somehow only found out about recently. Simple & beautiful. The first 2 pages are sufficient to be persuaded. https://t.co/HwYfiCmN5u
8
141
1K
@Michael_J_Black
Michael Black
3 years
In the LLM-science discussion, I see a common misconception that science is a thing you do and that writing about it is separate and can be automated. I’ve written over 300 scientific papers and can assure you that science writing can’t be separated from science doing. Why? 1/18
38
468
2K
@SuryaGanguli
Surya Ganguli
3 years
1/ Is scale all you need for AGI? (Unlikely.) But our new paper "Beyond neural scaling laws: beating power law scaling via data pruning" shows how to achieve much superior exponential decay of error with dataset size rather than slow power-law neural scaling https://t.co/Vn62UJXGTd
9
152
853
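A minimal sketch of the score-based data-pruning recipe the tweet is about: rank training examples by a difficulty score and keep only a fraction of them. The distance_to_class_mean score below is a hypothetical stand-in for the paper's pruning metrics, included only to make the interface concrete.

import numpy as np

def prune_dataset(X, y, score_fn, keep_frac=0.3, keep="hard"):
    # Keep a fraction of examples ranked by a difficulty score:
    # "hard" keeps the highest-scoring examples, "easy" the lowest.
    scores = score_fn(X, y)
    order = np.argsort(scores)                  # easy -> hard
    k = max(1, int(len(X) * keep_frac))
    idx = order[-k:] if keep == "hard" else order[:k]
    return X[idx], y[idx]

def distance_to_class_mean(X, y):
    # Hypothetical stand-in for a pruning metric: distance to the class mean.
    means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
    return np.linalg.norm(X - means[y], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))
y = rng.integers(0, 2, size=1000)
X_pruned, y_pruned = prune_dataset(X, y, distance_to_class_mean, keep_frac=0.3)
print(X_pruned.shape)   # (300, 16)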
@sp_monte_carlo
Sam Power
3 years
cool uploads: https://t.co/82lFgyBDrD 'Understanding Linchpin Variables in Markov Chain Monte Carlo' - Dootika Vats, Felipe Acosta, Mark L. Huber, Galin L. Jones
1
4
23
@annabelle_cs
Annabelle Michael Carrell
3 years
Lots of prior work is unified by our perspective: techniques which decrease the generalization gap (more data, augmentation, regularization, etc.) also improve calibration. Excited by these results? Stay tuned! A full version of our work is coming soon to an arXiv near you. :)
1
0
11
@annabelle_cs
Annabelle Michael Carrell
3 years
For example, here we track the calibration of a single model as it trains. Early in training, the test error is high, but both test ECE & train ECE are remarkably low! Later, as the error-generalization gap grows, so too does the test ECE.
1
0
8
@annabelle_cs
Annabelle Michael Carrell
3 years
Claims 1 & 2 imply test calibration is ~upper-bounded by the error generalization gap. Models with small generalization gap have small test calibration error. Conclusion: underparameterized models (even low-accuracy ones!) are well-calibrated, while interpolating models aren’t.
1
1
9
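Spelled out (my notation; ECE is the calibration metric used in the thread and Err the classification error), the chain of reasoning behind this tweet is roughly:

\begin{aligned}
\mathrm{ECE}_{\mathrm{test}}
&= \mathrm{ECE}_{\mathrm{train}} + \underbrace{\left(\mathrm{ECE}_{\mathrm{test}} - \mathrm{ECE}_{\mathrm{train}}\right)}_{\text{calibration generalization gap}}
&& \text{(decomposition)} \\
&\lesssim \mathrm{ECE}_{\mathrm{train}} + \left(\mathrm{Err}_{\mathrm{test}} - \mathrm{Err}_{\mathrm{train}}\right)
&& \text{(Claim 2)} \\
&\approx \mathrm{Err}_{\mathrm{test}} - \mathrm{Err}_{\mathrm{train}}
&& \text{(Claim 1: } \mathrm{ECE}_{\mathrm{train}} \approx 0\text{)}
\end{aligned}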
@annabelle_cs
Annabelle Michael Carrell
3 years
We introduce two claims. 1) “Most” DNNs are well-calibrated at train time. 2) The calibration generalization gap, the difference between test and train calibration, is upper-bounded by the error generalization gap. We verify our claims empirically across arch, opt, etc.
1
2
6
@annabelle_cs
Annabelle Michael Carrell
3 years
To address this, we propose studying calibration by decomposing it into two terms: an optimization quantity (Train Calibration) and a generalization quantity (the Calibration Generalization Gap). This mirrors the fundamental decomposition of generalization theory.
1
0
8
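To make the two terms concrete, here is a minimal sketch using standard binned ECE as the calibration metric; the binning and the synthetic data are illustrative, not the paper's exact estimator or experiments.

import numpy as np

def ece(confidences, correct, n_bins=15):
    # Standard binned expected calibration error:
    # weighted average over bins of |accuracy - mean confidence|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            err += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return err

def calibration_decomposition(train_conf, train_correct, test_conf, test_correct):
    train_ece, test_ece = ece(train_conf, train_correct), ece(test_conf, test_correct)
    return {"train_calibration": train_ece,
            "calibration_generalization_gap": test_ece - train_ece,
            "test_calibration": test_ece}

# Synthetic example: a roughly calibrated train set, an overconfident test set.
rng = np.random.default_rng(0)
train_conf = rng.uniform(0.5, 1.0, 5000)
test_conf = rng.uniform(0.5, 1.0, 5000)
train_correct = rng.random(5000) < train_conf
test_correct = rng.random(5000) < test_conf - 0.1
print(calibration_decomposition(train_conf, train_correct, test_conf, test_correct))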
@annabelle_cs
Annabelle Michael Carrell
3 years
New ICML workshop paper 🚨 https://t.co/5wR1ZqbOyj. Are deep neural nets calibrated? The literature is conflicted... bc the question itself has changed over time: as architectures, optimizers, and datasets evolve, it is difficult to disentangle factors which affect calibration.
5
42
228
@AnimaAnandkumar
Prof. Anima Anandkumar
3 years
We resolve an open problem: Thompson sampling has optimal regret for linear quadratic control in any dimension; previously this was proven only in one dimension. We develop a novel lower bound on the probability that TS gives an optimistic sample. @SahinLale @tkargin_ @Azizzadenesheli @caltech
@Arxiv_Daily
arXiv Daily
3 years
Thompson Sampling Achieves Õ(√(T)) Regret in Linear Quadratic Control https://t.co/sHMm3SS713 by @tkargin_ et al. including @SahinLale, @AnimaAnandkumar #Probability #ThompsonSampling
0
10
73
@konstmish
Konstantin Mishchenko
3 years
Five years ago, I started my first optimization project, which was about asynchronous gradient descent. Today, I'm happy to present our new work (with @BachFrancis, M. Even and B. Woodworth) where we finally prove: Delays do not matter. https://t.co/5mO5ozEeDj 🧵1/5
2
43
317
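For readers unfamiliar with the setting: in delayed (asynchronous) SGD, each update applies a gradient computed at a stale iterate. The toy sketch below only illustrates that setting; the step size, delay, and objective are arbitrary, and the "delays do not matter" guarantees are the paper's, not this snippet's.

import numpy as np

def delayed_sgd(grad, x0, tau=10, steps=500, lr=0.05):
    # Each update applies a gradient evaluated at the iterate from tau steps
    # ago, mimicking asynchronous workers that return stale gradients.
    history = [x0.copy()]
    x = x0.copy()
    for t in range(steps):
        stale_iterate = history[max(0, t - tau)]
        x = x - lr * grad(stale_iterate)
        history.append(x.copy())
    return x

# Simple quadratic f(x) = 0.5 ||x||^2, so grad(x) = x; with a small enough
# step size the iterates still converge despite the stale gradients.
x_final = delayed_sgd(lambda x: x, x0=np.ones(5), tau=10)
print("||x_final|| =", np.linalg.norm(x_final))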
@PetarV_93
Petar Veličković
3 years
Proud to share our CLRS benchmark: probing GNNs to execute 30 diverse algorithms! ⚡️ https://t.co/jgynKK5XrN https://t.co/frCl11JpkW (@icmlconf'22) Find out all about our 2-year effort below! 🧵 w/ Adrià @davidmbudden @rpascanu @AndreaBanino Misha @RaiaHadsell @BlundellCharles
3
54
267
@aminkarbasi
Amin Karbasi
4 years
Gradient Descent provably generalizes. I should say that our thinking was shaped and influenced by the amazing work done by the one and only @DimitrisPapail, the amazing couple @roydanroy and @gkdziugaite and of course @neu_rips, @mraginsky, @mrtz, @beenwrekt
@DKalogerias
Dionysis Kalogerias
4 years
Does full-batch Gradient Descent (GD) generalize efficiently? We provide a rather positive answer for smooth, possibly non-Lipschitz losses. Check our paper today at https://t.co/vlrXz0XyZy. With @aminkarbasi, and our amazing postdocs Kostas Nikolakakis and @Farzinhaddadpou 1/n
0
8
46