Annabelle Michael Carrell

@annabelle_cs

Followers: 589 · Following: 1K · Media: 10 · Statuses: 640

Cambridge machine learning PhD student. Formerly Amazon, @JohnsHopkins. 🏳️‍🌈 she/her

Baltimore, MD
Joined December 2017
@annabelle_cs
Annabelle Michael Carrell
4 months
So you want to skip our thinning proofs—but you’d still like our out-of-the-box attention speedups? I’ll be presenting the Thinformer in two ICML workshop posters tomorrow! Catch me at Es-FoMo (1-2:30, East hall A) and at LCFM (10:45-11:30 & 3:30-4:30, West 202-204)
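For readers who just want the flavor of the "out-of-the-box attention speedups": the sketch below shows where a thinned key/value summary would plug into softmax attention. This is not the Thinformer construction from the paper; uniform subsampling stands in for the coreset a thinning algorithm would build, and all names and sizes here are illustrative.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Exact softmax attention: builds the full n x n score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def thinned_attention(Q, K, V, idx):
    # Attend only to a small subset of key/value rows (indices idx).
    # Uniform subsampling stands in for the carefully chosen summary
    # a thinning algorithm would produce.
    scores = Q @ K[idx].T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V[idx]

rng = np.random.default_rng(0)
n, d, m = 2048, 64, 128
Q, K, V = rng.standard_normal((3, n, d))
idx = rng.choice(n, size=m, replace=False)
approx = thinned_attention(Q, K, V, idx)   # O(n*m) scores instead of O(n*n)
exact = full_attention(Q, K, V)
print("mean abs error:", np.abs(exact - approx).mean())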
@annabelle_cs
Annabelle Michael Carrell
4 months
Your data is low-rank, so stop wasting compute! In our new paper on low-rank thinning, we share one weird trick to speed up Transformer inference, SGD training, and hypothesis testing at scale. Come by ICML poster W-1012 Tuesday at 4:30!
0
4
7
@annabelle_cs
Annabelle Michael Carrell
4 months
If you’re not at ICML, don’t worry! You can still read our work. Our new theoretically principled algorithms beat recent baselines across multiple tasks—including Transformer approximation!
arxiv.org (link preview): The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of...
0
2
5
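To make the thinning goal in the abstract snippet concrete: summarize many points with a few representatives while keeping a kernel discrepancy small. The sketch below measures that discrepancy (squared maximum mean discrepancy under a Gaussian kernel) for a naive uniform coreset; it does not implement Kernel Halving or Compress, the algorithms that provably do better.

import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    # Pairwise Gaussian kernel matrix between the rows of X and Y.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth=1.0):
    # Squared maximum mean discrepancy between the two empirical distributions.
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 2))                       # full dataset
coreset = X[rng.choice(len(X), size=32, replace=False)]  # naive thinned summary
print("MMD^2(full, coreset) =", mmd2(X, coreset))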
@annabelle_cs
Annabelle Michael Carrell
4 months
Your data is low-rank, so stop wasting compute! In our new paper on low-rank thinning, we share one weird trick to speed up Transformer inference, SGD training, and hypothesis testing at scale. Come by ICML poster W-1012 Tuesday at 4:30!
@fhuszar
Ferenc Huszár
4 months
At ICML this week? Check out @annabelle_cs's paper in collaboration with @LesterMackey and colleagues on Low-Rank Thinning! ⏰ Tue 15 Jul 4:30 - 7 p.m. PDT New theory, dataset compression, efficient attention and more:
3
3
29
@OjewaleV
Ojewale Victor
3 years
GRAD SCHOOL APPLICATION (2.0) 🧵 Got multiple fully funded PhD offers recently and realized from conversations I have been having that many people don't approach the application process intentionally. Sharing my application process doc as an example below. Open and Retweet 🔃
4
10
24
@XinyiChen2
Xinyi Chen
3 years
Optimizer tuning can be manual and resource-intensive. Can we learn the best optimizer automatically with guarantees? With @HazanPrinceton, we give new provable methods for learning optimizers using a control approach. Excited about this result! https://t.co/GTpNSdcQlm (1/n)
2
14
150
@AshokCutkosky
Ashok Cutkosky
3 years
Neural networks are non-convex and non-smooth. Unfortunately, most theoretical analysis is either convex or smooth. Should we abandon the past? No! With @bremen79 and @n0royalroad, we import prior know-how via an *online to non-convex* conversion: https://t.co/EzbaXO8dtT.
arxiv.org (link preview): We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a...
5
30
197
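A heavily simplified cartoon of the online-to-non-convex conversion the tweet describes (my paraphrase, not the paper's algorithm or guarantees): an online learner proposes each parameter increment, the stochastic gradient at the resulting iterate defines the linear loss the learner is charged with, and a random iterate is returned. Instantiating the learner as projected online gradient descent gives the SGD-like sketch below.

import numpy as np

def online_to_nonconvex_sketch(grad, x0, steps=300, lr=0.05, radius=0.5, seed=0):
    # Cartoon of the conversion: an online learner (projected online gradient
    # descent on linear losses <g_t, .>) proposes each increment delta; the
    # iterate moves by delta, and the stochastic gradient at the new iterate
    # becomes the next linear loss. A uniformly random iterate is returned.
    rng = np.random.default_rng(seed)
    x, delta = x0.copy(), np.zeros_like(x0)
    iterates = []
    for _ in range(steps):
        x = x + delta                 # apply the online learner's play
        g = grad(x, rng)              # stochastic gradient defines the loss
        delta = delta - lr * g        # online gradient descent step
        norm = np.linalg.norm(delta)
        if norm > radius:             # project back onto ||delta|| <= radius
            delta *= radius / norm
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]

# Toy non-smooth, non-convex objective: f(x) = ||x||_1 + 0.1 sin(sum(x)).
def noisy_grad(x, rng):
    return np.sign(x) + 0.1 * np.cos(x.sum()) + 0.01 * rng.standard_normal(x.shape)

x_out = online_to_nonconvex_sketch(noisy_grad, x0=np.ones(10))
print("||x_out||_1 =", np.abs(x_out).sum())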
@PreetumNakkiran
Preetum Nakkiran
3 years
My favorite non-ML paper I read this year is probably "Bayesian Persuasion" (2011), which I somehow only found out about recently. Simple & beautiful. The first 2 pages are sufficient to be persuaded. https://t.co/HwYfiCmN5u
8
141
1K
@Michael_J_Black
Michael Black
3 years
In the LLM-science discussion, I see a common misconception that science is a thing you do and that writing about it is separate and can be automated. I’ve written over 300 scientific papers and can assure you that science writing can’t be separated from science doing. Why? 1/18
38
468
2K
@SuryaGanguli
Surya Ganguli
3 years
1/ Is scale all you need for AGI? (Unlikely.) But our new paper "Beyond neural scaling laws: beating power law scaling via data pruning" shows how to achieve much superior exponential decay of error with dataset size rather than slow power-law neural scaling https://t.co/Vn62UJXGTd
9
152
853
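A minimal sketch of the score-based data-pruning recipe the tweet is about: rank training examples by a difficulty score and keep only a fraction of them. The distance_to_class_mean score below is a hypothetical stand-in for the paper's pruning metrics, included only to make the interface concrete.

import numpy as np

def prune_dataset(X, y, score_fn, keep_frac=0.3, keep="hard"):
    # Keep a fraction of examples ranked by a difficulty score:
    # "hard" keeps the highest-scoring examples, "easy" the lowest.
    scores = score_fn(X, y)
    order = np.argsort(scores)                  # easy -> hard
    k = max(1, int(len(X) * keep_frac))
    idx = order[-k:] if keep == "hard" else order[:k]
    return X[idx], y[idx]

def distance_to_class_mean(X, y):
    # Hypothetical stand-in for a pruning metric: distance to the class mean.
    means = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
    return np.linalg.norm(X - means[y], axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))
y = rng.integers(0, 2, size=1000)
X_pruned, y_pruned = prune_dataset(X, y, distance_to_class_mean, keep_frac=0.3)
print(X_pruned.shape)   # (300, 16)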
@sp_monte_carlo
Sam Power
3 years
cool uploads: https://t.co/82lFgyBDrD 'Understanding Linchpin Variables in Markov Chain Monte Carlo' - Dootika Vats, Felipe Acosta, Mark L. Huber, Galin L. Jones
1
4
23
@annabelle_cs
Annabelle Michael Carrell
3 years
Lots of prior work is unified by our perspective: techniques which decrease the generalization gap (more data, augmentation, regularization, etc.) also improve calibration. Excited by these results? Stay tuned! A full version of our work is coming soon to an arXiv near you. :)
1
0
11
@annabelle_cs
Annabelle Michael Carrell
3 years
For example, here we track the calibration of a single model as it trains. Early in training, the test error is high, but both test ECE & train ECE are remarkably low! Later, as the error-generalization gap grows, so too does the test ECE.
1
0
8
@annabelle_cs
Annabelle Michael Carrell
3 years
Claims 1 & 2 imply test calibration is ~upper-bounded by the error generalization gap. Models with small generalization gap have small test calibration error. Conclusion: underparameterized models (even low-accuracy ones!) are well-calibrated, while interpolating models aren’t.
1
1
9
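Spelled out (my notation; ECE is the calibration metric used in the thread and Err the classification error), the chain of reasoning behind this tweet is roughly:

\begin{aligned}
\mathrm{ECE}_{\mathrm{test}}
&= \mathrm{ECE}_{\mathrm{train}} + \underbrace{\left(\mathrm{ECE}_{\mathrm{test}} - \mathrm{ECE}_{\mathrm{train}}\right)}_{\text{calibration generalization gap}}
&& \text{(decomposition)} \\
&\lesssim \mathrm{ECE}_{\mathrm{train}} + \left(\mathrm{Err}_{\mathrm{test}} - \mathrm{Err}_{\mathrm{train}}\right)
&& \text{(Claim 2)} \\
&\approx \mathrm{Err}_{\mathrm{test}} - \mathrm{Err}_{\mathrm{train}}
&& \text{(Claim 1: } \mathrm{ECE}_{\mathrm{train}} \approx 0\text{)}
\end{aligned}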
@annabelle_cs
Annabelle Michael Carrell
3 years
We introduce two claims. 1) “Most” DNNs are well-calibrated at train time. 2) The calibration generalization gap, the difference between test and train calibration, is upper-bounded by the error generalization gap. We verify our claims empirically across arch, opt, etc.
1
2
6
@annabelle_cs
Annabelle Michael Carrell
3 years
To address this, we propose studying calibration by decomposing it into two terms: an optimization quantity (Train Calibration) and a generalization quantity (the Calibration Generalization Gap). This mirrors the fundamental decomposition of generalization theory.
1
0
8
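To make the two terms concrete, here is a minimal sketch using standard binned ECE as the calibration metric; the binning and the synthetic data are illustrative, not the paper's exact estimator or experiments.

import numpy as np

def ece(confidences, correct, n_bins=15):
    # Standard binned expected calibration error:
    # weighted average over bins of |accuracy - mean confidence|.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            err += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return err

def calibration_decomposition(train_conf, train_correct, test_conf, test_correct):
    train_ece, test_ece = ece(train_conf, train_correct), ece(test_conf, test_correct)
    return {"train_calibration": train_ece,
            "calibration_generalization_gap": test_ece - train_ece,
            "test_calibration": test_ece}

# Synthetic example: a roughly calibrated train set, an overconfident test set.
rng = np.random.default_rng(0)
train_conf = rng.uniform(0.5, 1.0, 5000)
test_conf = rng.uniform(0.5, 1.0, 5000)
train_correct = rng.random(5000) < train_conf
test_correct = rng.random(5000) < test_conf - 0.1
print(calibration_decomposition(train_conf, train_correct, test_conf, test_correct))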
@annabelle_cs
Annabelle Michael Carrell
3 years
New ICML workshop paper 🚨 https://t.co/5wR1ZqbOyj. Are deep neural nets calibrated? The literature is conflicted... bc the question itself has changed over time: as architectures, optimizers, and datasets evolve, it is difficult to disentangle factors which affect calibration.
5
42
228
@AnimaAnandkumar
Prof. Anima Anandkumar
3 years
We resolve an open problem: Thompson sampling has optimal regret for linear quadratic control in any dimension; previously this was proven only in one dimension. We develop a novel lower bound on the probability that TS gives an optimistic sample. @SahinLale @tkargin_ @Azizzadenesheli @caltech
@Arxiv_Daily
arXiv Daily
3 years
Thompson Sampling Achieves Õ(√(T)) Regret in Linear Quadratic Control https://t.co/sHMm3SS713 by @tkargin_ et al. including @SahinLale, @AnimaAnandkumar #Probability #ThompsonSampling
0
10
73
@konstmish
Konstantin Mishchenko
3 years
Five years ago, I started my first optimization project, which was about asynchronous gradient descent. Today, I'm happy to present our new work (with @BachFrancis, M. Even and B. Woodworth) where we finally prove: Delays do not matter. https://t.co/5mO5ozEeDj 🧵1/5
2
43
317
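For readers unfamiliar with the setting: in delayed (asynchronous) SGD, each update applies a gradient computed at a stale iterate. The toy sketch below only illustrates that setting; the step size, delay, and objective are arbitrary, and the "delays do not matter" guarantees are the paper's, not this snippet's.

import numpy as np

def delayed_sgd(grad, x0, tau=10, steps=500, lr=0.05):
    # Each update applies a gradient evaluated at the iterate from tau steps
    # ago, mimicking asynchronous workers that return stale gradients.
    history = [x0.copy()]
    x = x0.copy()
    for t in range(steps):
        stale_iterate = history[max(0, t - tau)]
        x = x - lr * grad(stale_iterate)
        history.append(x.copy())
    return x

# Simple quadratic f(x) = 0.5 ||x||^2, so grad(x) = x; with a small enough
# step size the iterates still converge despite the stale gradients.
x_final = delayed_sgd(lambda x: x, x0=np.ones(5), tau=10)
print("||x_final|| =", np.linalg.norm(x_final))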
@PetarV_93
Petar Veličković
3 years
Proud to share our CLRS benchmark: probing GNNs to execute 30 diverse algorithms! ⚡️ https://t.co/jgynKK5XrN https://t.co/frCl11JpkW (@icmlconf'22) Find out all about our 2-year effort below! 🧵 w/ Adrià @davidmbudden @rpascanu @AndreaBanino Misha @RaiaHadsell @BlundellCharles
3
54
267
@aminkarbasi
Amin Karbasi
4 years
Gradient Descent provably generalizes. I should say that our thinking was shaped and influenced by the amazing work done by the one and only @DimitrisPapail, the amazing couple @roydanroy and @gkdziugaite and of course @neu_rips, @mraginsky, @mrtz, @beenwrekt
@DKalogerias
Dionysis Kalogerias
4 years
Does full-batch Gradient Descent (GD) generalize efficiently? We provide a rather positive answer for smooth, possibly non-Lipschitz losses. Check our paper today at https://t.co/vlrXz0XyZy. With @aminkarbasi, and our amazing postdocs Kostas Nikolakakis and @Farzinhaddadpou 1/n
0
8
46