Mufan (Bill) Li Profile
Mufan (Bill) Li

@mufan_li

Followers: 855
Following: 496
Media: 78
Statuses: 383
Pinned Tweet
@mufan_li
Mufan (Bill) Li
1 month
I’m excited to announce that in July 2025 I will be joining @UWaterloo as an Assistant Professor in the Department of Statistics and Actuarial Science! Until then, I will continue at Princeton as a DataX Postdoc Fellow, working with Boris Hanin. I have many exciting projects
27
5
218
@mufan_li
Mufan (Bill) Li
2 years
Infinite-width limits don't behave like real networks in many ways. In new work, we identify the "Neural Covariance SDE" underlying the infinite-DEPTH-and-width limit and provide more evidence that this limit better matches the properties of real networks
Tweet media one
6
57
360
@mufan_li
Mufan (Bill) Li
3 years
I wrote a blog post on a gem hidden in an 80-page paper that nobody has time to read or interpret, which, imo, is actually the most interesting result. TL;DR: a new technique to establish Poincaré inequalities, with some interesting consequences
Tweet media one
1
17
112
@mufan_li
Mufan (Bill) Li
3 years
There has been a flurry of work beyond the infinite-width limit. We study the infinite DEPTH-AND-WIDTH limit of ReLU nets with residual connections and see remarkable (!) agreement with STANDARD finite networks. Joint work w/ @MihaiCNica @roydanroy
Tweet media one
4
19
110
@mufan_li
Mufan (Bill) Li
2 years
What is the complexity of sampling using Langevin Monte Carlo (LMC) under a Poincaré inequality? We provide the first answer to this open problem. Joint work with Sinho Chewi, @MuratAErdogdu , Ruoqi Shen, and Matthew Zhang
Tweet media one
2
15
91
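For readers who haven't seen the algorithm, here is a minimal Python sketch of the Langevin Monte Carlo update x_{k+1} = x_k - h grad V(x_k) + sqrt(2h) xi_k. The potential (a Laplace target, which satisfies a Poincaré inequality but not a log-Sobolev inequality), the step size, and the iteration count are toy choices for illustration, not the setting analyzed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy target: density proportional to exp(-V(x)) with V(x) = |x|, i.e. a
# Laplace distribution, which satisfies a Poincare inequality but not a
# log-Sobolev inequality.
def grad_V(x):
    return np.sign(x)

h, n_iters = 0.01, 50_000    # step size and iteration count (arbitrary choices)
x, samples = 0.0, []
for _ in range(n_iters):
    # LMC update: x_{k+1} = x_k - h * grad V(x_k) + sqrt(2h) * standard Gaussian
    x = x - h * grad_V(x) + np.sqrt(2 * h) * rng.standard_normal()
    samples.append(x)

samples = np.array(samples[n_iters // 2:])            # discard burn-in
print(np.mean(np.abs(samples)), np.mean(samples**2))  # target: E|X| = 1, E[X^2] = 2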
@mufan_li
Mufan (Bill) Li
6 months
@rasbt @liranringel It turns out that naively stacking non-linearities in deep networks is the core cause of unstable gradients. Shaping them at a precise size-dependent rate is the key to extending the network to arbitrary depth. See e.g.
1
4
41
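A toy numpy illustration of that point (my own sketch, not the construction from the referenced work): backpropagating a cotangent through many naive tanh layers shrinks the gradient, while nudging the activation toward the identity at a depth-dependent rate keeps it at a sane scale. The width, depth, and the particular h + s*(tanh(h) - h) shaping are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 100

def grad_norm(s):
    """Backprop a random cotangent through `depth` layers x -> phi(W x), where
    phi(h) = h + s * (tanh(h) - h); s = 1 is naive tanh, small s shapes the
    activation toward the identity."""
    x = rng.standard_normal(n)
    g = rng.standard_normal(n)                      # cotangent at the output
    for _ in range(depth):
        W = rng.standard_normal((n, n)) / np.sqrt(n)
        h = W @ x
        x = h + s * (np.tanh(h) - h)
        dphi = 1.0 + s * (1.0 / np.cosh(h) ** 2 - 1.0)
        g = W.T @ (dphi * g)                        # chain rule through one layer
    return np.linalg.norm(g)

print("naive tanh               :", grad_norm(1.0))
print("shaped, s = 1/sqrt(depth):", grad_norm(1.0 / np.sqrt(depth)))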
@mufan_li
Mufan (Bill) Li
2 years
A really nice paper studying the Riemannian structure of deep linear networks; in particular, the SVD coordinates lead to a very clean formula for the volume form
Tweet media one
0
5
36
@mufan_li
Mufan (Bill) Li
2 years
Come visit our NeurIPS poster tonight at 7:30-9pm EST to learn about the infinite-DEPTH-AND-WIDTH limit of ResNets!
Tweet media one
@mufan_li
Mufan (Bill) Li
3 years
There has been a flurry of work beyond the infinite-width limit. We study the infinite DEPTH-AND-WIDTH limit of ReLU nets with residual connections and see remarkable (!) agreement with STANDARD finite networks. Joint work w/ @MihaiCNica @roydanroy
Tweet media one
4
19
110
0
9
30
@mufan_li
Mufan (Bill) Li
2 years
@ccanonne_ Bakry, Gentil, and Ledoux started their book with this
Tweet media one
1
1
25
@mufan_li
Mufan (Bill) Li
3 years
@karen_ullrich @y0b1byte @BahareFatemi I think "systemic" is key here. From my experience talking to other grad students, essentially everyone admits they have the same problems, even the ones that are doing well in terms of publication. There's definitely something wrong with the environment.
1
0
22
@mufan_li
Mufan (Bill) Li
3 years
Tweet media one
0
3
22
@mufan_li
Mufan (Bill) Li
3 years
@ccanonne_ Someone decided to troll
Tweet media one
2
1
20
@mufan_li
Mufan (Bill) Li
2 years
Martens et al. and Zhang et al. recently proposed to shape the activation to be more like the identity. But how should the shape depend on the size of the network? There’s a very intuitive answer actually: the rate that makes the covariance Markov chain converge to an SDE!
Tweet media one
1
3
20
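For concreteness, here is one way such a shaped activation is often written (a leaky-ReLU-style parameterization in the spirit of this line of work; the constants and normalization are assumptions of this sketch, not a quote from the paper):

\[ \varphi_n(x) \;=\; s_+ \max(x, 0) \;+\; s_- \min(x, 0), \qquad s_\pm \;=\; 1 \pm \frac{c_\pm}{\sqrt{n}} . \]

As the width n grows, \varphi_n approaches the identity, and each layer contributes an O(1/n) drift and an O(1/\sqrt{n}) fluctuation to the covariance, which is exactly the scaling under which a Markov chain converges to an SDE over depth proportional to n.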
@mufan_li
Mufan (Bill) Li
2 years
@shortstein @icmlconf At the same time, I have also received reviews claiming my main theorems are wrong without further explanations 🤷‍♂️🤷‍♂️
0
0
17
@mufan_li
Mufan (Bill) Li
4 years
@mraginsky So much better with the template
Tweet media one
1
0
15
@mufan_li
Mufan (Bill) Li
2 months
@sirbayes To be fair, we have no idea why neural networks can provide good estimates of the score function. Arguably this is as deep a mystery as transformers
2
0
14
@mufan_li
Mufan (Bill) Li
3 years
@sam_power_825 From talking to a former researcher on ADAM, the continuous time ODE system was difficult to work with compared to other simpler algorithms. Even in the convex case, it was non-trivial to construct a Lyapunov function
0
0
14
@mufan_li
Mufan (Bill) Li
6 months
@sp_monte_carlo Currently doing a reading group on this set of notes and just finished Hörmander’s. Super grateful that Sinho suggested we start from Eldredge first, things make a lot more sense here
1
0
14
@mufan_li
Mufan (Bill) Li
6 years
A new blog post on a clever Stone-Weierstrass based technique, illustrated through an alternative proof of Itô's Lemma. This post consumed far too much effort to write, so hopefully it's at least of mediocre quality.
1
3
13
@mufan_li
Mufan (Bill) Li
4 years
@Austen Have you tried teaching intro stats?
Tweet media one
1
0
13
@mufan_li
Mufan (Bill) Li
5 years
@thesasho @radcummings @markmbun Every time I use Jensen, I check the direction of E(X^2) >= (EX)^2 and confirm the difference is variance
0
0
13
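The one-line check behind that habit:

\[ \mathbb{E}[X^2] - (\mathbb{E}X)^2 \;=\; \mathbb{E}\big[(X - \mathbb{E}X)^2\big] \;=\; \mathrm{Var}(X) \;\ge\; 0, \]

so Jensen for the convex map x -> x^2 has to point in the direction E(X^2) >= (EX)^2.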
@mufan_li
Mufan (Bill) Li
1 year
@jasondeanlee Jason Miller’s notes are very short
1
1
12
@mufan_li
Mufan (Bill) Li
4 years
@roydanroy @ilyasut By this standard, every random algorithm performs Bayesian inference.
2
0
11
@mufan_li
Mufan (Bill) Li
1 year
Many great mathematical advances are also just expert usage of integration-by-parts and Cauchy--Schwarz, which seem familiar and unimportant to most people. Bakry, Gentil, and Ledoux certainly agree, as their book dedication shows.
Tweet media one
@ilyasut
Ilya Sutskever
1 year
Many believe that great AI advances must contain a new “idea”. But it is not so: many of AI’s greatest advances had the form “huh, turns out this familiar unimportant idea, when done right, is downright incredible”
42
208
2K
1
0
11
@mufan_li
Mufan (Bill) Li
4 years
@optiML Had a similar experience for a NeurIPS submission. Reviewer made blatantly false claims, I complained to AC, then AC sided with reviewer.
Tweet media one
0
1
10
@mufan_li
Mufan (Bill) Li
2 years
@mraginsky @AlexGDimakis All these papers on scaling laws are hinting that large networks are probably converging. And if we learned anything from statistical physics, it’s probably that we should try to describe the limit instead of banging our heads against the wall with finite size problems
1
0
10
@mufan_li
Mufan (Bill) Li
2 years
Congratulations Jeff! Very well deserved and I’m so happy for you!
@jeffNegrea
Jeffrey Negrea
2 years
I'm pleased to announce that I've accepted a faculty pos'n in the dept. of statistics and actuarial science at my alma mater, @WaterlooMath , and a faculty affiliate pos'n @VectorInst , starting summer 2023! Meanwhile, I will be a postdoc @DSI_UChicago , starting September!
33
8
264
1
1
10
@mufan_li
Mufan (Bill) Li
2 years
The above plot was actually the squared magnitude of post activations in a ReLU network, which forms a random walk converging to a geometric Brownian motion! This recovers known log-Gaussian results, but we also get the nice interpretation of depth-to-width ratio as time.
Tweet media one
2
1
9
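A minimal numpy sketch of that random walk (a plain He-initialized ReLU MLP with no residual connections; the width, depth, and seed are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(1)
n, depth = 200, 400     # the depth-to-width ratio d/n plays the role of time

x = rng.standard_normal(n)
log_sq_norms = [np.log(np.sum(x**2))]
for _ in range(depth):
    W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)   # He initialization
    x = np.maximum(W @ x, 0.0)                           # ReLU post-activations
    log_sq_norms.append(np.log(np.sum(x**2)))

# log ||x_l||^2 takes roughly independent O(1/sqrt(n)) increments: a random walk
# whose scaling limit (d, n -> infinity with d/n -> t) is a Brownian motion with
# drift, i.e. ||x_l||^2 converges to a geometric Brownian motion.
increments = np.diff(log_sq_norms)
print(increments.mean(), increments.std())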
@mufan_li
Mufan (Bill) Li
3 years
@sam_power_825 I think I saw this one on mathematical mathematics memes
Tweet media one
0
0
9
@mufan_li
Mufan (Bill) Li
3 years
@roydanroy @thegautamkamath I remember Kevin tried to sell nonstandard analysis to first year PhD students with “you get to use cool words like ‘ultrafilters’ and ‘hyperfinite’“ and yeah it didn’t work
1
0
8
@mufan_li
Mufan (Bill) Li
3 years
@ccanonne_ I will be forever amazed at this sequence of implications between inequalities, beautifully summarized by Villani in "Optimal Transport, Old and New".
Tweet media one
1
0
8
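One strand of that diagram, quoted from memory rather than from the figure (the Otto–Villani chain): a log-Sobolev inequality implies Talagrand's T_2 transport inequality, which in turn implies a Poincaré inequality,

\[ \mathrm{LSI} \;\Longrightarrow\; \mathrm{T}_2 \;\Longrightarrow\; \mathrm{Poincar\'e}, \]

with comparable constants.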
@mufan_li
Mufan (Bill) Li
3 years
You ever seen a proof SO CLEAN that you just can't help but write it up? I encountered one such proof using Doob's h-transform to show that Dyson Brownian motion consists of independent Brownian motions conditioned never to intersect. Shortest post so far!
Tweet media one
0
1
8
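For readers who haven't seen the statement (this is the textbook form of the result, not a quote from the post): the Doob h-transform of d independent Brownian motions by the Vandermonde determinant h(x) = \prod_{i<j}(x_j - x_i) adds the drift \nabla \log h, giving

\[ dX_i(t) \;=\; dB_i(t) \;+\; \sum_{j \neq i} \frac{dt}{X_i(t) - X_j(t)}, \qquad i = 1, \dots, d, \]

which is Dyson Brownian motion (\beta = 2), i.e. Brownian motions conditioned never to collide.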
@mufan_li
Mufan (Bill) Li
6 months
This perfectly describes my experience in academia, certainly wasn’t thinking of OpenAI when I was reading
@Lauramaywendel
Laura Wendel
6 months
A lot of very smart people work in strange ways / with a lot of quirks (e.g. contemplating for hours and appearing to do nothing, while then suddenly having 100x output burst). This usually makes them not a great fit for traditional corporate world, where you often have to fake
238
2K
9K
0
0
8
@mufan_li
Mufan (Bill) Li
9 months
@sp_monte_carlo If you can write the two laws as time marginals of two diffusion processes with the same diffusion coefficient, then the KL can be upper bounded by the KL of the path measures, which admits a closed form via Girsanov. See eg
0
0
8
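Spelled out, in a standard form (assuming for simplicity a constant shared diffusion coefficient \sigma and a common initialization): if the two laws are the time-T marginals of dX_t = b_P(X_t) dt + \sigma dB_t and dX_t = b_Q(X_t) dt + \sigma dB_t, then by data processing and Girsanov,

\[ \mathrm{KL}(P_T \,\|\, Q_T) \;\le\; \mathrm{KL}\big(P_{[0,T]} \,\|\, Q_{[0,T]}\big) \;=\; \frac{1}{2\sigma^2} \int_0^T \mathbb{E}_{P}\big\| b_P(X_t) - b_Q(X_t) \big\|^2 \, dt . \]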
@mufan_li
Mufan (Bill) Li
2 years
Joint work with @MihaiCNica @roydanroy - thread below
Tweet media one
1
1
7
@mufan_li
Mufan (Bill) Li
3 years
@cjmaddison @roydanroy @thegautamkamath This reminds me of Michel Talagrand writing a new book to make his old book "obsolete"
Tweet media one
0
0
7
@mufan_li
Mufan (Bill) Li
3 years
@sam_power_825 Random sorting networks also have a scaling limit in which each trajectory becomes sinusoidal. I don’t think Duncan is on Twitter, but he does really good work in probability theory; this is from his thesis
Tweet media one
1
1
7
@mufan_li
Mufan (Bill) Li
3 years
As a result, the infinite depth-and-width limit is not Gaussian. This work extends results for fully connected MLPs where the analysis is much simpler. See @MihaiCNica ’s youtube video for an introduction.
2
0
7
@mufan_li
Mufan (Bill) Li
2 years
You might be wondering: where does Brownian motion even come from in a neural network? Well, a central limit theorem of sorts - for random walks! If you reduce the size of each RW step and take more steps, you eventually get a Brownian motion!
Tweet media one
2
0
7
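That is Donsker's invariance principle: for i.i.d. centered steps \xi_k with unit variance, the rescaled walk

\[ W^{(n)}(t) \;=\; \frac{1}{\sqrt{n}} \sum_{k=1}^{\lfloor nt \rfloor} \xi_k \]

converges weakly to a Brownian motion B(t) as n \to \infty.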
@mufan_li
Mufan (Bill) Li
2 months
@sp_monte_carlo On a related note, I never understood the point of drawing any diagrams for neural network architectures. The transformer diagram was particularly confusing for me. Just write it out in matrix notation. It’s like two lines and much more clear.
1
0
7
@mufan_li
Mufan (Bill) Li
2 years
First paper here is easily the most underrated paper of the year. It’s the first algorithm-independent sampling lower bound, and gets an unexpected log-log dependence on the condition number.
@sp_monte_carlo
Sam Power
2 years
‘The query complexity of sampling from strongly log-concave distributions in one dimension’ ‘Rejection sampling from shape-constrained distributions in sublinear time’
1
0
3
1
0
7
@mufan_li
Mufan (Bill) Li
9 months
@deepcohen There is also a strange cognitive dissonance that somehow adding experiments to a theory paper hurts the original paper. Surely some experiments are better than none, if no strong claims are being made with them?
3
0
7
@mufan_li
Mufan (Bill) Li
11 months
Infinite depth transformers! It's only @ChuningLi 's first paper, and I wish my first paper was this good!
@lorenzo_noci
Lorenzo Noci
11 months
How do you scale Transformers to infinite depth while ensuring numerical stability? In fact, LayerNorm is not enough. But *shaping* the attention mechanism works! w/ @ChuningLi @mufan_li @bobby_he @THofmann2017 @cjmaddison @roydanroy
Tweet media one
6
35
217
0
0
7
@mufan_li
Mufan (Bill) Li
2 years
If you have any questions, or are interested in the many exciting directions this work opens up, feel free to reach out! I would be happy to chat :)
0
0
6
@mufan_li
Mufan (Bill) Li
2 years
For practitioners using shaping to improve training: beware, the correlation distribution is pretty heavily skewed! Shaping with infinite-width theory is not quite sufficient to fully prevent degeneracy.
Tweet media one
1
0
6
@mufan_li
Mufan (Bill) Li
2 years
You might already realize what’s going to happen: identify the covariance Markov chain and derive the SDE. However, there’s just one little problem: the usual covariance Markov chain is more like a recursion (observe the f(Y) here), and it doesn’t converge to an SDE!
Tweet media one
1
1
6
@mufan_li
Mufan (Bill) Li
2 years
@blairbilodeau Bro it’s obviously 50% because it either happens or it doesn’t
1
0
6
@mufan_li
Mufan (Bill) Li
4 years
@sam_power_825 @roydanroy @junpenglao @colindcarroll Suppose your gradient is not Lipschitz, for example f(x) = x^4; then even gradient descent diverges to infinity. Also, even with ergodicity, the stationary distribution may not be nice at all. Since SGD after finitely many steps gives a finite set of point masses, the limit is likely a fractal-like set.
1
0
6
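A three-line illustration of the first point (the step size and starting point are chosen so that the overshoot condition 4 h x^2 > 2 holds):

# Gradient descent on f(x) = x^4; the gradient 4x^3 is not globally Lipschitz,
# so a fixed step size overshoots once |x| is large and the iterates blow up.
x, step = 3.0, 0.1
for k in range(6):
    x = x - step * 4 * x**3
    print(k, x)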
@mufan_li
Mufan (Bill) Li
2 years
@Yuhu_ai_ Sometimes I wonder if deep learning is gonna solve deep learning theory before deep learning theorists do
0
0
6
@mufan_li
Mufan (Bill) Li
1 year
Technically the theory is not wrong - neural nets are terrible uniform learners. However, uniform learning is the wrong goal post. We do need a better theory, and perhaps fewer people driving "disbelief" prematurely.
@ylecun
Yann LeCun
1 year
When empirical evidence clashes with a theory, it is the theory that is wrong (or misinterpreted), not the universe. The well-documented initial resistance of many ML theorists to deep learning was due to complete theory-fueled disbelief: this can't possibly work!
40
90
648
0
0
5
@mufan_li
Mufan (Bill) Li
1 year
@miniapeur Partial differential equations. So much information and consequences contained in a single line of math, really hard to appreciate with courses alone.
1
0
5
@mufan_li
Mufan (Bill) Li
4 years
Never understood how anyone expects good work without being personally invested in the work, and as a result taking rejections personally.
Tweet media one
@bremen79
Francesco Orabona
4 years
Dear PhD students with a submission to @NeurIPSConf , soon ~20% of you will receive a desk reject. Here some suggestions to deal with it in a healthy way. First, this is not personal: a paper written by you was rejected, not you. Keep your self-worth unlinked from your work. 1/5
10
99
612
1
0
5
@mufan_li
Mufan (Bill) Li
3 years
Our main result is a precise description of the distribution of the output of the network in the infinite depth-and-width limit. One key observation: the magnitude contains a log-Gaussian factor. The exact constants and Gaussian parameters can be found in the paper.
Tweet media one
1
0
5
@mufan_li
Mufan (Bill) Li
3 years
How is the infinite depth-and-width limit different? In short, each layer of width (n) carries an error term of size O(1/n), and increasing depth (d) compounds the error exponentially. At the heart of the analysis is the following "dichotomy":
Tweet media one
1
0
5
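The back-of-the-envelope version of that compounding:

\[ \Big(1 + \frac{c}{n}\Big)^{d} \;=\; \exp\!\Big( d \log\big(1 + \tfrac{c}{n}\big) \Big) \;\approx\; \exp\!\Big( \frac{c\, d}{n} \Big), \]

so the per-layer O(1/n) corrections stay O(1) exactly when the depth-to-width ratio d/n stays bounded, and blow up when depth outpaces width.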
@mufan_li
Mufan (Bill) Li
5 years
Tweet media one
0
0
5
@mufan_li
Mufan (Bill) Li
2 years
@sp_monte_carlo Gradient descent with epsilon noise can escape saddle points exponentially fast
0
1
5
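A toy instance of that claim (my own example, not the referenced result): on the saddle of f(x, y) = x^2 - y^2 at the origin, exact gradient descent started on the stable manifold never leaves it, while an epsilon-sized perturbation of the unstable coordinate gets amplified by a factor (1 + 2h) per step, so the escape is exponentially fast. Step size and noise scale below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
h, eps = 0.1, 1e-8                   # step size and noise scale (arbitrary)
z = np.array([1.0, 0.0])             # start exactly on the stable manifold

for k in range(1, 121):
    grad = np.array([2 * z[0], -2 * z[1]])           # gradient of x^2 - y^2
    z = z - h * grad + eps * rng.standard_normal(2)  # noisy gradient step
    if k % 30 == 0:
        print(k, z)   # |y| grows roughly like (1 + 2h)^k once the noise kicks in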
@mufan_li
Mufan (Bill) Li
2 years
To summarize, I want to observe that the complexity of sampling is captured by functional inequalities, a fundamental property of the distribution. This presents a generic and precise framework for understanding what makes sampling hard.
0
0
5
@mufan_li
Mufan (Bill) Li
2 years
@OmarRivasplata In Markov diffusions, it’s more natural to interpret Q as the target or base measure. If P_t is the law of the diffusion at time t and Q is the stationary measure, then KL(P_t|Q) converges to zero under a log-Sobolev inequality
0
1
5
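Quantitatively, in the usual normalization where \alpha denotes the log-Sobolev constant of Q:

\[ \mathrm{KL}(P_t \,\|\, Q) \;\le\; e^{-2\alpha t}\, \mathrm{KL}(P_0 \,\|\, Q), \]

where P_t is the law at time t of the Langevin diffusion with stationary measure Q.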
@mufan_li
Mufan (Bill) Li
5 years
@NAChristakis @sinanaral @TechCrunch > a glorified curve fitter: overfits data > media: omg this AI is outsmarting humans
0
2
5
@mufan_li
Mufan (Bill) Li
2 years
For theorists, we also show that our proposed scaling is critical: if the shape converges to the identity too slowly, we get a degenerate limit; if too fast, we get the same limit as linear networks! See Propositions 3.4 and 3.10.
1
1
4
@mufan_li
Mufan (Bill) Li
3 years
@roydanroy Last time I was honest about an issue at work during an AMA with the CEO, I pissed off some senior management and got a talking-to from my boss. So I think the comments here are extremely biased. Anonymous poll?
2
0
4
@mufan_li
Mufan (Bill) Li
4 years
@roydanroy @togelius Since the proof of the Poincaré conjecture doesn't fit in 6-8 pages, Perelman and later authors should have just written 100 papers instead. Duh.
0
0
4
@mufan_li
Mufan (Bill) Li
3 years
However, networks with skip connections introduce correlations between layers, which complicates the analysis. Surprising observation: with residual connections, the population of neurons is HYPOactivated, i.e., fewer than half of the ReLU units are active.
Tweet media one
1
0
4
@mufan_li
Mufan (Bill) Li
3 years
So we titled the paper "the future is log-Gaussian:..." just for the memes
Tweet media one
@mufan_li
Mufan (Bill) Li
3 years
There has been a flurry of work beyond the infinite-width limit. We study the infinite DEPTH-AND-WIDTH limit of ReLU nets with residual connections and see remarkable (!) agreement with STANDARD finite networks. Joint work w/ @MihaiCNica @roydanroy
Tweet media one
4
19
110
0
0
4
@mufan_li
Mufan (Bill) Li
1 year
@roydanroy @pfau @gaurav_ven @jamesgiammona I want to say in this case it is still strange. The reason is that, before taking any limits, the neurons form a Gaussian process conditional on the previous layer. For this Gaussian structure to go away, we would need the kernel to be non-constant in the large sample limit
1
0
4
@mufan_li
Mufan (Bill) Li
2 years
@PreetumNakkiran You can write all your series expansions in terms of trees so nobody can read them
Tweet media one
2
0
4
@mufan_li
Mufan (Bill) Li
3 years
@MarioKrenn6240
Mario Krenn
4 years
The number of monthly new ML + AI papers at arXiv seems to grow exponentially, with a doubling rate of 23 months. Probably will lead to problems for publishing in these fields, at some point.
Tweet media one
43
142
759
2
0
4
@mufan_li
Mufan (Bill) Li
4 years
@arvi My biggest complaints are (1) Overleaf tends to "fix" syntax errors without telling me what it did, so often it doesn't compile offline, and (2) the UI tends to be slow, especially the folder organization part. I still use it tho
1
0
4
@mufan_li
Mufan (Bill) Li
1 year
@IsomorphicPhi Brezis is really nice. Evans is more like a reference text, not a good place to learn from imo. If you really want to get interested in PDEs though, an application is probably important. For me that was the beautiful connection to SDEs
1
0
4
@mufan_li
Mufan (Bill) Li
2 years
Now you might ask: what’s so special about the Poincaré inequality? Indeed, nothing in particular. The Latała–Oleszkiewicz (LO) inequality interpolates between Poincaré (alpha = 1) and log-Sobolev (alpha = 2).
Tweet media one
2
0
4
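For reference, the LO(\alpha) family in the form I recall from this line of work (the exact normalization of the constant is an assumption here): for \alpha \in [1, 2],

\[ \sup_{p \in (1, 2)} \frac{\mathbb{E}[f^2] - \mathbb{E}[|f|^p]^{2/p}}{(2 - p)^{2(1 - 1/\alpha)}} \;\le\; C_{\mathrm{LO}}\, \mathbb{E}\big[\|\nabla f\|^2\big], \]

which recovers Poincaré at \alpha = 1 and log-Sobolev (up to constants) at \alpha = 2.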
@mufan_li
Mufan (Bill) Li
2 years
@sp_monte_carlo I don't know why it's not more celebrated but this simple construction of the modified equation (aka backward error analysis) in Hairer, Lubich, Wanner always blew my mind
Tweet media one
1
0
4
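The construction in miniature, for the explicit Euler method (the textbook computation, not necessarily the particular page in the figure): the numerical map y_{n+1} = y_n + h f(y_n) agrees, to higher order in h, with the exact time-h flow of a modified ODE

\[ \dot{\tilde y} \;=\; f(\tilde y) + h\, f_2(\tilde y) + h^2 f_3(\tilde y) + \cdots, \qquad f_2 \;=\; -\tfrac{1}{2} f' f, \]

where the correction terms are found by Taylor-expanding the modified flow and matching powers of h against the method.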
@mufan_li
Mufan (Bill) Li
4 years
Tweet media one
@roydanroy
Dan Roy
4 years
@hardmaru @NeurIPSConf Wait... am I going to have to use space on the ethical/societal implications of my generalization bound?
8
1
45
0
1
4
@mufan_li
Mufan (Bill) Li
11 months
@shortstein They are related by the OU generator: the stationary Poisson equation is exactly the second-order PDE version of Stein’s equation, and the LSI also corresponds to the same generator.
0
0
4
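Concretely, for the standard Gaussian: Stein's equation f'(x) - x f(x) = h(x) - \mathbb{E}\, h(Z) becomes, after substituting f = g',

\[ \mathcal{L} g(x) \;:=\; g''(x) - x\, g'(x) \;=\; h(x) - \mathbb{E}\, h(Z), \]

the Poisson equation for the Ornstein–Uhlenbeck generator \mathcal{L}, the same generator whose Dirichlet form appears in the Gaussian log-Sobolev inequality.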
@mufan_li
Mufan (Bill) Li
2 years
Additionally, for smooth activations, we also show that the limiting SDE can have finite-time explosions! This depends on the choice of how the shaped activation is centered, and we provide an if-and-only-if condition for choosing such a center.
Tweet media one
1
1
4
@mufan_li
Mufan (Bill) Li
2 years
@ilyaraz2 Caffeine and sugar:
Tweet media one
0
0
4
@mufan_li
Mufan (Bill) Li
3 years
@cjmaddison @sam_power_825 Some of the craziest notation I have ever seen came out as an extension of Butcher series - these are Runge-Kutta order conditions for SDE weak error
Tweet media one
1
0
4
@mufan_li
Mufan (Bill) Li
3 years
@sam_power_825 Why not both? Sometimes it’s nice to just write d\mu as shorthand notation, but for a kernel or the marginal of a joint, it’s clearer to write K(x, dy)
2
0
4
@mufan_li
Mufan (Bill) Li
1 year
A really nice result!
@hayou_soufiane
Soufiane Hayou
1 year
Q: What happens to the neural covariance when both Width and Depth are taken to infinity? A: it **depends** how you take that limit. However, for ResNets, we show (joint work with @TheGregYang ) that you always get the same covariance structure.. Link:
Tweet media one
3
29
154
0
0
4
@mufan_li
Mufan (Bill) Li
2 years
More generally, appropriately scaled Markov chains converge to solutions of SDEs - this is the main technique we use in this work.
Tweet media one
1
0
4
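A minimal sketch of that scaling (an Euler-type chain, my own toy example rather than the chain from the paper): increments with O(1/n) drift and O(1/sqrt(n)) noise, run for about n*t steps, approximate the SDE dX_t = -X_t dt + dB_t, whose stationary variance is 1/2.

import numpy as np

rng = np.random.default_rng(2)
n, t_final = 1000, 10.0      # n controls the scaling; t_final is SDE time
n_steps = int(n * t_final)

# Markov chain with O(1/n) drift increments and O(1/sqrt(n)) noise increments:
# X_{k+1} = X_k - X_k / n + xi_k / sqrt(n).  As n grows, this converges to the
# Ornstein-Uhlenbeck SDE dX_t = -X_t dt + dB_t (stationary variance 1/2).
n_chains = 5000
X = np.zeros(n_chains)
for _ in range(n_steps):
    X = X - X / n + rng.standard_normal(n_chains) / np.sqrt(n)

print(X.var())   # should be close to 0.5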
@mufan_li
Mufan (Bill) Li
3 months
@ccanonne_ Yes, these are from an older version of Dmitry Panchenko's excellent lecture notes on probability theory. Here's the most updated version posted on his website, Strassen's Theorem is in Section 4.3
1
0
4
@mufan_li
Mufan (Bill) Li
2 years
After some simplifications, our main result implies the following runtime complexity in terms of Rényi divergence under Poincaré
Tweet media one
2
0
4
@mufan_li
Mufan (Bill) Li
11 months
Obviously this can only mean compression algorithms are sentient
@bodonoghue85
Brendan O'Donoghue
11 months
I think it's reasonable at this stage to call for a moratorium on the use of gzip and to nuke any datacenters we suspect of using illicit compression technology.
10
36
443
1
0
3
@mufan_li
Mufan (Bill) Li
4 years
@roydanroy For a lot of exercises like compound lifts, if you can’t breathe fully, then it’s hard to get the full exercise. Having to breathe through a mask in between squat/deadlift sets also feels terrible :/ I’d rather not go if I have to wear a mask
0
0
3
@mufan_li
Mufan (Bill) Li
1 year
@fentpot @miniapeur That’s the beauty of it. Even without an analytical solution, the PDE characterizes a ton of properties of the solution. Usefulness depends on whether a PDE naturally arises in your work.
0
0
3
@mufan_li
Mufan (Bill) Li
3 years
@wgrathwohl Imagine making a career out of making incremental changes to other people's methods, adding a ridiculous amount of trial and error, just to show it to a room full of other nerds like you after you beat some SOTA score with pure luck. Yes I'm talking about speed running.
0
0
3
@mufan_li
Mufan (Bill) Li
3 months
@Karthikvaz @sp_monte_carlo It’s helpful to think of the space of probability distributions as a manifold equipped with the Wasserstein metric, with the KL divergence as the potential. Then LSI is the PL inequality, which is equivalent to exponential decay of KL. Strong convexity is equivalent to exponential decay of W2.
1
0
3
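The dictionary, spelled out: on the Wasserstein space, F(P) = KL(P \| Q) has the Langevin diffusion targeting Q as its gradient flow, its squared gradient norm is the relative Fisher information I(P \| Q), and the log-Sobolev inequality

\[ \mathrm{I}(P \,\|\, Q) \;\ge\; 2\alpha\, \mathrm{KL}(P \,\|\, Q) \]

is exactly the Polyak–Łojasiewicz condition \tfrac{1}{2}\|\nabla F\|^2 \ge \alpha (F - F^*) for this F (with F^* = 0), which along the flow gives \mathrm{KL}(P_t \| Q) \le e^{-2\alpha t}\, \mathrm{KL}(P_0 \| Q).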
@mufan_li
Mufan (Bill) Li
2 years
@ccanonne_ You can reverse the order of the summation so you get 1 + 1/e + 1/e^2 + …
0
0
3
@mufan_li
Mufan (Bill) Li
26 days
@moskitos_bite I really enjoyed Villani’s topics in optimal transportation as an intro. Many proofs had a short special case, which is sufficient as a first read.
1
0
3
@mufan_li
Mufan (Bill) Li
4 years
Tweet media one
@politico
POLITICO
4 years
Hillary Clinton is starting a podcast
1K
496
2K
0
0
3
@mufan_li
Mufan (Bill) Li
6 months
@PreetumNakkiran I actually think it’s because OT is the most natural language to study probability and stochastic processes. Wasserstein metric naturally induces a Riemannian manifold structure on the space of probability distributions, and this fact is incredibly underrated
1
0
3
@mufan_li
Mufan (Bill) Li
11 months
@ChinmayaKausik Log factors are morally constant 😉
1
0
3
@mufan_li
Mufan (Bill) Li
3 years
@shortstein Isn't it just the total variation (of general signed measures) of the difference of two probability measures?
2
0
3
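For the record, the standard identity presumably being referred to: for probability measures P and Q,

\[ d_{\mathrm{TV}}(P, Q) \;=\; \sup_{A} \big| P(A) - Q(A) \big| \;=\; \tfrac{1}{2}\, |P - Q|(\Omega), \]

where |P - Q| is the total variation measure of the signed measure P - Q; so yes, up to the factor of one half fixed by convention, the TV distance is the total variation norm of the difference.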