Jascha Sohl-Dickstein Profile
Jascha Sohl-Dickstein

@jaschasd

Followers
23K
Following
1K
Media
75
Statuses
544

Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.

San Francisco
Joined August 2009
@jaschasd
Jascha Sohl-Dickstein
3 years
My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law. 🧵
42
187
987
@jaschasd
Jascha Sohl-Dickstein
19 days
I will be attending ICML next week. Reach out (by email) if you'd like to chat! About Anthropic / research / life. I'm especially interested in meeting grad students who can teach me new research ideas.
7
10
273
@jaschasd
Jascha Sohl-Dickstein
3 months
This is great, hearing Yang's thought process and motivations for his score matching/diffusion research. (I had forgotten that I tried to convince him that score matching was too local to be useful for generative modeling :/).
@slaterstich
Slater Stich
3 months
Very excited to share our interview with @DrYangSong. This is Part 2 of our history of diffusion series — score matching, the SDE/ODE interpretation, consistency models, and more. Enjoy!
8
11
112
@jaschasd
Jascha Sohl-Dickstein
6 months
Slater is an excellent interviewer. This was a lot of fun to do. I'm even more excited for the upcoming interviews with @DrYangSong and @sedielem!
@slaterstich
Slater Stich
6 months
Very excited to share our interview with @jaschasd on the history of diffusion models — from his original 2015 paper inventing them, to the GAN "ice age", to the resurgence in diffusion starting with DDPM. Enjoy!
1
6
72
@jaschasd
Jascha Sohl-Dickstein
1 year
This is an excellent paper that ties many threads together around scaling models and hyperparameters.
3
3
55
@jaschasd
Jascha Sohl-Dickstein
1 year
This was one of the most research-enabling libraries I used at Google. If you want to try out LLM ideas with a simple, clean, JAX codebase, this is for you.
@peterjliu
Peter J. Liu
1 year
We recently open-sourced a relatively minimal implementation example of Transformer language model training in JAX, called NanoDO. If you stick to vanilla JAX components, the code is relatively straightforward to read -- the model file is <150 lines. We found it useful as a…
1
6
77
@jaschasd
Jascha Sohl-Dickstein
1 year
This was a fun project! If you could train an LLM over text arithmetically compressed using a smaller LLM as a probabilistic model of text, it would be really good. Text would be represented with far fewer tokens, and inference would be way faster and cheaper. The hard part is…
@noahconst
Noah Constant
1 year
Ever wonder why we don’t train LLMs over highly compressed text? Turns out it’s hard to make it work. Check out our paper for some progress that we’re hoping others can build on. With @blester125, @hoonkp, @alemi, Jeffrey Pennington, @ada_rob, @jaschasd.
3
10
103
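Aside (not from the tweets above): the mechanism referred to is arithmetic coding driven by a language model's next-symbol probabilities, where a better model yields a shorter code. Below is a minimal sketch for illustration only; a fixed unigram table stands in for the small LLM, the alphabet and text are made up, and a real coder would use context-conditional probabilities and integer arithmetic rather than floats.

```python
from math import log2

# Hypothetical stand-in for the small LM: P(next char), independent of context.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

def encode(text, probs):
    """Arithmetic coding: narrow [low, high) once per symbol; the final
    interval uniquely identifies the text."""
    low, high = 0.0, 1.0
    for ch in text:
        span = high - low
        cum = 0.0
        for sym, p in probs.items():
            if sym == ch:
                high = low + span * (cum + p)
                low = low + span * cum
                break
            cum += p
    return low, high

low, high = encode("abacab", probs)
# Ideal code length is -log2(interval width): a better model assigns higher
# probability per symbol, shrinks the interval less, and so needs fewer bits.
print(f"interval width {high - low:.3e}, ~{-log2(high - low):.1f} bits")
```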
@jaschasd
Jascha Sohl-Dickstein
1 year
RT @trishume: Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations but we can…
0
36
0
@jaschasd
Jascha Sohl-Dickstein
1 year
I don't have a SoundCloud, but I did join Anthropic last week, and so far it has exceeded my (high) expectations. I would strongly recommend working there (and using Claude). *this project not done at Anthropic -- this was recreational machine learning on my own time.
12
10
323
@jaschasd
Jascha Sohl-Dickstein
1 year
The best performing hyperparameters are typically at the edge of stability -- so when you optimize neural network hyperparameters, you are contending with hyperparameter landscapes that look like this.
22
22
418
@jaschasd
Jascha Sohl-Dickstein
1 year
So it shouldn't (post-hoc) be a surprise that hyperparameter landscapes are fractal. This is a general phenomenon: in these panes we see fractal hyperparameter landscapes for every neural network configuration I tried, including deep linear networks.
[Image: fractal hyperparameter landscapes for each network configuration tried, including deep linear networks]
7
29
485
@jaschasd
Jascha Sohl-Dickstein
1 year
In both cases the function iteration can produce outputs that either diverge to infinity or remain happily bounded depending on those hyperparameters. Fractals are often defined by the boundary between hyperparameters where function iteration diverges or remains bounded.
2
3
243
@jaschasd
Jascha Sohl-Dickstein
1 year
There are similarities between the way in which many fractals are generated, and the way in which we train neural networks. Both involve repeatedly applying a function to its own output. In both cases, that function has hyperparameters that control its behavior.
5
13
372
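Aside (my own illustration, not from the thread): the textbook instance of the recipe described above is to iterate z <- z*z + c and color each "hyperparameter" c by whether the iteration escapes to infinity or stays bounded; the boundary between the two regions is a fractal. The escape threshold, iteration count, and grid below are arbitrary choices.

```python
import numpy as np

def diverges(c, steps=200, bound=2.0):
    """Return True if iterating z <- z*z + c escapes past `bound`."""
    z = 0.0 + 0.0j
    for _ in range(steps):
        z = z * z + c          # repeatedly apply the map to its own output
        if abs(z) > bound:
            return True        # iteration diverges for this "hyperparameter" c
    return False               # still bounded after `steps` iterations

# Sweep a grid over the complex "hyperparameter" c and print which cells stay
# bounded ('#') versus diverge ('.'); the boundary between the two is fractal.
for cim in np.linspace(1.2, -1.2, 40):
    print("".join("#" if not diverges(complex(cre, cim)) else "."
                  for cre in np.linspace(-2.0, 0.6, 80)))
```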
@jaschasd
Jascha Sohl-Dickstein
1 year
The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful! Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
24
167
1K
@jaschasd
Jascha Sohl-Dickstein
1 year
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges.
274
2K
10K
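Aside (a toy reconstruction, not the code behind the figures in this thread): a dense grid search like the one described above can be mocked up with a tiny two-layer tanh network trained by full-batch gradient descent, sweeping learning rate against the mean of the parameter-initialization distribution and marking each cell as converged or diverged. The network size, synthetic data, thresholds, and step count below are arbitrary; a much finer grid and longer training would be needed to resolve the fractal boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))          # synthetic inputs
y = rng.normal(size=(64, 1))          # synthetic regression targets

def trains(lr, init_mean, steps=200):
    """Run full-batch gradient descent; report whether the loss stays bounded."""
    W1 = init_mean + 0.5 * rng.normal(size=(4, 8))
    W2 = init_mean + 0.5 * rng.normal(size=(8, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        gW2 = h.T @ err / len(X)
        gW1 = X.T @ ((err @ W2.T) * (1 - h**2)) / len(X)
        W1 -= lr * gW1
        W2 -= lr * gW2
        if not np.isfinite(err).all() or np.abs(err).max() > 1e6:
            return False              # training diverged for this (lr, init_mean)
    return True                       # loss stayed bounded: call it trainable

lrs = np.logspace(-2, 1, 40)          # columns: learning rate
means = np.linspace(-1.0, 1.0, 40)    # rows: initialization mean
for m in means:
    print("".join("." if trains(lr, m) else "#" for lr in lrs))
```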
@jaschasd
Jascha Sohl-Dickstein
1 year
I’ve been daydreaming about an AI+audio product that I think recently became possible: virtual noise canceling headphones. I hate loud background noise -- BART trains, airline cabins, road noise, … 🙉 I would buy the heck out of this product, and would love it if it were built.
8
4
79
@jaschasd
Jascha Sohl-Dickstein
2 years
All the appointments are filled. Will see how the meetings go, and evaluate doing this again. I'm looking forward to finding out what people are interested in!
1
0
11
@jaschasd
Jascha Sohl-Dickstein
2 years
I'm running an experiment, and holding some public office hours (inspired by seeing @kchonyc do something similar). Come talk with me about anything! Ask for advice on your research or startup or career or I suppose personal life, brainstorm new research ideas, complain about…
6
9
142
@jaschasd
Jascha Sohl-Dickstein
2 years
An excellent project making evolution strategies much more efficient for computing gradients in dynamical systems.
@OscarLi101
Oscar Li
2 years
📝Quiz time: when you have an unrolled computation graph (see figure below), how would you compute the unrolling parameters' gradients? If your answer only contains Backprop, now it’s time to add a new method to your gradient estimation toolbox!
[Image: an unrolled computation graph]
0
4
38
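Aside (my own illustration of the vanilla baseline, not the method in the quoted paper): evolution strategies estimates gradients of a black-box unrolled loss by probing it with antithetic Gaussian perturbations of the parameters, which sidesteps backpropagation through the unroll entirely. The toy "unrolled" loss and all constants below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def unrolled_loss(theta):
    """Hypothetical stand-in for a loss computed by unrolling a dynamical
    system (or inner optimization) for many steps."""
    x = np.array([1.0, -1.0])
    for _ in range(50):
        x = np.tanh(x * theta)        # theta plays the role of an unrolling parameter
    return float(np.sum(x ** 2))

def es_grad(loss, theta, sigma=0.1, n_pairs=256):
    """Antithetic ES: grad ~ E[(L(theta + s*eps) - L(theta - s*eps)) * eps / (2*s)]."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.normal(size=theta.shape)
        grad += (loss(theta + sigma * eps) - loss(theta - sigma * eps)) * eps
    return grad / (2.0 * sigma * n_pairs)

theta = np.array([0.5, 2.0])
print("ES gradient estimate:", es_grad(unrolled_loss, theta))
```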
@jaschasd
Jascha Sohl-Dickstein
2 years
RT @mlbileschi_pub: 2+2=5?. “LLMs are not Robust to Adversarial Arithmetic” a new paper from our team @GoogleDeepMind with @bucketofkets, @….
0
11
0