Jascha Sohl-Dickstein Profile
Jascha Sohl-Dickstein

@jaschasd

Followers
23K
Following
1K
Media
75
Statuses
544

Member of the technical staff @ Anthropic. Most (in)famous for inventing diffusion models. AI + physics + neuroscience + dynamics.

San Francisco
Joined August 2009
@jaschasd
Jascha Sohl-Dickstein
3 years
My first blog post ever! Be harsh, but, you know, constructive. Too much efficiency makes everything worse: overfitting and the strong version of Goodhart's law. 🧵
42
187
987
@jaschasd
Jascha Sohl-Dickstein
19 days
I will be attending ICML next week. Reach out (by email) if you'd like to chat! About Anthropic / research / life. I'm especially interested in meeting grad students who can teach me new research ideas.
7
10
273
@jaschasd
Jascha Sohl-Dickstein
3 months
This is great, hearing Yang's thought process and motivations for his score matching/diffusion research. (I had forgotten that I tried to convince him that score matching was too local to be useful for generative modeling :/).
@slaterstich
Slater Stich
3 months
Very excited to share our interview with @DrYangSong. This is Part 2 of our history of diffusion series — score matching, the SDE/ODE interpretation, consistency models, and more. Enjoy!
8
11
112
@jaschasd
Jascha Sohl-Dickstein
6 months
Slater is an excellent interviewer. This was a lot of fun to do. I'm even more excited for the upcoming interviews with @DrYangSong and @sedielem!
@slaterstich
Slater Stich
6 months
Very excited to share our interview with @jaschasd on the history of diffusion models — from his original 2015 paper inventing them, to the GAN "ice age", to the resurgence in diffusion starting with DDPM. Enjoy!
1
6
72
@jaschasd
Jascha Sohl-Dickstein
1 year
This is an excellent paper that ties many threads together around scaling models and hyperparameters.
3
3
55
@jaschasd
Jascha Sohl-Dickstein
1 year
This was one of the most research-enabling libraries I used at Google. If you want to try out LLM ideas with a simple, clean, JAX codebase, this is for you.
@peterjliu
Peter J. Liu
1 year
We recently open-sourced a relatively minimal implementation example of Transformer language model training in JAX, called NanoDO. If you stick to vanilla JAX components, the code is relatively straightforward to read -- the model file is <150 lines. We found it useful as a…
1
6
77
@jaschasd
Jascha Sohl-Dickstein
1 year
This was a fun project! If you could train an LLM over text arithmetically compressed using a smaller LLM as a probabilistic model of text, it would be really good. Text would be represented with far fewer tokens, and inference would be way faster and cheaper. The hard part is…
@noahconst
Noah Constant
1 year
Ever wonder why we don’t train LLMs over highly compressed text? Turns out it’s hard to make it work. Check out our paper for some progress that we’re hoping others can build on. With @blester125, @hoonkp, @alemi, Jeffrey Pennington, @ada_rob, @jaschasd.
3
10
103
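Aside (not from the tweets above): the mechanism referred to is arithmetic coding driven by a language model's next-symbol probabilities, where a better model yields a shorter code. Below is a minimal sketch for illustration only; a fixed unigram table stands in for the small LLM, the alphabet and text are made up, and a real coder would use context-conditional probabilities and integer arithmetic rather than floats.

```python
from math import log2

# Hypothetical stand-in for the small LM: P(next char), independent of context.
probs = {"a": 0.5, "b": 0.3, "c": 0.2}

def encode(text, probs):
    """Arithmetic coding: narrow [low, high) once per symbol; the final
    interval uniquely identifies the text."""
    low, high = 0.0, 1.0
    for ch in text:
        span = high - low
        cum = 0.0
        for sym, p in probs.items():
            if sym == ch:
                high = low + span * (cum + p)
                low = low + span * cum
                break
            cum += p
    return low, high

low, high = encode("abacab", probs)
# Ideal code length is -log2(interval width): a better model assigns higher
# probability per symbol, shrinks the interval less, and so needs fewer bits.
print(f"interval width {high - low:.3e}, ~{-log2(high - low):.1f} bits")
```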
@jaschasd
Jascha Sohl-Dickstein
1 year
RT @trishume: Here's Claude 3 Haiku running at >200 tokens/s (>2x as fast as prod)! We've been working on capacity optimizations but we can…
0
36
0
@jaschasd
Jascha Sohl-Dickstein
1 year
I don't have a SoundCloud, but I did join Anthropic last week, and so far it has exceeded my (high) expectations. I would strongly recommend working there (and using Claude). *this project not done at Anthropic -- this was recreational machine learning on my own time.
12
10
323
@jaschasd
Jascha Sohl-Dickstein
1 year
The best performing hyperparameters are typically at the edge of stability -- so when you optimize neural network hyperparameters, you are contending with hyperparameter landscapes that look like this.
22
22
418
@jaschasd
Jascha Sohl-Dickstein
1 year
So it shouldn't (post-hoc) be a surprise that hyperparameter landscapes are fractal. This is a general phenomenon: in these panes we see fractal hyperparameter landscapes for every neural network configuration I tried, including deep linear networks.
[Image: fractal hyperparameter landscapes for each network configuration tried, including deep linear networks]
7
29
485
@jaschasd
Jascha Sohl-Dickstein
1 year
In both cases the function iteration can produce outputs that either diverge to infinity or remain happily bounded depending on those hyperparameters. Fractals are often defined by the boundary between hyperparameters where function iteration diverges or remains bounded.
2
3
243
@jaschasd
Jascha Sohl-Dickstein
1 year
There are similarities between the way in which many fractals are generated, and the way in which we train neural networks. Both involve repeatedly applying a function to its own output. In both cases, that function has hyperparameters that control its behavior.
5
13
372
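Aside (my own illustration, not from the thread): the textbook instance of the recipe described above is to iterate z <- z*z + c and color each "hyperparameter" c by whether the iteration escapes to infinity or stays bounded; the boundary between the two regions is a fractal. The escape threshold, iteration count, and grid below are arbitrary choices.

```python
import numpy as np

def diverges(c, steps=200, bound=2.0):
    """Return True if iterating z <- z*z + c escapes past `bound`."""
    z = 0.0 + 0.0j
    for _ in range(steps):
        z = z * z + c          # repeatedly apply the map to its own output
        if abs(z) > bound:
            return True        # iteration diverges for this "hyperparameter" c
    return False               # still bounded after `steps` iterations

# Sweep a grid over the complex "hyperparameter" c and print which cells stay
# bounded ('#') versus diverge ('.'); the boundary between the two is fractal.
for cim in np.linspace(1.2, -1.2, 40):
    print("".join("#" if not diverges(complex(cre, cim)) else "."
                  for cre in np.linspace(-2.0, 0.6, 80)))
```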
@jaschasd
Jascha Sohl-Dickstein
1 year
The boundary between trainable and untrainable neural network hyperparameter configurations is *fractal*! And beautiful! Here is a grid search over a different pair of hyperparameters -- this time learning rate and the mean of the parameter initialization distribution.
24
167
1K
@jaschasd
Jascha Sohl-Dickstein
1 year
Have you ever done a dense grid search over neural network hyperparameters? Like a *really dense* grid search? It looks like this (!!). Bluish colors correspond to hyperparameters for which training converges, reddish colors to hyperparameters for which training diverges.
274
2K
10K
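Aside (a toy reconstruction, not the code behind the figures in this thread): a dense grid search like the one described above can be mocked up with a tiny two-layer tanh network trained by full-batch gradient descent, sweeping learning rate against the mean of the parameter-initialization distribution and marking each cell as converged or diverged. The network size, synthetic data, thresholds, and step count below are arbitrary; a much finer grid and longer training would be needed to resolve the fractal boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))          # synthetic inputs
y = rng.normal(size=(64, 1))          # synthetic regression targets

def trains(lr, init_mean, steps=200):
    """Run full-batch gradient descent; report whether the loss stays bounded."""
    W1 = init_mean + 0.5 * rng.normal(size=(4, 8))
    W2 = init_mean + 0.5 * rng.normal(size=(8, 1))
    for _ in range(steps):
        h = np.tanh(X @ W1)
        err = h @ W2 - y
        gW2 = h.T @ err / len(X)
        gW1 = X.T @ ((err @ W2.T) * (1 - h**2)) / len(X)
        W1 -= lr * gW1
        W2 -= lr * gW2
        if not np.isfinite(err).all() or np.abs(err).max() > 1e6:
            return False              # training diverged for this (lr, init_mean)
    return True                       # loss stayed bounded: call it trainable

lrs = np.logspace(-2, 1, 40)          # columns: learning rate
means = np.linspace(-1.0, 1.0, 40)    # rows: initialization mean
for m in means:
    print("".join("." if trains(lr, m) else "#" for lr in lrs))
```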
@jaschasd
Jascha Sohl-Dickstein
1 year
I’ve been daydreaming about an AI+audio product that I think recently became possible: virtual noise canceling headphones. I hate loud background noise -- BART trains, airline cabins, road noise, … 🙉 I would buy the heck out of this product, and would love it if it were built.
8
4
79
@jaschasd
Jascha Sohl-Dickstein
2 years
All the appointments are filled. Will see how the meetings go, and evaluate doing this again. I'm looking forward to finding out what people are interested in!
1
0
11
@jaschasd
Jascha Sohl-Dickstein
2 years
I'm running an experiment, and holding some public office hours (inspired by seeing @kchonyc do something similar). Come talk with me about anything! Ask for advice on your research or startup or career or I suppose personal life, brainstorm new research ideas, complain about…
6
9
142
@jaschasd
Jascha Sohl-Dickstein
2 years
An excellent project making evolution strategies much more efficient for computing gradients in dynamical systems.
@OscarLi101
Oscar Li
2 years
📝Quiz time: when you have an unrolled computation graph (see figure below), how would you compute the unrolling parameters' gradients? If your answer only contains Backprop, now it’s time to add a new method to your gradient estimation toolbox!
[Image: an unrolled computation graph]
0
4
38
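Aside (my own illustration of the vanilla baseline, not the method in the quoted paper): evolution strategies estimates gradients of a black-box unrolled loss by probing it with antithetic Gaussian perturbations of the parameters, which sidesteps backpropagation through the unroll entirely. The toy "unrolled" loss and all constants below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def unrolled_loss(theta):
    """Hypothetical stand-in for a loss computed by unrolling a dynamical
    system (or inner optimization) for many steps."""
    x = np.array([1.0, -1.0])
    for _ in range(50):
        x = np.tanh(x * theta)        # theta plays the role of an unrolling parameter
    return float(np.sum(x ** 2))

def es_grad(loss, theta, sigma=0.1, n_pairs=256):
    """Antithetic ES: grad ~ E[(L(theta + s*eps) - L(theta - s*eps)) * eps / (2*s)]."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.normal(size=theta.shape)
        grad += (loss(theta + sigma * eps) - loss(theta - sigma * eps)) * eps
    return grad / (2.0 * sigma * n_pairs)

theta = np.array([0.5, 2.0])
print("ES gradient estimate:", es_grad(unrolled_loss, theta))
```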
@jaschasd
Jascha Sohl-Dickstein
2 years
RT @mlbileschi_pub: 2+2=5?. “LLMs are not Robust to Adversarial Arithmetic” a new paper from our team @GoogleDeepMind with @bucketofkets, @….
0
11
0