Brian L Trippe Profile
Brian L Trippe

@brianltrippe

Followers: 1,684
Following: 452
Media: 8
Statuses: 92

Bayesian statistics at @Columbia, machine learning for protein design @UWproteindesign

New York City, NY
Joined December 2016
@brianltrippe
Brian L Trippe
18 days
I am pleased to share that I have accepted a position @Stanford to start as an assistant professor in the department of statistics, with an affiliation in @StanfordData, this fall!
85
29
937
@brianltrippe
Brian L Trippe
2 years
Can ML build diverse protein scaffolds around functional motifs? We find equivariant diffusion models and Sequential Monte Carlo may help! Joint work with Jason Yim, and coauthors Doug Tischer, @ta_broderick, David Baker, @BarzilayRegina and Tommi Jaakkola
8
91
312
@brianltrippe
Brian L Trippe
2 years
How can we collect good enough data for machine learning-driven protein design? We show that random numbers are part of the picture. Work with the David Baker lab (including @erika_alden_d) and MSRNE (with @KevinKaichuang and @lorin_crawford). (1/4)
2
30
139
@brianltrippe
Brian L Trippe
1 year
Working on this project has been a blast and we're thrilled to finally share it. An exciting future for protein design with RoseTTAFold Diffusion lies ahead!
@_JosephWatson
Joseph Watson
1 year
DALL-E’s amazing images are popping up all over the web. That software uses something called a diffusion model, which is trained to remove noise from static until a clear picture is formed. Turns out diffusion models can design proteins too!
34
530
2K
1
7
44
@brianltrippe
Brian L Trippe
5 years
I'm excited to share a new paper with @jhhhuggins, Raj Agrawal and @ta_broderick coming out @icmlconf! We speed up Bayesian inference in high-dimensional generalized linear models using low-rank approximations of data, with a method we call "LR-GLM". 1/5
1
9
29
@brianltrippe
Brian L Trippe
10 months
Come check out our poster today at 4pm! We find that with a sequential Monte Carlo technique known as twisting, we can use conditional sampling heuristics, like reconstruction guidance, to get exact inferences with additional compute.
@chris_naesseth
Christian Andersson Naesseth
10 months
Twisted Diffusion Sampling - Practical and Asymptotically Exact Conditional Sampling for Diffusion Models. Theoretically grounded conditional sampling for diffusion models with applications to inpainting, Bayesian inverse problems, protein design, ...
2
8
45
0
3
28
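For readers unfamiliar with twisting, here is a minimal, generic sketch of a twisted SMC sampler of the kind described in the tweet above. Every ingredient passed in (the guided proposal, the reverse kernel density, the twisting functions approximating p(y | x_t)) is a hypothetical stand-in; this is not the TDS implementation.

```python
import numpy as np

def twisted_smc(n_particles, T, sample_prior, propose, log_reverse, log_proposal, log_twist, rng):
    """Generic twisted SMC for conditional sampling from a diffusion-style model.

    sample_prior(n)               -> n samples of x_T
    propose(x_t, t)               -> proposed x_{t-1} (e.g. a reconstruction-guided step)
    log_reverse(x_prev, x_t, t)   -> log p(x_{t-1} | x_t) under the unconditional model
    log_proposal(x_prev, x_t, t)  -> log q(x_{t-1} | x_t, y) of the guided proposal
    log_twist(x, t)               -> log psi_t(x), approximating log p(y | x_t); exact at t = 0
    """
    x = sample_prior(n_particles)                 # particles at time T
    logw = log_twist(x, T)                        # initial twisted weights
    for t in range(T, 0, -1):
        # Resample to equal weights.
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(n_particles, size=n_particles, p=w)]
        # Propagate with the guided proposal.
        x_new = propose(x, t)
        # Incremental twisted weight:
        #   p(x_{t-1} | x_t) psi_{t-1}(x_{t-1}) / ( q(x_{t-1} | x_t, y) psi_t(x_t) )
        logw = (log_reverse(x_new, x, t) + log_twist(x_new, t - 1)
                - log_proposal(x_new, x, t) - log_twist(x, t))
        x = x_new
    w = np.exp(logw - logw.max()); w /= w.sum()
    return x, w                                   # weighted approximate conditional samples
```

Under this reading, a conditional sampling heuristic like reconstruction guidance supplies the proposal, the twisting functions correct for its error, and the weighted particles approach the exact conditional as the particle count grows.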
@brianltrippe
Brian L Trippe
3 years
We (@skdeshpande91, @ta_broderick and I) have a new pre-print out now! It’s called “Confidently comparing estimators with the c-value”.
Tweet media one
1
2
19
@brianltrippe
Brian L Trippe
2 years
The motif-scaffolding problem is central in protein design: given a target motif (a structural fragment conferring function), construct scaffolds that support it. We propose a generative modeling approach to this problem with two steps.
Tweet media one
1
3
17
@brianltrippe
Brian L Trippe
2 years
The first step is modeling a distribution over protein structures. We develop ProtDiff, which tailors equivariant graph neural networks for diffusion probabilistic models (DPMs) on protein backbones. ProtDiff samples diverse topologies, including structures not in the PDB.
Tweet media one
1
1
13
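To make "diffusion probabilistic models on protein backbones" concrete, below is a generic DDPM noise-prediction training step. The model argument is an untyped placeholder; ProtDiff's actual equivariant graph neural network, noise schedule, and backbone parameterization are not reproduced here.

```python
import torch

def ddpm_loss(model, x0, alphas_cumprod):
    """Standard noise-prediction objective for a diffusion probabilistic model.

    x0:              (batch, n_residues, 3) tensor of backbone coordinates
    alphas_cumprod:  (T,) tensor of cumulative products of the noise schedule
    model(x_t, t):   placeholder network predicting the noise added to x_0
    """
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a = alphas_cumprod[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps    # forward-noised backbone at step t
    return ((model(x_t, t) - eps) ** 2).mean()    # mean squared error on the noise
```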
@brianltrippe
Brian L Trippe
2 years
The second step, SMCDiff, samples scaffolds from the conditionals of ProtDiff, given a motif. Naive inpainting fails to generate long scaffolds. So SMCDiff directly targets conditionals of DPMs. It’s the first method to guarantee DPM conditional samples in a large-compute limit.
Tweet media one
1
1
12
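A rough sketch of the particle-filtering idea described above, as I read the tweet: the motif coordinates are pinned to a forward-noised copy of the motif at every step, and particles are reweighted by how plausible that pinned motif is under the model's reverse kernel. All function names and signatures are hypothetical stand-ins, not the released SMCDiff code, and the exact ordering of steps may differ.

```python
import numpy as np

def smc_motif_scaffold(motif, dim, n_particles, T, forward_noise, reverse_sample,
                       motif_log_reverse, rng):
    """Sketch of SMC-based motif-conditioned sampling from a backbone diffusion model.

    motif:                        (d_m,) fixed motif coordinates x^M_0
    forward_noise(x0, t)          -> a sample of x_t from the forward process q(x_t | x_0)
    reverse_sample(x_t, t)        -> a sample of x_{t-1} from the learned reverse kernel
    motif_log_reverse(m, x_t, t)  -> log p_theta(x^M_{t-1} = m | x_t), evaluated per particle
    """
    d_m = motif.shape[0]
    # Forward-diffuse the motif once, so each timestep has a consistent noised copy.
    motif_traj = [motif] + [forward_noise(motif, t) for t in range(1, T + 1)]
    # Initialize particles: motif slots pinned, scaffold slots from the prior.
    x = rng.standard_normal((n_particles, dim))
    x[:, :d_m] = motif_traj[T]
    for t in range(T, 0, -1):
        # Weight by how plausible the pinned motif at t-1 is under the reverse kernel.
        logw = motif_log_reverse(motif_traj[t - 1], x, t)
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(n_particles, size=n_particles, p=w)]   # resample
        x = reverse_sample(x, t)                                 # propagate all coordinates
        x[:, :d_m] = motif_traj[t - 1]                           # re-pin the motif
    return x   # approximate samples of scaffolds consistent with the motif
```

The contrast with naive inpainting (replacement without reweighting) is the resampling step: particles whose scaffolds are inconsistent with the pinned motif are pruned rather than carried forward.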
@brianltrippe
Brian L Trippe
2 years
We believe generative modeling is a promising direction for the motif-scaffolding problem. We plan to release code once we have made more progress. Check out our preprint to learn more!
2
1
12
@brianltrippe
Brian L Trippe
2 years
We model protein backbones only and leverage recent work on fixed-backbone sequence design. We use ProteinMPNN (link) to get sequences and validate them against AlphaFold predictions (without MSA) by TM-score. We call this metric the self-consistency TM-score (scTM).
1
1
10
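A sketch of the self-consistency evaluation described above. design_sequences, predict_structure_single_seq, and tm_score are hypothetical wrappers around ProteinMPNN, AlphaFold run without an MSA, and a TM-score tool; the real pipeline's interfaces, and the number of designed sequences, may differ.

```python
def sc_tm(generated_backbone, design_sequences, predict_structure_single_seq, tm_score, n_seqs=8):
    """Self-consistency TM-score: does any designed sequence fold back to the sampled backbone?"""
    scores = []
    for seq in design_sequences(generated_backbone, n=n_seqs):    # e.g. ProteinMPNN
        predicted = predict_structure_single_seq(seq)              # e.g. AlphaFold, no MSA
        scores.append(tm_score(generated_backbone, predicted))     # structural agreement in [0, 1]
    return max(scores)   # a later tweet treats scTM > 0.5 as agreement
```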
@brianltrippe
Brian L Trippe
2 years
Our samples often agree with AlphaFold predictions (scTM > 0.5) and those that agree exhibit a diverse array of topologies across different lengths.
Tweet media one
1
1
9
@brianltrippe
Brian L Trippe
2 years
Pseudorandom numbers are crucial in computer science (e.g. cryptography) and statistics (e.g. randomization in trials), but rarely feature in biological assays. So it’s neat to find that they’re useful here too! (4/4)
1
1
6
@brianltrippe
Brian L Trippe
2 years
Tagging @jyim0 as well, an equal driver of this work!
0
0
6
@brianltrippe
Brian L Trippe
2 years
But we can get rid of this bias if we use random numbers to flip a (biased) coin to choose how to bin each cell. We show how this works with a mix of statistical theory, simulations, and wet-lab experiments. (3/4)
1
1
5
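An illustrative toy of the principle, not the paper's actual sort-seq scheme: deterministic rounding of a continuous per-cell signal into bins is systematically biased, while a biased-coin assignment between the two neighboring bins is unbiased in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100_000)          # latent per-cell signal (e.g. log-fluorescence)

det_bin = np.floor(x)                         # deterministic binning
p_up = x - np.floor(x)                        # biased coin: probability of taking the upper bin
rand_bin = np.floor(x) + (rng.uniform(size=x.size) < p_up)

print(np.mean(det_bin - x))                   # ~ -0.5: systematic bias
print(np.mean(rand_bin - x))                  # ~ 0: unbiased in expectation
```

In the assay, the analogous move is to randomize each cell's bin assignment with probabilities chosen so that the expected readout is unbiased for the quantity of interest.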
@brianltrippe
Brian L Trippe
2 years
By adding randomness to sort-seq assays (fluorescence-activated cell sorting + high-throughput sequencing), we can get precise multiplexed measurements. This solves an open problem in sort-seq assays: deterministic binning introduces systematic bias that limits precision. (2/4)
1
1
5
@brianltrippe
Brian L Trippe
5 years
Also check out our other ICML paper (led by Raj Agrawal) on "The Kernel Interaction Trick", which allows us to use Bayesian inference to efficiently identify pairwise interactions between covariates in regression models! 5/5
0
0
4
@brianltrippe
Brian L Trippe
3 years
You might wonder: wait, what about RISK, i.e. the loss averaged over all possible realizable datasets? Well, it turns out we can construct T1 and T2 where (A) T2 has smaller risk than T1 but (B) T2 incurs larger loss than T1 on *the majority of datasets*.
Tweet media one
1
0
4
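Here is a toy construction (mine, not necessarily the paper's) showing how this can happen with squared-error loss: T2 pays a small penalty on most datasets in exchange for a large saving on the rare datasets where |Y| is large.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
Y = rng.normal(theta, 1.0, size=1_000_000)    # one simulated "dataset" per entry

T1 = Y                                        # baseline estimator (the MLE)
T2 = np.where(np.abs(Y) > 1.5, 0.0, Y + 0.3 * np.sign(Y))

loss1 = (T1 - theta) ** 2
loss2 = (T2 - theta) ** 2
print("risk of T1, T2:", loss1.mean(), loss2.mean())         # ~1.00 vs ~0.88
print("P(T2 loses more than T1):", np.mean(loss2 > loss1))   # ~0.87
```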
@brianltrippe
Brian L Trippe
1 year
@_aliaabbas @ta_broderick @BarzilayRegina Yep (for github at least)! We're slowly working on it :)
0
0
1
@brianltrippe
Brian L Trippe
3 years
Basically, the loss averaged over all possible but unrealized datasets may not be close to the loss incurred on the single observed dataset. c-values try to answer a different question: which estimator works better on *my observed dataset*?
1
0
2
@brianltrippe
Brian L Trippe
3 years
Enter the c-value! It quantifies how confident we are that T2(Y) achieves smaller loss than T1(Y). Informally, a high c-value reassures us that using the comparatively more complicated T2(Y) results in smaller loss than using T1(Y). (Thm 2.2).
1
0
1
@brianltrippe
Brian L Trippe
3 years
We show c-values are useful for evaluating Bayesian estimates in a range of applications including hierarchical models of educational testing data at different schools, shrinkage estimates that utilize auxiliary datasets, and selecting between different GP kernels.
1
0
1
@brianltrippe
Brian L Trippe
3 years
Say you want to estimate an unknown parameter T* based on data Y. You have two potential estimates T1(Y) and T2(Y). T2 might be complicated (e.g. from a hierarchical model) and T1 is a more common baseline (e.g. an MLE). When is it safe to abandon T1(Y) in favor of T2(Y)?
1
0
1
@brianltrippe
Brian L Trippe
3 years
Ever used a hierarchical model to estimate a parameter and wondered if you’re actually doing better than a simpler baseline like maximum likelihood? We present a method addressing this question!
1
1
1
@brianltrippe
Brian L Trippe
5 years
Unlike variational Bayes, this approximation is conservative; LR-GLM doesn't underestimate uncertainty. We also show how increasing computational budget increases the information extracted from data about the model. 3/5
1
0
1
@brianltrippe
Brian L Trippe
2 years
@ml4proteins @erika_alden_d @EliWeinstein6 Looking forward to seeing you there! Zoom link: password: ml4prot
0
1
1
@brianltrippe
Brian L Trippe
3 years
We can even use the c-value to choose between T1 and T2: if the c-value exceeds a confidence level alpha, use T2. We show that we rarely incur higher loss as a result of the data-based selection than if we had just stuck with the default T1 (Thm 2.3).
1
0
1
@brianltrippe
Brian L Trippe
3 years
To compute c-values, we construct lower bounds on the difference in loss L(T*, T1(Y)) - L(T*, T2(Y)) that hold uniformly with probability alpha. We demonstrate our construction with many examples including empirical Bayes shrinkage, Gaussian processes, and logistic regression.
1
0
1
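One way to write down the construction described here, with notation paraphrased from this thread (the paper's exact definitions may differ): choose a data-dependent bound b(y, alpha) satisfying

\[ \Pr\!\left( L(T^*, T_1(Y)) - L(T^*, T_2(Y)) \ge b(Y, \alpha) \right) \ge \alpha \quad \text{for every } T^*, \]

and report the c-value as the largest level at which the bound still certifies an improvement,

\[ c(y) = \sup\{\alpha \in [0, 1] : b(y, \alpha) \ge 0\}. \]

Under this reading, a c-value near 1 says the observed dataset itself supports preferring T2 over T1 at a high confidence level.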
@brianltrippe
Brian L Trippe
3 years
One idea: Use a loss function L (e.g. squared error) and report the estimate with lower loss! I.e. use the more complicated T2 iff L(T*, T2(Y)) < L(T*, T1(Y)). Problem: T* is unknown, so we can't operationalize this process.
1
0
1
@brianltrippe
Brian L Trippe
5 years
Additionally, we provide theoretical guarantees on approximation quality, with non-asymptotic bounds on approximation error of posterior means and uncertainties. 4/5
1
0
1
@brianltrippe
Brian L Trippe
5 years
With LR-GLM, we make inference with Laplace approximations and MCMC faster by up to a full factor of the dimensionality. The rank of the approximation defines a trade-off between the computational demands and accuracy of the approximation. 2/5
1
0
1
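A minimal sketch of the low-rank idea as described in this thread: approximate the design matrix by its projection onto the top-r right singular directions, so posterior computations involve r-by-r rather than d-by-d linear algebra. For simplicity the toy below uses a conjugate Gaussian model; the actual LR-GLM method targets non-conjugate GLM likelihoods (via a Laplace approximation or MCMC) and keeps the full-dimensional prior, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 500, 2000, 50
X = rng.standard_normal((n, d))
beta_true = rng.standard_normal(d)
y = X @ beta_true + rng.normal(0, 1.0, size=n)

# Top-r right singular directions of X define the low-rank approximation of the data.
U = np.linalg.svd(X, full_matrices=False)[2][:r].T       # d x r
Z = X @ U                                                 # n x r projected design

# Conjugate Bayesian linear regression in the r-dimensional projected space
# (prior N(0, I) on the projected coefficients, unit noise variance).
A = Z.T @ Z + np.eye(r)                                   # r x r system instead of d x d
mean_proj = np.linalg.solve(A, Z.T @ y)
beta_post_mean = U @ mean_proj                            # map back to d dimensions

print("posterior solve involves", A.shape, "instead of", (d, d))
```

Increasing r moves along the trade-off mentioned above: more computation, but an approximation closer to the exact posterior.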
@brianltrippe
Brian L Trippe
3 years
Finally, constructing the uniform bounds on L(T*, T1(Y)) - L(T*, T2(Y)) is challenging & sometimes we need approximations. Fortunately, @jhhhuggins helped guide us towards some non-asymptotic bounds for Gaussian models (Thm 4.1), and an extension to logistic regression (Sec 5.2)!
Tweet media one
1
0
1