Allan Zhou

@AllanZhou17

Followers: 1,221
Following: 470
Media: 36
Statuses: 200

AI PhD student @Stanford.

Menlo Park, CA, USA
Joined March 2022
Pinned Tweet
@AllanZhou17
Allan Zhou
3 months
🧵: How do you design a network that can optimize (edit, transform, ...) the weights of another neural network? Our latest answer to that question: *Universal* Neural Functionals (UNFs) that can process the weights of *any* deep architecture.
Tweet media one
5
116
805
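A note on why this is nontrivial: a network's hidden neurons can be permuted without changing the function it computes, so anything that consumes weights has to respect those symmetries. A minimal numpy check of the symmetry for a one-hidden-layer MLP (all shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
W2 = rng.normal(size=(4, 16))
P = np.eye(16)[rng.permutation(16)]   # random permutation of the hidden units
x = rng.normal(size=8)
relu = lambda z: np.maximum(z, 0)

# Permuting the rows of W1 (and b1) along with the matching columns of W2
# leaves the network's output unchanged:
f = W2 @ relu(W1 @ x + b1)
f_permuted = (W2 @ P.T) @ relu(P @ W1 @ x + P @ b1)
assert np.allclose(f, f_permuted)
```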
@AllanZhou17
Allan Zhou
1 year
My reaction when I hear NLP folk talking a lot about RL recently.
Tweet media one
4
73
836
@AllanZhou17
Allan Zhou
1 year
How can we design architectures that can process or transform neural networks? We introduce a framework of *Neural Functionals* that process NN weights while respecting their permutation symmetries. Paper: 1/9
3
79
384
@AllanZhou17
Allan Zhou
4 months
academics advertising their latest finetuning objective to anime pfps on twitter
Tweet media one
6
21
314
@AllanZhou17
Allan Zhou
1 year
"This just proves that tests are only about memorization, not reasoning or understanding."
Tweet media one
0
3
59
@AllanZhou17
Allan Zhou
6 months
I'll be presenting this work at NeurIPS next week, as well as NFTs (not that kind--). DM me if you're around and want to chat.
@AllanZhou17
Allan Zhou
1 year
How can we design architectures that can process or transform neural networks? We introduce a framework of *Neural Functionals* that process NN weights while respecting their permutation symmetries. Paper: 1/9
3
79
384
0
8
57
@AllanZhou17
Allan Zhou
3 months
Paper: Most of this work was done at Google DeepMind, with amazing mentorship by @chelseabfinn and @jmes_harrison . Also, thanks to the TPU Research Cloud (TRC) for providing additional compute!
1
2
55
@AllanZhou17
Allan Zhou
4 months
After the recent plagiarism-mania, I've decided to err on the side of caution #ICML2024
Tweet media one
Tweet media two
1
1
44
@AllanZhou17
Allan Zhou
4 months
Coming to #ICLR2024 : what's the best way to extract information from implicit neural representations (INRs)? We show that tri-planes alone are enough to effectively classify or segment 3D objects in neural fields. 🧵
Tweet media one
1
6
41
@AllanZhou17
Allan Zhou
1 year
so when are we getting the _real_ gpt4?
Tweet media one
1
1
35
@AllanZhou17
Allan Zhou
17 days
If the world were truly bitter, NAS/AutoML/etc would have worked. But the best optimizers/architectures/hparams are still found by humans, not meta-methods that leverage compute. Sad.
4
0
35
@AllanZhou17
Allan Zhou
3 months
Check out our library to build your own UNFs! It can actually build permutation equivariant models for *any* collection of tensors, weights or not. You describe the permutation symmetries, and it gives you equivariant layers:
1
2
30
@AllanZhou17
Allan Zhou
19 days
ilya knows that it's time for weight-space architectures to shine :>
Tweet media one
1
3
31
@AllanZhou17
Allan Zhou
6 months
Tweet media one
0
0
27
@AllanZhou17
Allan Zhou
6 months
What openAI needs now is a strong, ethical leader who won't get embroiled in any controversies
Tweet media one
@OpenAI
OpenAI
6 months
OpenAI announces leadership transition
4K
4K
14K
0
3
26
@AllanZhou17
Allan Zhou
3 months
NN weight spaces are riddled with permutation symmetries, which we have to account for. The original approaches (DWS&NFN) could only handle feedforward symmetries, while UNFs can handle the symmetries of any arch (e.g., RNNs & Transformers).
1
1
24
@AllanZhou17
Allan Zhou
6 months
@archit_sharma97 @ericmitchellai 1600-1970: share scientific results in letters and private correspondence. 1970-2022: journal pubs and "peer review" 🤢. 2022+: post scientific results directly to X, alongside memes, shitposts, and flamewars
2
4
25
@AllanZhou17
Allan Zhou
3 months
In the paper, we use UNFs to create "architecture-aware" learned optimizers--optimizers that "know" the symmetry structure of the weight space they optimize:
Tweet media one
1
0
19
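As a caricature of a learned optimizer (not the paper's method, and deliberately not symmetry-aware): a small network f_phi maps per-parameter features to updates; the UNF contribution is making f_phi respect the permutation symmetries of the weight space it optimizes. A hypothetical minimal sketch:

```python
import numpy as np

def learned_opt_step(theta, grad, phi, lr=1.0):
    # phi = (V1, V2): weights of a tiny per-parameter MLP (hypothetical,
    # shapes (3, hidden) and (hidden, 1)).
    # Per-parameter features: the gradient, its sign, and its magnitude.
    feats = np.stack([grad, np.sign(grad), np.abs(grad)], axis=-1)
    h = np.tanh(feats @ phi[0])       # (..., hidden)
    update = (h @ phi[1])[..., 0]     # one scalar update per parameter
    return theta - lr * update
```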
@AllanZhou17
Allan Zhou
5 months
my feed right now. the two kinds of AI researchers?
Tweet media one
Tweet media two
1
0
16
@AllanZhou17
Allan Zhou
6 months
@natolambert here's a quick and messy writeup
Tweet media one
2
0
16
@AllanZhou17
Allan Zhou
1 year
Our neural functional networks (NFNs) operate on weight space features, just like a CNN operates on spatial features. NFNs consist of NF-Layers that are equivariant to the permutation symmetries of NN weight spaces, analogous to translation equivariance in conv layers. 2/9
Tweet media one
1
0
15
@AllanZhou17
Allan Zhou
1 year
We’re really excited about applications of neural functionals to INRs, learned optimization, and pruning. In a similar direction, check out recent work by @avivnav characterizing equivariant layers for MLPs and their expressivity in the HNP setting: 8/9
@HaggaiMaron
Haggai Maron
1 year
(1/10) New paper! A deep architecture for processing (weights of) other neural networks while preserving equivariance to their permutation symmetries. Learning in deep weight spaces has a wide potential: from NeRFs to INRs; from adaptation to pruning 👇
Tweet media one
8
130
742
1
0
15
@AllanZhou17
Allan Zhou
3 months
I really like the idea of training on datasets with an older cutoff date and testing on more recent discoveries to test reasoning (and weak-to-strong generalization). Like back-testing for AGI.
@martinmbauer
Martin Bauer
3 months
Has anyone tried to train AI with only 19th century sources and see whether it can discover special relativity?
91
184
3K
0
0
15
@AllanZhou17
Allan Zhou
1 year
NF-Layers are linear layers whose input and output are weight space features, and they satisfy NN permutation equivariance through parameter sharing. (Weight space features typically have multiple channels, but we show the 1-channel case here for simplicity.) 3/9
Tweet media one
1
0
13
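To make "equivariance through parameter sharing" concrete, here is a minimal sketch for the simplest case: a single weight matrix whose rows and columns may each be permuted (the paper's NF-Layers handle the full chained symmetry across all of a network's layers; this only shows the flavor). Just four scalars are learned:

```python
import numpy as np

def equivariant_linear(W, a, b, c, d):
    # Each term transforms the same way W does under row/column permutations,
    # so the whole linear map is permutation equivariant.
    row_mean = W.mean(axis=1, keepdims=True)   # invariant to column perms
    col_mean = W.mean(axis=0, keepdims=True)   # invariant to row perms
    return a * W + b * row_mean + c * col_mean + d * W.mean()

# Equivariance check: permuting the input permutes the output identically.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 7))
P = np.eye(5)[rng.permutation(5)]
Q = np.eye(7)[rng.permutation(7)]
out = equivariant_linear(W, 0.3, 0.5, 0.7, 0.9)
out_perm = equivariant_linear(P @ W @ Q.T, 0.3, 0.5, 0.7, 0.9)
assert np.allclose(P @ out @ Q.T, out_perm)
```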
@AllanZhou17
Allan Zhou
4 months
Some cool recent publications on equivariant GNNs for processing weights/gradients/etc. Roughly, weights are just edges between neurons, so message passing NNs can process and update the edge features for you...
1
0
14
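One way to see the correspondence (illustrative numpy, not any specific paper's layer): treat neurons as nodes and weights as edge features, and a message-passing round aggregates each neuron's incident weights.

```python
import numpy as np

def hidden_node_features(W1, W2):
    # In a 2-layer MLP, hidden neuron k is incident to row k of W1
    # (incoming edges) and column k of W2 (outgoing edges). Mean-aggregating
    # over incident edges is exactly the kind of pooling that shows up in
    # permutation-equivariant weight-space layers.
    incoming = W1.mean(axis=1)   # (hidden,)
    outgoing = W2.mean(axis=0)   # (hidden,)
    return np.stack([incoming, outgoing], axis=-1)
```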
@AllanZhou17
Allan Zhou
5 months
"TF-datasets isn't worth the trouble, I'll write my own dataloaders. It's just some jsonl files." > 2 wks of multiprocessing bugs later:
Tweet media one
0
1
14
@AllanZhou17
Allan Zhou
1 year
Yeah, we work on AGI (Amphibious Gaze Improvement).
@tonyzzhao
Tony Z. Zhao
1 year
With the advent of AGI, humans will soon be the weakest link in software industry. How can we have better coding buddies that *enhance* humans? Introducing 𝐁ug 𝐀nalysis and 𝐈dentification with enhanced 𝐓oads (BAIT), where we fit toads with contact lenses to better catch bugs
11
37
290
0
0
14
@AllanZhou17
Allan Zhou
6 months
@archit_sharma97 @ericmitchellai we're exiting the dark ages, science is so back
1
1
13
@AllanZhou17
Allan Zhou
1 year
We also train NFNs to edit INR weights to produce visual changes, like image dilation: 6/9
Tweet media one
1
0
9
@AllanZhou17
Allan Zhou
1 year
We define two NF-Layer variants (NP and HNP) with different levels of parameter sharing. The NP variant is especially efficient: if you’re processing an L-layer NN, then it uses just O(L^2) parameters, regardless of the hidden widths or input/output sizes! 4/9
Tweet media one
1
0
9
@AllanZhou17
Allan Zhou
1 year
We also find that on a generalization prediction benchmark, NFNs improve upon previous approaches for predicting the test accuracy of CNN classifiers: 7/9
Tweet media one
1
0
8
@AllanZhou17
Allan Zhou
1 year
NFNs let us treat Implicit Neural Representations (INRs) as datasets, with the weights of each INR as a single data point. For example, NFNs can do “image classification” by classifying the weights of INRs trained on each image. 5/9
Tweet media one
1
0
7
@AllanZhou17
Allan Zhou
6 months
@natolambert If the derivation of Eq 4 is a bit esoteric, there's always the more direct (but tedious) approach: form the Lagrangian L(π)=E[r(y)]-βKL(π||π_ref)+λ(∑π(y)-1) and set dL/dπ(y)=0, solve.
1
0
7
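Filling in that derivation (as the replies further down note, the objective separates over x, so fix x and write π(y); Z is the normalizer):

```latex
0 = \frac{\partial L}{\partial \pi(y)}
  = r(y) - \beta\left(\log\tfrac{\pi(y)}{\pi_{\mathrm{ref}}(y)} + 1\right) + \lambda
\;\Longrightarrow\;
\pi^*(y) = \tfrac{1}{Z}\,\pi_{\mathrm{ref}}(y)\,e^{r(y)/\beta},
\qquad Z = \textstyle\sum_y \pi_{\mathrm{ref}}(y)\,e^{r(y)/\beta}
```

with the multiplier coming out to λ = β(1 − log Z), matching the correction later in the thread.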
@AllanZhou17
Allan Zhou
5 months
@davikrehalt @abacaj these days when I see a twitter post showing examples of failed LLM reasoning, I get the answer wrong myself ~50% of the time
2
1
6
@AllanZhou17
Allan Zhou
18 days
anyone who's ever watched a romcom knows this ends with the AI matchmaker becoming the love interest
@tsarnick
Tsarathustra
19 days
Bumble founder Whitney Wolfe Herd says the future of dating is having your AI date other people's AI and recommend the best matches for you to meet
2K
586
4K
0
0
7
@AllanZhou17
Allan Zhou
4 months
@FSchaipp I didn't know this until reading . Decoupling wd from alpha seems helpful for LR stability
Tweet media one
0
0
6
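My reading of the distinction (a sketch with simplified Adam state and hypothetical function names; m_hat, v_hat are the bias-corrected moments, assumed given): AdamW as usually implemented still scales the decay by the learning rate alpha, while a fully decoupled variant applies it separately, so an LR sweep doesn't silently change the effective decay strength.

```python
def adamw_step(theta, m_hat, v_hat, alpha, lam, eps=1e-8):
    # Standard AdamW: weight decay is multiplied by the learning rate.
    return theta - alpha * (m_hat / (v_hat**0.5 + eps) + lam * theta)

def fully_decoupled_step(theta, m_hat, v_hat, alpha, wd, eps=1e-8):
    # Decay applied independently of alpha: changing the LR leaves the
    # per-step shrinkage wd * theta untouched.
    return theta - alpha * m_hat / (v_hat**0.5 + eps) - wd * theta
```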
@AllanZhou17
Allan Zhou
2 months
@sp_monte_carlo Looking at such diagrams is very useful for figuring out the permutation symmetries of the NN's weight space 😛
0
0
4
@AllanZhou17
Allan Zhou
3 months
Copilot still doesn't know basic Equinox APIs (it tries to use nn.Dense instead of nn.Linear). The library isn't that new; I guess Copilot just struggles with lesser-known libs?
3
0
5
@AllanZhou17
Allan Zhou
5 months
meddy is OOD
@AllanZhou17
Allan Zhou
5 months
@OwainEvans_UK cool idea! the failures can be pretty egregious
Tweet media one
0
0
0
0
0
4
@AllanZhou17
Allan Zhou
6 months
@rm_rafailov @EugeneVinitsky Yeah, DPO was the death blow
1
0
4
@AllanZhou17
Allan Zhou
6 months
Blondie laughs in the face of your "prediction markets"
Tweet media one
@TIME
TIME
6 months
Taylor Swift ( @taylorswift13 ) is TIME's 2023 Person of the Year
Tweet media one
5K
27K
94K
0
0
4
@AllanZhou17
Allan Zhou
6 months
@nabeelqu i've used anki for many years, but for technical content many things do require a type of practice that isn't amenable to flash card form (e.g., actually writing out code or solving practice problems in math).
2
0
4
@AllanZhou17
Allan Zhou
4 months
@yacineMTB this is why everyone should just annotate shapes in code
Tweet media one
1
0
4
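For instance, comment-style shape annotations (illustrative numpy):

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (batch, heads, seq, d_head)
    scores = np.einsum('bhid,bhjd->bhij', q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # (batch, heads, seq, seq)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # rows sum to 1
    return np.einsum('bhij,bhjd->bhid', weights, v)   # (batch, heads, seq, d_head)
```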
@AllanZhou17
Allan Zhou
5 months
IIUC, actual Einstein notation should only express coordinate-free calculations. But np.einsum can compute the diagonal, the Hadamard product, etc. Is there a precise description of what einsum can/can't do?
2
0
2
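Concretely, the operations in question; the first two go beyond strict summation-convention semantics because a repeated index survives into the output:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
B = np.ones((3, 3))

diag = np.einsum('ii->i', A)          # diagonal: repeated index kept, not summed
had  = np.einsum('ij,ij->ij', A, B)   # Hadamard (elementwise) product
tr   = np.einsum('ii', A)             # trace: a classical Einstein contraction
```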
@AllanZhou17
Allan Zhou
4 months
Processing tri-planes works a lot better than processing raw INR weights, with basically no negative impact on the original INR training process.
Tweet media one
Tweet media two
1
0
1
@AllanZhou17
Allan Zhou
6 months
@natolambert Ah, it's a sum over x weighted by p(x)>=0, so we just want to maximize each term independently w.r.t. π[·|x]. It's purely to simplify notation--we can calculate dL/dπ[y,x] (thinking of π as a matrix instead of a vector), it's just messier.
1
0
3
@AllanZhou17
Allan Zhou
6 months
Are the export controls actually doing anything? Seems like China is at the frontier for (open) pretraining.
1
0
3
@AllanZhou17
Allan Zhou
17 days
@EugeneVinitsky Yeah, I go back and forth on whether we need to (1) find a better design for the search space in these meta-methods, or (2) just wait 30 years for more compute
1
0
3
@AllanZhou17
Allan Zhou
3 months
@PandaAshwinee @JacquesThibs See Sec 3.3 (lottery tix) in this paper. TLDR: it worked on small networks, but was learning a very simple magnitude pruning rule. Probably need to scale up to see more interesting behavior, but the data is expensive to generate.
1
1
3
@AllanZhou17
Allan Zhou
4 months
And as @dereklim_lzh pointed out to me, the NFN layer equation is basically doing message passing operations. Yet I didn't see the resemblance earlier 😅
0
0
3
@AllanZhou17
Allan Zhou
5 months
@josephreisinger At least we've been humbled a bit since 2016 😂
Tweet media one
0
0
2
@AllanZhou17
Allan Zhou
5 months
@sp_monte_carlo are there any good (casual) intros to RG methods for a more stats/ML audience?
2
1
2
@AllanZhou17
Allan Zhou
6 months
@nabeelqu and for practice problems, sure you can add them as cards but it's boring to see the same problem over and over. maybe soon we can have AI automatically rewrite variants of a problem for each next review.
0
0
3
@AllanZhou17
Allan Zhou
7 months
Alcubierre drive is next
@rm_rafailov
Rafael Rafailov
7 months
DPO is now used for production-quality models!
0
2
7
0
0
3
@AllanZhou17
Allan Zhou
2 months
@gallabytes Orbax checkpointing (and loading) usually works for me, is there a hidden flaw?
1
0
3
@AllanZhou17
Allan Zhou
3 months
@JacquesThibs Maybe! Haven't followed interpretability lit too closely. We've tried to predict useful "subnetworks" (as in lottery tix) in the past, want to explore that more. Could also learn correlations btw neuron activations, which might be relevant.
1
0
3
@AllanZhou17
Allan Zhou
6 months
@g_k_swamy @wgussml What makes RM+PPO more interactive? One could also train DPO iteratively, no (by successively replacing the ref model)?
1
0
2
@AllanZhou17
Allan Zhou
4 months
@davikrehalt @QuanquanGu Bard actually got the answer wrong on purpose to demonstrate that it's not memorizing.
0
0
2
@AllanZhou17
Allan Zhou
6 months
@sp_monte_carlo I suspect most of the benefit of CNNs was computational: weight sharing + small 3x3 filters can efficiently process huge 512-chan feature maps, compared to MLPs.
1
0
2
@AllanZhou17
Allan Zhou
5 months
@sp_monte_carlo darn, I struggle when reading physics books :(
1
0
2
@AllanZhou17
Allan Zhou
5 months
@g_k_swamy @archit_sharma97 I like how Ng himself did some IRL work back in the day, but doesn't really bring it up as far as I've seen
0
0
2
@AllanZhou17
Allan Zhou
1 year
@ericjang11 @ylecun I haven't been able to get self-critique to work on this problem *unless* the original prompt contains @stanislavfort 's "LeCun trick," which imo provides a strong hint that it is a trick question.
Tweet media one
1
0
2
@AllanZhou17
Allan Zhou
5 months
@robinhanson Though I'm inclined to believe the result, the latter half of the abstract makes so many unsubstantiated prescriptions it's hard to take the paper seriously.
0
0
1
@AllanZhou17
Allan Zhou
17 days
@shxf0072 @polynoamial yes, alphago was a much better demonstration of the bitter lesson than modern LLMs are imo
0
0
2
@AllanZhou17
Allan Zhou
1 year
can't be sure this is fake but it's almost too convenient for a certain narrative right now...
1
0
2
@AllanZhou17
Allan Zhou
10 months
honestly openai should just own its hallucinations, who cares?
Tweet media one
@kevinschawinski
Kevin Schawinski
11 months
The @FTC is investigating @OpenAI and the document outlining their questions is fascinating. 🧵Some highlights:
75
648
2K
0
0
2
@AllanZhou17
Allan Zhou
6 months
@natolambert To add, there are no constraints linking π[·|x] for different values of x, so the problem separates over x, with one term p(x)·(∑_y [...]) per x.
0
0
2
@AllanZhou17
Allan Zhou
5 months
@nic_kup @bgavran3 Maybe related but you can use statphys methods to study double descent in linear models (and maybe some kernel methods). E.g.,
1
0
2
@AllanZhou17
Allan Zhou
17 days
@aryaman2020 FLOPs go up 🙏
Tweet media one
0
0
2
@AllanZhou17
Allan Zhou
6 months
@natolambert Oops, a bit too messy. Should be λ=β(1-log Z). Also L(π,λ).
0
0
2
@AllanZhou17
Allan Zhou
5 months
@stammertescu @sp_monte_carlo this looks promising, thanks!
0
0
1
@AllanZhou17
Allan Zhou
3 months
@ThomasW423 @chelseabfinn Similar! I tend to think of hnets as generating weights, while unfs process them. Permutation symmetries become really important for the latter
1
0
2
@AllanZhou17
Allan Zhou
1 year
@NVIDIAAIDev @nvidia InstantNGP is an awesome tool, and was a huge time saver for this project!
0
0
2
@AllanZhou17
Allan Zhou
4 months
global AI talent tracker
Tweet media one
0
0
1
@AllanZhou17
Allan Zhou
3 months
@rtaori13 Oh, and EasyLM. They have simple, lightweight data loading for jsonl files
0
0
1
@AllanZhou17
Allan Zhou
8 months
@AlbertQJiang @davikrehalt @Yuhu_ai_ @jimmybajimmyba Naive q: is there a particular reason to view a tactic as an action, rather than a token? In code generation, I've seen papers adopt the latter. Perhaps it doesn't make a difference in practice?
1
0
0
@AllanZhou17
Allan Zhou
6 months
@YingXiao @finbarrtimbers Equinox uses the pytrees abstraction most cleanly imo. But even there pytrees can be very annoying, e.g. when we just want to modify a property of a single layer.
1
0
1
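The usual workaround for that single-layer edit is eqx.tree_at, though it still feels indirect. A minimal sketch (MLP hyperparameters here are arbitrary):

```python
import equinox as eqx
import jax
import jax.numpy as jnp

model = eqx.nn.MLP(in_size=8, out_size=4, width_size=16, depth=2,
                   key=jax.random.PRNGKey(0))
# Out-of-place replacement of a single leaf of the model pytree:
new_w = jnp.zeros_like(model.layers[0].weight)
model = eqx.tree_at(lambda m: m.layers[0].weight, model, new_w)
```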
@AllanZhou17
Allan Zhou
3 months
@urusualskeptic Ah, good idea. But I haven't gotten any variant of that to work yet either. E.g.: "import equinox.nn as eqx_nn".
1
0
1
@AllanZhou17
Allan Zhou
5 months
@stammertescu @sp_monte_carlo Yeah, I prob just lack bg knowledge for most treatments. E.g., there's pretty good stuff on the replica trick for an ML audience. But not for RG for some reason.
1
0
1
@AllanZhou17
Allan Zhou
5 months
I'm really curious if we've benchmarked *median* human performance on all these LLM reasoning tasks. Not academics--median humans. Something like , which was for vision.
@davikrehalt
Andy Jiang
5 months
@abacaj can you check human baseline performance for these questions lol
2
0
31
0
0
1
@AllanZhou17
Allan Zhou
4 months
Weird that the first papers on equivariant weight-space architectures (DWS and NFNs) didn't focus on the graph perspective, even though everyone typically visualizes an MLP as a graph.
1
0
1
@AllanZhou17
Allan Zhou
3 months
@rtaori13 maxtext or levanter? I would shill midGPT too, but our open code only supports OWT (same as nanoGPT)
1
0
1
@AllanZhou17
Allan Zhou
4 months
@mallocmyheart a rare book where the student should study and memorize the cover
1
0
1
@AllanZhou17
Allan Zhou
5 months
@davis_yoshida @DataSciFact I sometimes take for granted how much easier training is now. In 2010, it was probably pretty difficult to train deep networks stably.
0
0
1
@AllanZhou17
Allan Zhou
6 months
@sp_monte_carlo Relatedly, locally connected layers (no equivariance, but sparse filters) can sometimes do roughly as well as CNNs. But no weight sharing -> uses more memory.
0
0
1
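The memory gap is easy to quantify (back-of-envelope, hypothetical layer sizes):

```python
# 3x3 filters, 512 -> 512 channels, on a 32x32 feature map:
k, cin, cout, h, w = 3, 512, 512, 32, 32
conv_params  = k * k * cin * cout            # shared filter: ~2.4M params
local_params = h * w * k * k * cin * cout    # one filter per location: ~2.4B
```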
@AllanZhou17
Allan Zhou
3 months
@SamuelAinsworth @EpisodeYang I'm a bit OOTL, does 3DGS handle dynamic scenes/moving objects now? Otherwise, seems like most of the artifacts would just be from motion.
1
0
0
@AllanZhou17
Allan Zhou
5 months
@deliprao @mraginsky Yeah heres a link to some good introductory material /s
0
0
1
@AllanZhou17
Allan Zhou
3 months
@ThomasW423 @chelseabfinn Good point, in the context of editing/optimizing, we can say UNFs are a special type of hnet. But UNFs (and DWS/NFNs) can also extract info: give it the weights of a 3D INR and ask it to classify the 3D object they encode.
0
1
1
@AllanZhou17
Allan Zhou
6 months
@alexfmckinney @__kolesnikov__ @xhluca Thanks! I think I used this once briefly but couldn't understand the output, I'll try looking more carefully this time. Another indirect method is to profile and look at the XLA ops in the TB trace viewer, though it's also a bit confusing.
1
0
1
@AllanZhou17
Allan Zhou
6 months
@__kolesnikov__ @alexfmckinney @xhluca How do we check what the compiler is actually doing under the hood? Jax makes it easy to shard parameters+data going into the computation, but it's not clear how to see what's actually happening after JIT.
2
0
1
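One option I believe works for this (JAX's ahead-of-time API; treat as a sketch): lower and compile explicitly, then dump the post-optimization HLO to see what XLA actually decided.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x @ x.T)

x = jnp.ones((128, 128))
lowered = jax.jit(f).lower(x)   # StableHLO, before XLA optimizations
compiled = lowered.compile()    # what actually runs on the device
print(compiled.as_text())       # optimized HLO, incl. fusions/sharding
```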
@AllanZhou17
Allan Zhou
6 months
@jxbz @TheGregYang Cool work! Your slides mention that for depth scaling we want ||dW||/||W||~(1/L), does muP achieve that? Or is there a parameterization that does?
1
0
1
@AllanZhou17
Allan Zhou
5 months
@YananLong @BlancheMinerva TPU v3-8s, typically, I think. It can be hard to launch one on-demand, but preemptibles are fairly available.
1
0
1
@AllanZhou17
Allan Zhou
16 days
@akbirthko awesome, where is this from?
1
0
1
@AllanZhou17
Allan Zhou
16 days
@ericmitchellai @akbirthko not sure but when considering methods like search (rather than learning), I think compute is the right variable
1
0
1